Analyzed a corpus of 30 .txt files and, using Python and NLTK, retrieved the document most relevant to a given query.
The presidential_debates folder contains a collection of general election presidential debates from 1960 to 2012. Each of the 30 files contains the transcript of one debate and is named by the date of that debate.
- Read the 30 .txt files.
- Tokenized the contents of the files.
- Performed stopword removal on the obtained tokens.
- Performed stemming on the obtained tokens.
- Computed the TF-IDF vector for each document.
- Given a query string, calculated the query vector.
- Constructed a posting list of (document, TF-IDF weight) pairs for each token in the corpus.
- Returned the document with the highest cosine similarity to the query vector, using the posting lists to compute the scores.
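
The steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the project's actual code. Several things here are assumptions: the regex tokenizer, the tiny `STOPWORDS` set, and the trailing-"s" stripping are stand-ins for NLTK's tokenizer, stopword list, and stemmer; the weighting scheme (log-scaled term frequency times `log10(N/df)`, length-normalized) is one common choice and may differ from the one used here; and the function names, file names, and toy texts are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict
from pathlib import Path

# Assumption: a tiny illustrative subset standing in for NLTK's stopword list.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "on"}


def load_corpus(folder):
    """Read every .txt file in a folder (UTF-8 encoding is an assumption)."""
    return {p.name: p.read_text(encoding="utf-8")
            for p in Path(folder).glob("*.txt")}


def preprocess(text):
    """Lowercase, tokenize on letter runs, drop stopwords, crudely stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Crude stand-in for a real stemmer: strip a trailing "s" from longer words.
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t
            for t in tokens if t not in STOPWORDS]


def build_postings(docs):
    """Map each token to a list of (document, normalized TF-IDF weight)."""
    counts_by_doc = {name: Counter(preprocess(text)) for name, text in docs.items()}
    n_docs = len(counts_by_doc)
    # Document frequency: number of documents containing each token.
    df = Counter(t for counts in counts_by_doc.values() for t in counts)
    postings = defaultdict(list)
    for name, counts in counts_by_doc.items():
        weights = {t: (1 + math.log10(tf)) * math.log10(n_docs / df[t])
                   for t, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        for t, w in weights.items():
            if norm > 0 and w > 0:  # skip tokens that carry no weight
                postings[t].append((name, w / norm))
    return postings


def best_match(query, postings):
    """Return (document, cosine score) for the best match, or (None, 0.0)."""
    counts = Counter(preprocess(query))
    weights = {t: 1 + math.log10(tf) for t, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    scores = defaultdict(float)
    for t, qw in weights.items():
        # Only documents in this token's posting list can gain score.
        for name, dw in postings.get(t, []):
            scores[name] += (qw / norm) * dw
    if not scores:
        return None, 0.0
    name = max(scores, key=scores.get)
    return name, scores[name]


# Toy stand-ins for the transcripts; names and texts are hypothetical.
docs = {
    "debate_a.txt": "Taxes and jobs dominate the economy",
    "debate_b.txt": "Health care costs and health insurance",
    "debate_c.txt": "Foreign policy and the economy",
}
postings = build_postings(docs)
print(best_match("health care", postings))
```

Because both the document and query vectors are normalized to unit length, the cosine similarity reduces to a dot product, and the posting lists let the query touch only the documents that share at least one token with it. On the real corpus, `load_corpus("presidential_debates")` would supply `docs` in place of the toy dictionary.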