Query-Search-using-TF-IDF-vectors-and-cosine-similarity

Analyzed a corpus of 30 .txt files and retrieved the most relevant document for a given query, using Python and NLTK.

Dataset:

The presidential_debates folder contains a collection of general election presidential debates from 1960 to 2012. Each of the 30 files contains the transcript of one debate and is named by the date of that debate.

Steps:

Read the 30 .txt files.
Tokenized the contents of the files.
Performed stopword removal on the obtained tokens.
Performed stemming on the obtained tokens.
Computed the TF-IDF vector for each document.
Given a query string, calculated the query vector.
Returned the document with the highest cosine similarity score.
Constructed and used a posting list of (document, TF-IDF weight) entries for each token in the corpus.
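
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this repository. The weighting scheme is an assumption (log-scaled term frequency times IDF for documents, log-scaled term frequency for the query, both normalized to unit length), since the README does not state which variant was used. The function names, the regex tokenizer, the tiny stopword set, and the three toy documents with their file names are all hypothetical. NLTK's `PorterStemmer` is used when NLTK is installed; otherwise the sketch falls back to no stemming.

```python
import math
import re
from collections import Counter, defaultdict

try:
    from nltk.stem.porter import PorterStemmer
    stem = PorterStemmer().stem
except ImportError:
    # Fallback so the sketch still runs without NLTK: no stemming.
    def stem(token):
        return token

# Tiny illustrative stopword set (assumption); NLTK's English stopword
# list would normally be used instead.
STOPWORDS = {"the", "and", "on", "at", "were", "a", "of", "in", "to", "is"}


def preprocess(text):
    """Lowercase, tokenize on letters, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]


def build_postings(docs):
    """Map each token to a list of (document name, normalized TF-IDF weight)."""
    n_docs = len(docs)
    term_counts = {name: Counter(preprocess(text)) for name, text in docs.items()}

    # Document frequency: number of documents containing each token.
    df = Counter()
    for counts in term_counts.values():
        df.update(counts.keys())

    postings = defaultdict(list)
    for name, counts in term_counts.items():
        weights = {
            t: (1 + math.log10(tf)) * math.log10(n_docs / df[t])
            for t, tf in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        for t, w in weights.items():
            if norm > 0 and w > 0:
                postings[t].append((name, w / norm))
    return postings


def best_match(postings, query_text):
    """Return (document name, cosine similarity) for the best-scoring document."""
    counts = Counter(preprocess(query_text))
    weights = {t: 1 + math.log10(tf) for t, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))

    # Accumulate dot products by walking only the posting lists of query tokens.
    scores = defaultdict(float)
    for t, w in weights.items():
        for name, doc_weight in postings.get(t, []):
            scores[name] += (w / norm) * doc_weight

    if not scores:
        return None, 0.0
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy stand-ins for the debate transcripts (hypothetical names and text).
toy_docs = {
    "debate_a.txt": "The economy and jobs and taxes",
    "debate_b.txt": "Foreign policy and nuclear arms",
    "debate_c.txt": "Health care and education funding",
}
toy_postings = build_postings(toy_docs)
print(best_match(toy_postings, "nuclear arms"))
```

Because both vectors are normalized to unit length, the accumulated dot product is the cosine similarity, and the posting list means only documents that share at least one token with the query are ever scored.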
