This is a Python project that performs tokenization, stop word removal, positional indexing, phrase query searching, term frequency-inverse document frequency (TF-IDF) calculation, cosine similarity computation, and document ranking. The project consists of three parts, each of which is described in detail below.
Part 1: Tokenization and Stop Word Removal
This part reads 10 text files and tokenizes each one, splitting it into individual words. Stop words are then removed, with two exceptions: the words "in" and "to" are kept.
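As a rough illustration of this step, the sketch below tokenizes a string and filters stop words while keeping "in" and "to". The stop-word list, the regular expression, and the function names (`tokenize`, `remove_stop_words`, `preprocess`) are assumptions made for the example; the project itself may rely on a library such as NLTK and a fuller list.

```python
import re

# Hypothetical, deliberately small stop-word list (an assumption for this
# sketch); the project may use a much fuller list.
STOP_WORDS = {"a", "an", "and", "are", "as", "at", "be", "by", "for", "in",
              "is", "it", "of", "on", "or", "the", "to", "with"}

# Words that stay in the text even though they appear in the stop-word list.
KEEP = {"in", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Drop stop words, except the ones listed in KEEP."""
    blocked = STOP_WORDS - KEEP
    return [t for t in tokens if t not in blocked]


def preprocess(text):
    """Tokenize a document and remove its stop words."""
    return remove_stop_words(tokenize(text))
```

Calling `preprocess` on the contents of each of the 10 files would give one token list per document, which the later parts can consume.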
Part 2: Positional Indexing and Phrase Query Searching
The second part of the project builds a positional index for the text files and displays each term with the number of documents containing the term, as well as the positions of the term in each document. The system also allows users to search for a phrase in the text using the positional index, and returns the documents that match the query.
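The positional index and phrase search described above could be sketched as follows. This is a minimal version under stated assumptions: documents are already tokenized, positions are 0-based token offsets, and the names `build_positional_index` and `phrase_query` are hypothetical, not taken from the project.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]} for already-tokenized docs."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(pos)
    return dict(index)


def phrase_query(index, phrase_tokens):
    """Return sorted ids of documents containing the tokens consecutively."""
    if not phrase_tokens or any(t not in index for t in phrase_tokens):
        return []
    # Candidate documents must contain every term of the phrase.
    candidates = set(index[phrase_tokens[0]])
    for term in phrase_tokens[1:]:
        candidates &= set(index[term])
    matches = []
    for doc_id in candidates:
        later = [set(index[t][doc_id]) for t in phrase_tokens[1:]]
        # A start position p matches if the i-th following term sits at p + i + 1.
        if any(all(p + i + 1 in positions for i, positions in enumerate(later))
               for p in index[phrase_tokens[0]][doc_id]):
            matches.append(doc_id)
    return sorted(matches)
```

In this layout the number of documents containing a term is `len(index[term])`, and `index[term][doc_id]` lists the term's positions in that document. The query tokens should go through the same preprocessing as the documents so that positions line up.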
Part 3: TF-IDF Calculation, Cosine Similarity Computation, and Document Ranking
The final part of the project computes the term frequency of each term in each document and the inverse document frequency of each term across the collection, and displays the resulting TF-IDF matrix. The system then computes the cosine similarity between the query and each matched document, and ranks the documents by their similarity to the query.
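The ranking step might look like the sketch below. The weighting scheme is an assumption, since the description does not state one: term frequency is weighted as 1 + log10(tf) and inverse document frequency as log10(N / df). The function names (`idf`, `tfidf_vector`, `cosine`, `rank`) and the sparse dictionary representation are likewise hypothetical.

```python
import math
from collections import Counter


def idf(docs):
    """Inverse document frequency per term: log10(N / df) (assumed scheme)."""
    n = len(docs)
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    return {term: math.log10(n / count) for term, count in df.items()}


def tfidf_vector(tokens, idf_values):
    """Sparse TF-IDF vector using the assumed weight (1 + log10(tf)) * idf."""
    counts = Counter(tokens)
    return {t: (1 + math.log10(c)) * idf_values[t]
            for t, c in counts.items() if t in idf_values}


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_tokens, docs, matched_ids=None):
    """Return (doc_id, score) pairs sorted by descending cosine similarity."""
    idf_values = idf(docs)
    query_vec = tfidf_vector(query_tokens, idf_values)
    ids = list(docs) if matched_ids is None else matched_ids
    scores = [(d, cosine(query_vec, tfidf_vector(docs[d], idf_values)))
              for d in ids]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Passing the ids returned by the phrase search as `matched_ids` restricts scoring to the matched documents. Note that under this scheme a term appearing in every document gets an IDF of 0 and so contributes nothing to the score.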