Big Data and NLP: Inverted Index Database for 19,000 Reuters News Articles

This project implements an inverted index with Apache Spark (PySpark) and stores it in a relational database (SQLite) for 19,000 Reuters news articles. Storing the index in a database lets the project rely on the B-tree data structure that the relational database already provides, instead of building one from scratch.

Natural language processing is applied to clean the text and convert the HTML text files into a TF-IDF inverted index using Python libraries (nltk, re, bs4, collections). Two datasets are given: a real one from Reuters, which contains more than 19,000 documents, and a small sample of 5 documents to help with testing the code.
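The cleaning and indexing step can be sketched in a few lines. This is a minimal illustration, not the project's actual code: it uses only the standard library (a regular expression in place of bs4 for tag stripping, and a tiny hypothetical stop-word set in place of nltk), and it assumes the common weighting tf = term count / document length and idf = log(N / document frequency). The names `tokenize`, `build_tfidf_index`, and `sample_docs` are made up for the example.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list for illustration; the project names nltk for this step.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def tokenize(html):
    """Strip HTML tags, lowercase, and keep alphabetic tokens that are not stop words."""
    text = re.sub(r"<[^>]+>", " ", html)
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_tfidf_index(docs):
    """Return term -> {doc_id: tf-idf}, assuming tf = count / length, idf = log(N / df)."""
    term_counts = {doc_id: Counter(tokenize(html)) for doc_id, html in docs.items()}

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    n_docs = len(docs)
    index = defaultdict(dict)
    for doc_id, counts in term_counts.items():
        total = sum(counts.values())
        for term, count in counts.items():
            idf = math.log(n_docs / doc_freq[term])
            index[term][doc_id] = (count / total) * idf
    return dict(index)


# Two made-up documents standing in for the small sample dataset.
sample_docs = {
    "doc1": "<html><body><p>Oil prices rise in the market</p></body></html>",
    "doc2": "<html><body><p>The market is calm</p></body></html>",
}
index = build_tfidf_index(sample_docs)
```

With this weighting, a term that appears in every document (here `market`) gets a score of zero, while a term unique to one document (here `oil`) gets a positive score.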

Interface for keyword searching and ranking the most relevant results by TF-IDF
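As a rough sketch of how such a search could work against SQLite, the example below stores one row per (term, document) pair and ranks documents by the sum of the TF-IDF scores of the query terms they contain. The table name `postings`, its columns, the helper names, and the scoring rule (summing per-term scores) are assumptions made for illustration; the project's real schema and ranking may differ. The composite primary key gives SQLite an index to look up terms with, which is the benefit described above.

```python
import sqlite3


def create_index_db(postings):
    """Create an in-memory SQLite database holding (term, doc_id, tfidf) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE postings ("
        "term TEXT, doc_id TEXT, tfidf REAL, "
        "PRIMARY KEY (term, doc_id))"  # primary key is backed by an index on term
    )
    conn.executemany("INSERT INTO postings VALUES (?, ?, ?)", postings)
    conn.commit()
    return conn


def search(conn, query, limit=10):
    """Return (doc_id, score) pairs ranked by summed TF-IDF of the matching query terms."""
    terms = query.lower().split()
    if not terms:
        return []
    placeholders = ",".join("?" for _ in terms)
    sql = (
        "SELECT doc_id, SUM(tfidf) AS score FROM postings "
        f"WHERE term IN ({placeholders}) "
        "GROUP BY doc_id ORDER BY score DESC LIMIT ?"
    )
    return conn.execute(sql, (*terms, limit)).fetchall()


# Made-up postings for illustration only.
rows = [
    ("oil", "doc1", 0.17),
    ("prices", "doc1", 0.17),
    ("oil", "doc3", 0.05),
    ("calm", "doc2", 0.35),
]
conn = create_index_db(rows)
results = search(conn, "oil prices")
```

Here `results` lists `doc1` before `doc3`, because `doc1` matches both query terms and accumulates a higher score.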
