GitHub - ayushjain19/Domain_Specific_Search_Engine: A Search Engine in Python for the tech documents scraped from web

Design:

The Domain Specific Search Engine targets searches over documents related to mobile phones. It is organized into three basic parts:

Corpus collection: web_scraper.py scrapes the website https://www.theverge.com/mobile/archives/ and downloads the web pages. The downloaded pages are then traversed and the necessary text is extracted.
Creating the inverted index: create_inverted_index.py handles this task. The corpus is traversed and tf-idf values are stored in hash tables. Data structure: the inverted index is stored as a hash table of hash tables. The outer hash table is keyed by word and the inner hash table by document number. After the data structure is fully formed, it is saved as a .pickle file for later search queries.
Search: The user is asked for a search string. The .pickle file saved above is used to access the inverted index of the corpus, and the top documents are ranked accordingly.
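
The indexing and search steps described above can be sketched roughly as follows. This is only an illustration, not the code in create_inverted_index.py or ask_query.py: the function names (tokenize, build_inverted_index, save_index, load_index, search) are hypothetical, the weighting tf * log(N / df) is one common tf-idf variant chosen as an assumption, and ranking by the sum of the query words' weights is likewise assumed.

```python
import math
import os
import pickle
import re
import tempfile
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Return {word: {doc_id: tf-idf weight}} for a {doc_id: text} corpus."""
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}

    # Document frequency: the number of documents that contain each word.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    n_docs = len(docs)
    index = defaultdict(dict)  # outer table: word, inner table: doc_id
    for doc_id, counts in term_counts.items():
        total = sum(counts.values())
        for word, count in counts.items():
            tf = count / total
            idf = math.log(n_docs / doc_freq[word])  # assumed weighting variant
            index[word][doc_id] = tf * idf
    return dict(index)


def save_index(index, path):
    """Serialize the index to a .pickle file."""
    with open(path, "wb") as fh:
        pickle.dump(index, fh)


def load_index(path):
    """Load an index previously written by save_index."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def search(index, query, top_k=5):
    """Rank documents by the summed tf-idf weight of the query words."""
    scores = defaultdict(float)
    for word in tokenize(query):
        for doc_id, weight in index.get(word, {}).items():
            scores[doc_id] += weight
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_k]]


if __name__ == "__main__":
    # Tiny made-up corpus, used only to exercise the functions.
    docs = {
        0: "new phone camera review",
        1: "phone battery life test",
        2: "tablet keyboard accessory",
    }
    path = os.path.join(tempfile.mkdtemp(), "inverted_index.pickle")
    save_index(build_inverted_index(docs), path)
    print(search(load_index(path), "phone battery"))
```

Keeping the word as the outer key means a query only touches the inner tables of the words it contains, so lookup cost depends on the query terms rather than on the size of the corpus.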

Running time:

The following two files carry out the task for the given domain-specific search engine:

README.md
ask_query.py
ask_query_with_gui.py
create_inverted_index.py
web_scraper.py
