Aim
This project builds a prototype search engine that works on millions of Wikipedia pages in XML format and retrieves the top 10 Wikipedia documents most relevant to the user's query. It takes as input the Wikipedia corpus in XML format, which is available from Wikipedia.org. It then indexes millions of Wikipedia pages involving a comparable number of distinct terms. Given a query, it uses the index to retrieve the relevant documents, ranked, along with their titles.
This project uses object-oriented programming (OOP) concepts.
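As an illustration of the input stage, here is a minimal sketch of streaming pages out of a Wikipedia XML dump with the JDK's StAX API. It is not the project's actual parser: `WikiPage`, `WikiDumpReader`, and `forEachPage` are hypothetical names, and the sketch assumes only the standard dump layout in which each `<page>` element contains a `<title>` and a `<text>` element.

```java
import java.io.Reader;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical holder for one parsed page; not a class from this project. */
final class WikiPage {
    final String title;
    final String text;

    WikiPage(String title, String text) {
        this.title = title;
        this.text = text;
    }
}

/** Hypothetical streaming reader; pages are handed to a callback one at a time. */
final class WikiDumpReader {
    static void forEachPage(Reader in, Consumer<WikiPage> handler) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // The dump is external input, so DTDs and external entities are switched off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        try {
            XMLStreamReader xml = factory.createXMLStreamReader(in);
            String title = null;
            StringBuilder text = new StringBuilder();
            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = xml.getLocalName();
                    if (name.equals("page")) {
                        // A new page starts: reset the per-page state.
                        title = null;
                        text.setLength(0);
                    } else if (name.equals("title")) {
                        title = xml.getElementText();
                    } else if (name.equals("text")) {
                        text.append(xml.getElementText());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && xml.getLocalName().equals("page")) {
                    // The page is complete: pass it on without keeping it in memory.
                    handler.accept(new WikiPage(title, text.toString()));
                }
            }
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse the XML dump", e);
        }
    }
}
```

A streaming parser is used in this sketch because a dump with millions of pages is unlikely to fit in memory as a single document tree; each page can be indexed and discarded as soon as its closing tag is reached.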
- The index files and the final TreeMap data structure are in place.
- Documents can be ranked using this information when the user types a query, i.e., the backend is implemented.
- The next step is to implement a GUI for the search engine, i.e., the frontend.
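To make the backend idea concrete, the following is a minimal in-memory sketch of a `TreeMap`-based inverted index with top-k retrieval. It is an illustration only: `TinyIndex`, `add`, and `search` are hypothetical names, the tokenizer is deliberately simple, and the TF-IDF style score is an assumption, since this README does not state which ranking formula the project uses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory index; the project itself stores its index in files. */
final class TinyIndex {
    // term -> (document id -> term frequency); TreeMap keeps the terms sorted.
    private final TreeMap<String, Map<Integer, Integer>> postings = new TreeMap<>();
    private final Map<Integer, String> titles = new HashMap<>();

    // Assumed tokenizer: lowercase, split on anything that is not a letter or digit.
    private static String[] tokenize(String s) {
        return s.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
    }

    void add(int docId, String title, String text) {
        titles.put(docId, title);
        for (String term : tokenize(title + " " + text)) {
            if (term.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(term, t -> new HashMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    /** Returns the titles of the k best-scoring documents, best first. */
    List<String> search(String query, int k) {
        Map<Integer, Double> scores = new HashMap<>();
        for (String term : tokenize(query)) {
            Map<Integer, Integer> docs = postings.get(term);
            if (docs == null) {
                continue; // term does not occur in the corpus
            }
            // Assumed scoring: term frequency times a smoothed inverse document frequency.
            double idf = Math.log(1.0 + (double) titles.size() / docs.size());
            for (Map.Entry<Integer, Integer> e : docs.entrySet()) {
                scores.merge(e.getKey(), e.getValue() * idf, Double::sum);
            }
        }
        List<Map.Entry<Integer, Double>> ranked = new ArrayList<>(scores.entrySet());
        ranked.sort((a, b) -> {
            int byScore = Double.compare(b.getValue(), a.getValue());
            // Ties are broken by document id so the order is deterministic.
            return byScore != 0 ? byScore : Integer.compare(a.getKey(), b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(k, ranked.size()); i++) {
            result.add(titles.get(ranked.get(i).getKey()));
        }
        return result;
    }
}
```

Calling `search(query, 10)` on such an index corresponds to the top 10 retrieval described in the aim. A frontend would only need to pass the user's query string to a method of this kind and display the returned titles.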
Directory structure
- .
- README.md
- SearchEngine
- SpamServerMinimal.iml
- pom.xml
- src
- main
- input
- java
- com
- soham
- searchengine
- config
- model
- search
- services
- util
- searchengine
- soham
- com
- output
- 0_index.txt
- 0_offsets.txt
- 0_secondry.txt
- 10_index.txt
- 10_offsets.txt
- 10_secondry.txt
- 11_index.txt
- 11_offsets.txt
- 11_secondry.txt
- 12_index.txt
- 12_offsets.txt
- 12_secondry.txt
- 13_index.txt
- 13_offsets.txt
- 13_secondry.txt
- 14_index.txt
- 14_offsets.txt
- 14_secondry.txt
- 15_index.txt
- 15_offsets.txt
- 15_secondry.txt
- 16_index.txt
- 16_offsets.txt
- 16_secondry.txt
- 17_index.txt
- 17_offsets.txt
- 17_secondry.txt
- 18_index.txt
- 18_offsets.txt
- 18_secondry.txt
- 19_index.txt
- 19_offsets.txt
- 19_secondry.txt
- 1_index.txt
- 1_offsets.txt
- 1_secondry.txt
- 20_index.txt
- 20_offsets.txt
- 20_secondry.txt
- 21_index.txt
- 21_offsets.txt
- 21_secondry.txt
- 22_index.txt
- 22_offsets.txt
- 22_secondry.txt
- 23_index.txt
- 23_offsets.txt
- 23_secondry.txt
- 24_index.txt
- 24_offsets.txt
- 24_secondry.txt
- 25_index.txt
- 25_offsets.txt
- 25_secondry.txt
- 26_index.txt
- 26_offsets.txt
- 26_secondry.txt
- 2_index.txt
- 2_offsets.txt
- 2_secondry.txt
- 3_index.txt
- 3_offsets.txt
- 3_secondry.txt
- 4_index.txt
- 4_offsets.txt
- 4_secondry.txt
- 5_index.txt
- 5_offsets.txt
- 5_secondry.txt
- 6_index.txt
- 6_offsets.txt
- 6_secondry.txt
- 7_index.txt
- 7_offsets.txt
- 7_secondry.txt
- 8_index.txt
- 8_offsets.txt
- 8_secondry.txt
- 9_index.txt
- 9_offsets.txt
- 9_secondry.txt
- allWords.txt
- resources
- main
- SeminarPresentation6920.pptx