This project implements a scalable document ranking and recommendation system designed to address the limitations of traditional keyword-based search engines.
By leveraging Apache Hadoop for distributed processing and sentence embeddings for semantic similarity, the system provides context-aware search results from massive document repositories.
Conventional Boolean search models offer little flexibility and poor ranking quality because they rely on exact keyword matches and ignore semantic relationships between terms. As the volume of unstructured data grows rapidly, context-sensitive ranking models are needed to keep retrieval effective.
This project moves search toward ranking by semantic similarity, with the aim of improving result relevance and user experience across large-scale document libraries.
- Traditional search engines only match keywords, failing to capture contextual meaning.
- Results may include irrelevant documents while missing conceptually related ones.
- With rapidly expanding repositories, conventional models struggle with scalability and efficiency.
Goal: Develop a semantic search and recommendation system that uses sentence embeddings and TF-IDF weighting to rank documents based on conceptual similarity, not just keyword overlap.
- Convert documents and queries into high-dimensional embeddings to capture semantic meaning.
- Use TF-IDF with cosine similarity for ranking document relevance.
- Employ Apache Hadoop to handle distributed computation and large-scale processing.
- Retrieve and rank the top N most relevant documents for user queries.
- Data stored and processed in HDFS (Hadoop Distributed File System).
- Apache Hadoop in Docker used for scalable distributed computation.
- Workflow:
  - Load `.txt` documents into HDFS.
  - Mapper: Tokenizes text → generates `(word, 1)`.
  - Reducer: Aggregates term frequencies (TF).
  - Second Mapper: Computes inverse document frequency (IDF).
  - TF-IDF vectors created and stored in HDFS.
  - Query processed into a TF-IDF vector.
  - Cosine similarity used to compare query vs. documents.
  - Results sorted → top N documents retrieved with ID, title, and rank.
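
The mapper and reducer steps in the workflow can be illustrated with a small local simulation. This is a minimal sketch, not the project's actual Hadoop job: the function names `mapper`, `reducer`, and `run_local` are hypothetical, the shuffle/sort phase is imitated with an in-memory sort, and the key is assumed to be `(doc_id, word)` rather than the bare word so that term frequencies stay per document.

```python
import re
from itertools import groupby

def mapper(doc_id, text):
    """Tokenize one document and emit ((doc_id, word), 1) for every token.
    Keying by (doc_id, word) is an assumption made for per-document TF."""
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        yield (doc_id, word), 1

def reducer(key, counts):
    """Sum the counts for one key to obtain its term frequency."""
    return key, sum(counts)

def run_local(docs):
    """Imitate Hadoop's shuffle/sort locally: sort pairs by key, then reduce each group."""
    pairs = sorted(kv for doc_id, text in docs.items() for kv in mapper(doc_id, text))
    return dict(
        reducer(key, (count for _, count in group))
        for key, group in groupby(pairs, key=lambda kv: kv[0])
    )

if __name__ == "__main__":
    sample = {"d1": "Hadoop stores data. Hadoop scales.", "d2": "Search ranks data"}
    for (doc_id, word), tf in sorted(run_local(sample).items()):
        print(doc_id, word, tf)
```

On a real cluster the same two functions would typically read from standard input and write tab-separated key/value lines, with Hadoop performing the sort between them.
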
- Upload documents into HDFS.
- Preprocess with MapReduce (TF-IDF computation).
- Build inverted index for document-term mapping.
- Accept user query and compute its TF-IDF representation.
- Compare query vector with document vectors using cosine similarity.
- Rank and return top N most relevant documents.
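
The query-time steps above (inverted index, query TF-IDF vector, cosine similarity, top N) can be sketched on a single machine as follows. This is an illustrative sketch under stated assumptions: the helper names `build_index`, `cosine`, and `search` are hypothetical, vectors are plain dictionaries held in memory rather than files in HDFS, and the smoothed IDF formula `log((1 + N) / (1 + df)) + 1` is one common variant that the project may or may not use.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Return TF-IDF vectors, the IDF table, and an inverted index (term -> doc IDs)."""
    tfs = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    inverted = {}
    for doc_id, tf in tfs.items():
        for term in tf:
            inverted.setdefault(term, set()).add(doc_id)
    n = len(docs)
    # Smoothed IDF (an assumed variant): document frequency is the posting-list size.
    idf = {t: math.log((1 + n) / (1 + len(ids))) + 1 for t, ids in inverted.items()}
    vectors = {d: {t: c * idf[t] for t, c in tf.items()} for d, tf in tfs.items()}
    return vectors, idf, inverted

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query, vectors, idf, inverted, top_n=3):
    """Rank only documents that share at least one term with the query."""
    q_vec = {t: c * idf[t] for t, c in Counter(tokenize(query)).items() if t in idf}
    candidates = set().union(*(inverted[t] for t in q_vec))
    scored = [(cosine(q_vec, vectors[d]), d) for d in candidates]
    scored.sort(key=lambda s: (-s[0], s[1]))  # highest score first, ties by doc ID
    return [(doc_id, score) for score, doc_id in scored[:top_n]]

if __name__ == "__main__":
    docs = {
        "d1": "hadoop distributed storage",
        "d2": "semantic search ranking",
        "d3": "distributed search on hadoop",
    }
    vectors, idf, inverted = build_index(docs)
    for rank, (doc_id, score) in enumerate(search("semantic search", vectors, idf, inverted), 1):
        print(rank, doc_id, round(score, 3))
```

Using the inverted index to select candidates means documents with no query terms are never scored, which is the main reason to build it before ranking.
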
- Semantic search with embeddings.
- Distributed processing using Hadoop + Docker.
- Efficient TF-IDF ranking with cosine similarity.
- Improved user experience in large-scale search applications.