This project implements a hybrid search system for scientific articles, combining traditional full-text search with modern embedding-based semantic search. The system is designed to provide more relevant and diverse search results by leveraging the strengths of both search methods.
Features:
- Full-text search using Elasticsearch
- Semantic search using sentence embeddings (SentenceTransformer)
- Hybrid search combining both methods
- Evaluation metrics (MAP and NDCG) for search quality assessment
- Parameter optimization for improved performance
- Simple web interface for user interaction
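As a rough illustration of how the hybrid step might combine the two methods, the sketch below min-max normalizes each method's scores and blends them with a single weight. The function names, the `alpha` weight, and the sample scores are assumptions made for this example; they are not taken from `hybrid_search.py`, which may fuse results differently.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_rank(
    text_scores: Dict[str, float],
    semantic_scores: Dict[str, float],
    alpha: float = 0.5,
) -> List[Tuple[str, float]]:
    """Blend two score sets; alpha weights full-text, (1 - alpha) semantic.

    A document missing from one result set contributes 0.0 for that method.
    """
    text_norm = min_max_normalize(text_scores)
    sem_norm = min_max_normalize(semantic_scores)
    doc_ids = set(text_norm) | set(sem_norm)
    combined = {
        doc_id: alpha * text_norm.get(doc_id, 0.0)
        + (1 - alpha) * sem_norm.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # Sort by descending score, breaking ties by document id for determinism.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


# Made-up scores: Elasticsearch-style relevance scores and cosine similarities.
text_scores = {"a": 12.0, "b": 8.0, "c": 4.0}
semantic_scores = {"b": 0.9, "c": 0.7, "d": 0.5}
ranking = hybrid_rank(text_scores, semantic_scores, alpha=0.5)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.2f}")
```

Normalizing first matters because full-text relevance scores and cosine similarities live on different scales; without it, one method would dominate the blend regardless of `alpha`.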
Requirements:
- Python 3.8+
- Elasticsearch
- Flask
- SentenceTransformer
- pandas
- scikit-learn
- NumPy
Project structure:
scientific_article_search/
│
├── data/
│ ├── train_data.csv
│ └── test_data.csv
│
├── src/
│ ├── data_preprocessing.py
│ ├── full_text_search.py
│ ├── semantic_search.py
│ ├── hybrid_search.py
│ ├── evaluation.py
│ └── app.py
│
├── templates/
│ └── index.html
│
├── requirements.txt
└── README.md
- Clone the repository:
    git clone https://github.com/yourusername/scientific_article_search.git
    cd scientific_article_search
- Create and activate a virtual environment:
    python -m venv venv
    source venv/bin/activate  # On Windows, use: venv\Scripts\activate
- Install the required packages:
    pip install -r requirements.txt
- Install and run Elasticsearch (follow instructions from the official Elasticsearch documentation)
- Prepare the data:
    python src/data_preprocessing.py
- Run the Flask application:
    python src/app.py
- Open a web browser and navigate to http://localhost:5000
- Enter a search query in the provided input field and click "Search" to see the results
To evaluate and optimize the search system:
python src/evaluation.py
The script prints the initial performance metrics (MAP and NDCG) and the optimized parameters for the hybrid search system.
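For readers unfamiliar with the two metrics, here is a minimal sketch of how MAP and NDCG can be computed for ranked results. The function names and the sample data are assumptions for illustration, and the NDCG variant shown uses linear gain with a log2 discount; `evaluation.py` may define the metrics differently.

```python
import math
from typing import Dict, List, Sequence, Set


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision@k over the ranks k where a relevant document appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: List[Sequence[str]], relevant_sets: List[Set[str]]
) -> float:
    """Average the per-query average precision over all queries."""
    if not runs:
        return 0.0
    total = sum(average_precision(r, rel) for r, rel in zip(runs, relevant_sets))
    return total / len(runs)


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain: linear gain, log2(rank + 1) discount."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))


def ndcg(ranked: Sequence[str], gains: Dict[str, float], k: int = 10) -> float:
    """DCG of the top-k ranking divided by the DCG of the ideal top-k ordering."""
    actual = [gains.get(doc_id, 0.0) for doc_id in ranked[:k]]
    ideal = sorted(gains.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(actual) / ideal_dcg if ideal_dcg > 0 else 0.0


# Made-up example: one query with four results, two of which are relevant.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
ap = average_precision(ranked, relevant)
map_score = mean_average_precision([ranked], [relevant])
ndcg_score = ndcg(ranked, {"d1": 1.0, "d3": 1.0}, k=4)
print(f"AP={ap:.4f} MAP={map_score:.4f} NDCG@4={ndcg_score:.4f}")
```

MAP rewards placing every relevant document early, while NDCG also handles graded relevance, which is one reason to report both.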
Future improvements:
- Implement more advanced NLP techniques (e.g., named entity recognition, topic modeling)
- Expand the dataset to cover a broader range of scientific domains
- Develop a more sophisticated user interface with advanced search options
- Implement user feedback mechanisms to continuously improve search results
- Explore cloud deployment options for scalability
Contributions to this project are welcome. Please fork the repository and submit a pull request with your proposed changes.
This project is licensed under the MIT License.