This Python module is designed as an educational and intuitive module for understanding the fundamentals of Information Retrieval (IR). It provides simple implementations of basic IR techniques with minimal dependencies.
The module is intended for educational purposes and is not suitable for production environments.
- Simple Search and Ranking: Provides simple search and ranking functionality using the Vector Space Model.
- Automated Testing: Includes a suite of automated tests to ensure the reliability and correctness of the implementation.
- Minimal Dependencies: Includes only few dependencies (word stemming & tokenise) to keep the module lightweight and easy to understand.
- Progress Tracking: Utilizes the
tqdm
library to provide real-time progress updates during lengthy operations, enhancing user experience. - Detailed Logging: Incorporates a logging system to track the operations, helping in debugging and ensuring transparency of the process.
- Object-Oriented Design: The module is designed with a focus on modularity and extensibility with little abstraction (OOP).
- easy to extend with new features and new weighting models
- Optimisation Techniques: Some optimisation techniques added to speed up the computation process.
- cache inverse documents frequency
- use matrix multiplication for vector operations
- process larger documents first (the remaining documents for computation will be smaller and smaller)
Install all the dependencies using pip
within virtual environment:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# install dev dependencies
pip install -r requirements-dev.txt
Example of the data directory structure:
./data/
└── EnglishNews
├── News6.txt
├── News78.txt
├── News84.txt
├── News2994.txt
└── News3000.txt
Vector space model can be built by given collection of documents in the EnglishNews
directory.
Build vector space model in main.py
and run:
python main.py
# or with arguments
python main.py --sample-size 1000 --query "London BBC breaking news" --logging-level CRITICAL
Here's the code snippet to save and load the model, with joblib
library:
the joblib
library is particularly efficient for objects that carry large numpy arrays, which might be the case with a vector space model.
import joblib
vs = VectorSpace(
weighting_model=BM25(),
parser=Parser(stemmer=SnowballStemmer(language="english")),
logging_level=logging_level,
)
vs.build(documents_directory=files_path, sample_size=sample_size)
# saving the model to disk
joblib.dump(vs, os.path.join("vsm", "bm25_snwball_vs.joblib"))
# load model from disk
vs_loaded = joblib.load(os.path.join("vsm", "bm25_snwball_vs.joblib"))
vs_loaded.search("London BBC breaking news")
Check main.py
for examples.
Since Makefile
is provided, you can run all tests with:
# run all tests
make test
# run coverage
make cov
This project began as a part of a course on Web Search and Mining, taught by Professor Tsai at National Chengchi University (NCCU). I extend my heartfelt gratitude to Professor Tsai for his invaluable guidance and the insights that sparked the development of this module.
A special acknowledgment goes to the adage that reminds us that software does not merely get built; it grows.
TODO: UML diagram