The class BooleanModel.py
implements a toy search engine to illustrate the boolean retrieval model for text documents.
The program asks you to enter a search query, and then returns all documents matching the query (exact match), in no particular order (unranked retrieval).
The document corpus consists of documents, which are short stories downloaded from here.
- Install Python 3.6+
- Install all pip requirements from the
requirements.txt
:
$ python3 -m pip install -r requirements.txt
- To download stopwords used for the model, open your terminal or command prompt and enter following commands:
$ python3
>>> import nltk
>>> nltk.download('stopwords')
# Import boolean model
from BooleanModel import BooleanModel
# Create a model on your corpus of documents by passing it's path as an argument
model = BooleanModel("./corpus/*")
# Query on it as many times as you like
results = model.query("book")
# results = ['Freeway Chase Ends at Newsstand.txt', 'A Festival of Books.txt']
# Querying on a word which is not in the corpus
results = model.query("pikachu")
# Warning: pikachu was not found in the corpus!
# results = []
- Single term =>
ash
- AND =>
ash & may
- OR =>
ash | may & brown
- Parenthesis =>
( ash | may ) & brown
- NOT =>
( ~ash | may ) & brown
Precedence: NOT (~) > AND (&) > OR (|)
- NOT operator on an intermediate result =>
~( ash | may ) & brown
- Spaces between NOT operator and operand =>
~ ash & may
-
Preprocessing to build standard inverted index
- Remove special characters
- Remove digits
- Tokenize
- Lowercasing
- Stemming using
PorterStemmer
- Add unique words and their postings to the index
-
Refer to this for the internals of boolean model and query evaluation
- In case of start byte invalid errors, check for character encodings of the documents in corpus. (Currently,
utf-8
is used.)
MIT