Boolean Retrieval Model

The BooleanModel class in BooleanModel.py implements a toy search engine that illustrates the Boolean retrieval model for text documents.

The program asks you to enter a search query, and then returns all documents matching the query (exact match), in no particular order (unranked retrieval).

The corpus is a collection of short stories downloaded from here.
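To make the exact-match, unranked behavior concrete, here is a minimal sketch of Boolean retrieval over Python sets. The postings and document names are made up for illustration and are not taken from the repository or its corpus.

```python
# Toy postings: term -> set of documents containing it (hypothetical data).
index = {
    "ash": {"d1.txt", "d2.txt"},
    "may": {"d2.txt", "d3.txt"},
    "brown": {"d3.txt"},
}
all_docs = {"d1.txt", "d2.txt", "d3.txt", "d4.txt"}


def postings(term):
    """Return the documents containing `term` (empty set if unseen)."""
    return index.get(term, set())


and_result = postings("ash") & postings("may")    # AND -> set intersection
or_result = postings("ash") | postings("brown")   # OR  -> set union
not_result = all_docs - postings("ash")           # NOT -> complement

print(sorted(and_result))  # ['d2.txt']
print(sorted(or_result))   # ['d1.txt', 'd2.txt', 'd3.txt']
print(sorted(not_result))  # ['d3.txt', 'd4.txt']
```

A document either satisfies the expression or it does not, which is why the result carries no ranking.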

Getting Started

Install Python 3.6+
Install the pip requirements listed in requirements.txt:

$ python3 -m pip install -r requirements.txt

To download the stopwords used by the model, open your terminal or command prompt and enter the following commands:

$ python3
>>> import nltk
>>> nltk.download('stopwords')

Usage

# Import boolean model
from BooleanModel import BooleanModel

# Create a model on your corpus of documents by passing its path as an argument
model = BooleanModel("./corpus/*")

# Query on it as many times as you like
results = model.query("book")

# results = ['Freeway Chase Ends at Newsstand.txt', 'A Festival of Books.txt']

# Querying on a word which is not in the corpus
results = model.query("pikachu")

# Warning: pikachu was not found in the corpus!
# results = []

Queries

Supported Queries

Single term => ash
AND => ash & may
OR => ash | may & brown
Parentheses => ( ash | may ) & brown
NOT => ( ~ash | may ) & brown

Precedence: NOT (~) > AND (&) > OR (|)
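One common way to honor this precedence is to convert the infix query to postfix with the shunting-yard algorithm. The sketch below is an independent illustration; the names `PRECEDENCE`, `tokenize`, and `to_postfix` are hypothetical, and the repository's own parsing may differ.

```python
# Higher number binds tighter: NOT (~) > AND (&) > OR (|).
PRECEDENCE = {"~": 3, "&": 2, "|": 1}


def tokenize(query):
    """Split a query into terms, operators, and parentheses.

    A "~" attached to a term (e.g. "~ash") becomes two tokens: "~", "ash".
    """
    tokens = []
    for part in query.replace("(", " ( ").replace(")", " ) ").split():
        if part.startswith("~") and len(part) > 1:
            tokens.extend(["~", part[1:]])
        else:
            tokens.append(part)
    return tokens


def to_postfix(query):
    """Convert an infix Boolean query to a postfix token list."""
    output, stack = [], []
    for tok in tokenize(query):
        if tok == "(":
            stack.append(tok)
        elif tok == ")":
            # Flush operators back to the matching "(".
            while stack and stack[-1] != "(":
                output.append(stack.pop())
            if not stack:
                raise ValueError("unbalanced parentheses")
            stack.pop()  # discard the "("
        elif tok == "~":
            stack.append(tok)  # unary prefix operator: just push it
        elif tok in PRECEDENCE:
            # Pop operators that bind at least as tightly (left-associative).
            while (stack and stack[-1] != "("
                   and PRECEDENCE[stack[-1]] >= PRECEDENCE[tok]):
                output.append(stack.pop())
            stack.append(tok)
        else:
            output.append(tok)  # a search term
    while stack:
        op = stack.pop()
        if op == "(":
            raise ValueError("unbalanced parentheses")
        output.append(op)
    return output


print(to_postfix("ash | may & brown"))        # ['ash', 'may', 'brown', '&', '|']
print(to_postfix("( ash | may ) & brown"))    # ['ash', 'may', '|', 'brown', '&']
print(to_postfix("( ~ash | may ) & brown"))   # ['ash', '~', 'may', '|', 'brown', '&']
```

In the first query the `&` is emitted before the `|`, so `may & brown` is evaluated first, matching the precedence stated above.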

Unsupported Queries

NOT operator on an intermediate result => ~( ash | may ) & brown
Spaces between NOT operator and operand => ~ ash & may
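A negated group can usually be rewritten with De Morgan's law so that NOT applies only to single terms: `~( ash | may ) & brown` is logically equivalent to `~ash & ~may & brown`. The sketch below checks that equivalence on made-up postings; it assumes nothing about how BooleanModel evaluates queries, and whether the rewritten form is accepted by the tool should be verified by running it.

```python
# Hypothetical postings used only to check the equivalence.
index = {
    "ash": {"d1.txt", "d2.txt"},
    "may": {"d2.txt", "d3.txt"},
    "brown": {"d3.txt", "d4.txt"},
}
all_docs = {"d1.txt", "d2.txt", "d3.txt", "d4.txt", "d5.txt"}


def docs(term):
    """Documents containing `term`."""
    return index.get(term, set())


def negate(doc_set):
    """Documents NOT in `doc_set`."""
    return all_docs - doc_set


# ~( ash | may ) & brown : NOT applied to an intermediate result
grouped = negate(docs("ash") | docs("may")) & docs("brown")

# ~ash & ~may & brown : NOT applied to single terms only
rewritten = negate(docs("ash")) & negate(docs("may")) & docs("brown")

print(sorted(grouped))       # ['d4.txt']
print(grouped == rewritten)  # True
```

The second unsupported form needs no rewriting beyond removing the space: write `~ash & may` instead of `~ ash & may`.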

Methodology

Preprocessing to build a standard inverted index:
- Remove special characters
- Remove digits
- Tokenize
- Lowercase
- Stem using PorterStemmer
- Add unique words and their postings to the index

Refer to this for the internals of the Boolean model and query evaluation.
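The preprocessing steps above can be sketched roughly as follows. This is an illustration, not the code in BooleanModel.py: `preprocess` and `build_index` are hypothetical names, special characters are assumed to be replaced by spaces, and the stemmer defaults to a no-op so the sketch runs without NLTK.

```python
import re
from collections import defaultdict


def preprocess(text, stem=lambda word: word):
    """Turn one document's text into a list of index terms.

    `stem` defaults to a no-op; pass nltk.stem.PorterStemmer().stem
    to apply Porter stemming.
    """
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # remove special characters
    text = re.sub(r"\d+", " ", text)              # remove digits
    tokens = text.split()                         # tokenize on whitespace
    return [stem(tok.lower()) for tok in tokens]  # lowercase, then stem


def build_index(documents, stem=lambda word: word):
    """Map each unique term to the sorted list of documents containing it."""
    postings = defaultdict(set)
    for name, text in documents.items():
        for term in preprocess(text, stem):
            postings[term].add(name)
    return {term: sorted(names) for term, names in postings.items()}


# Hypothetical two-document corpus.
docs = {
    "a.txt": "The 2 books were sold!",
    "b.txt": "A book, a newsstand.",
}
index = build_index(docs)
print(index["book"])   # ['b.txt']
print(index["books"])  # ['a.txt']
```

With the default no-op stemmer, "book" and "books" stay separate terms; a Porter stemmer would reduce both to "book", so a query for either form would match both documents.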

Note

If you get an "invalid start byte" error (a UnicodeDecodeError), check the character encoding of the documents in the corpus. (Currently, UTF-8 is assumed.)
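A quick way to locate the offending files is to try decoding each one as UTF-8. The helper below is a hypothetical sketch (`find_non_utf8_files` is not part of the repository); the demo builds a temporary folder so it runs without the real corpus.

```python
import tempfile
from pathlib import Path


def find_non_utf8_files(folder):
    """Return the names of files in `folder` that fail to decode as UTF-8."""
    bad = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(path.name)
    return bad


# Demo on a throwaway folder: one valid UTF-8 file and one file containing
# the byte 0x92 (a curly apostrophe in Windows-1252), which is not valid UTF-8.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "good.txt").write_bytes("caf\u00e9".encode("utf-8"))
    Path(tmp, "bad.txt").write_bytes(b"it\x92s a test")
    demo_result = find_non_utf8_files(tmp)

print(demo_result)  # ['bad.txt']
```

To check the real corpus, call `find_non_utf8_files("./corpus")` and re-save any reported files as UTF-8.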

Authors

Mayank Jain

License

MIT

mayank-02/boolean-retrieval-model
