This project is part of the CS6320 - Natural Language Processing course at UT Dallas. The objective is to build and evaluate an n-gram-based language model to predict the next word in a sequence, demonstrating fundamental concepts of statistical language modeling.
- Aankit Das
- Kedar Kadu
Group 21
The key goals of this project are to:
- Implement a language model using n-grams, where n = 1 (unigram) and n = 2 (bigram).
- Preprocess textual data to ensure the language model functions effectively.
- Explore the effects of different values of n in generating text and predicting word sequences.
- Evaluate the model using metrics such as perplexity to assess language model performance.
- Text Preprocessing (see the preprocessing sketch after this list):
  - Tokenization and cleaning of text data (removal of punctuation, stopwords, etc.).
  - Conversion of text to lower case for uniformity.
- N-Gram Model Construction (see the bigram-counting sketch after this list):
  - Use bigrams and trigrams for the language model.
  - Compute n-gram probabilities by counting the frequency of word sequences in the training data.
- Smoothing Techniques (see the add-k smoothing sketch after this list):
  - Apply smoothing techniques such as Laplace smoothing, Add-k smoothing, and Kneser-Ney smoothing to handle unseen word sequences during testing.
- Evaluation (see the perplexity sketch after this list):
  - Test the model on a validation set and compute performance metrics like perplexity and accuracy.
  - Compare results across different n-gram orders and smoothing techniques.
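
A minimal text-preprocessing sketch, illustrating the steps above rather than reproducing the exact code in model.ipynb; it assumes nltk is installed and the punkt and stopwords resources can be downloaded:

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time downloads for the tokenizer model and the English stopword list.
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

def preprocess(text):
    """Lowercase, tokenize, and drop punctuation tokens and stopwords."""
    tokens = word_tokenize(text.lower())
    stop_words = set(stopwords.words("english"))
    return [t for t in tokens if t not in string.punctuation and t not in stop_words]

print(preprocess("The quick brown fox jumps over the lazy dog."))
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```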
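
A sketch of bigram counting and maximum-likelihood probability estimation; the toy token list and function name here are illustrative only:

```python
from collections import Counter

def bigram_mle(tokens):
    """Estimate P(w2 | w1) = count(w1, w2) / count(w1) from a token list."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens[:-1], tokens[1:]))
    return {(w1, w2): c / unigram_counts[w1] for (w1, w2), c in bigram_counts.items()}

tokens = ["i", "like", "nlp", "i", "like", "python"]
probs = bigram_mle(tokens)
print(probs[("i", "like")])    # 1.0  -- "like" always follows "i" in this toy corpus
print(probs[("like", "nlp")])  # 0.5
```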
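
One of the smoothing variants listed above, add-k (Laplace when k = 1), sketched on the same toy data; Kneser-Ney is more involved and is not shown here:

```python
from collections import Counter

def add_k_bigram_prob(w1, w2, unigram_counts, bigram_counts, vocab_size, k=1.0):
    """P(w2 | w1) with add-k smoothing: (count(w1, w2) + k) / (count(w1) + k * V)."""
    return (bigram_counts[(w1, w2)] + k) / (unigram_counts[w1] + k * vocab_size)

tokens = ["i", "like", "nlp", "i", "like", "python"]
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens[:-1], tokens[1:]))
vocab_size = len(unigram_counts)

# A seen bigram keeps most of the probability mass...
print(add_k_bigram_prob("i", "like", unigram_counts, bigram_counts, vocab_size))   # (2 + 1) / (2 + 4) = 0.5
# ...while an unseen bigram still receives a small, nonzero probability.
print(add_k_bigram_prob("i", "python", unigram_counts, bigram_counts, vocab_size)) # (0 + 1) / (2 + 4) ≈ 0.167
```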
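
A perplexity sketch for a bigram model with add-1 smoothing; the actual train/validation split and unknown-word handling in the notebook may differ:

```python
import math
from collections import Counter

def perplexity(test_tokens, prob_fn):
    """Perplexity = exp(-(1/N) * sum of log P(w_i | w_{i-1})) over the test bigrams."""
    log_prob, n = 0.0, 0
    for w1, w2 in zip(test_tokens[:-1], test_tokens[1:]):
        log_prob += math.log(prob_fn(w1, w2))
        n += 1
    return math.exp(-log_prob / n)

# Toy example reusing an add-1-smoothed bigram model.
train = ["i", "like", "nlp", "i", "like", "python"]
uni = Counter(train)
bi = Counter(zip(train[:-1], train[1:]))
V = len(uni)

def prob(w1, w2):
    return (bi[(w1, w2)] + 1) / (uni[w1] + V)

print(round(perplexity(["i", "like", "nlp"], prob), 3))  # 2.449 (= sqrt(6) on this toy example)
```

Lower perplexity means the model assigns higher probability to the held-out text, which is the basis for comparing n-gram orders and smoothing techniques.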
- A1_DATASET/: Contains the training and testing datasets used for the language model.
- model.ipynb: Python notebook that implements the n-gram model, text preprocessing, and evaluation.
- README.md: This file, providing an overview of the project.
- Clone the repository:

  ```bash
  git clone https://github.com/your-repo/n-gram-language-model.git
  cd n-gram-language-model
  ```
- Install dependencies:
  Install the necessary Python libraries (e.g., nltk, numpy) by running:

  ```bash
  pip install -r requirements.txt
  ```
- Run the n-gram model: After installing the necessary packages, open and run the Python notebook model.ipynb (one example command is shown below).
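
  For example, assuming Jupyter is installed in your environment (it is not listed above, so treat this as one possible way to launch the notebook):

  ```bash
  jupyter notebook model.ipynb
  ```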