This repository contains a language modeling project using Feedforward Neural Networks (FFNN), Recurrent Neural Networks (RNN), and Long Short-Term Memory (LSTM) models for text prediction. The models are trained on classic literature corpora and can predict the next word in a given sequence.
- Corpus Cleaning: Preprocessing text from "Pride and Prejudice" and "Ulysses."
- Tokenization: Converts text into structured tokens for model training.
- Dataset Preparation: Supports both sequence-based and N-gram-based training (see the sketch after this list).
- Multiple Language Models: Implements FFNN, RNN, and LSTM for comparison.
- Next-Word Prediction: Predicts the most probable next words based on input.
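The cleaning and N-gram preparation steps can be pictured roughly as below; the file path, function names, and exact tokenization rules are illustrative assumptions, not the repository's actual preprocessing code.

```python
# Minimal sketch of corpus cleaning and N-gram dataset preparation
# (assumed approach; the repository's preprocessing may differ in detail).
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_ngram_examples(tokens, n):
    """Return (context, target) pairs: n context words -> the next word."""
    return [(tokens[i:i + n], tokens[i + n]) for i in range(len(tokens) - n)]

# Hypothetical corpus path, for illustration only.
with open("pride_and_prejudice.txt", encoding="utf-8") as f:
    tokens = tokenize(f.read())

examples = build_ngram_examples(tokens, n=3)
print(examples[0])  # e.g. (['it', 'is', 'a'], 'truth')
```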
- Clone the repository:

  ```bash
  git clone https://github.com/gojira69/Next-Word-Prediction.git
  cd Next-Word-Prediction
  ```

- Install dependencies:

  ```bash
  pip install torch numpy
  ```

- Run the script interactively:

  ```bash
  python generator.py
  ```

  Follow the prompts:
- Enter the path to the text corpus.
- Choose the language model type (`f` for FFNN, `r` for RNN, `l` for LSTM).
- Provide the number of next-word candidates.
- Input a sentence to get predictions.
- Uses word embeddings and dense layers.
- Predicts the next word based on the N previous words (N = 3 and N = 5).
- Implemented in the `FFNN` class.
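A feedforward next-word model of this kind can be sketched as follows; the class name, layer sizes, and default context length are assumptions for illustration, not the repository's exact `FFNN` class.

```python
# Minimal sketch of an N-gram feedforward language model in PyTorch.
import torch
import torch.nn as nn

class FFNNLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=256, n=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.fc1 = nn.Linear(n * embed_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, n) indices of the N previous words
        embeds = self.embedding(context_ids)        # (batch, n, embed_dim)
        flat = embeds.view(embeds.size(0), -1)      # concatenate context embeddings
        hidden = torch.relu(self.fc1(flat))
        return self.fc2(hidden)                     # unnormalized next-word scores
```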
- Uses recurrent layers to capture sequential patterns.
- Trained using entire tokenized sentences.
- Implemented in the `RNN` class.
- Improves RNN by handling long-range dependencies.
- More effective for structured text.
- Implemented in the `LSTM` class.
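Both recurrent variants follow the same pattern and differ mainly in the recurrent cell. The sketch below is an assumed simplification, not the repository's `RNN`/`LSTM` classes; names and sizes are illustrative.

```python
# Minimal sketch of a recurrent next-word model; switching rnn_type between
# "rnn" and "lstm" mirrors the two architectures described above.
import torch.nn as nn

class RecurrentLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=256, rnn_type="lstm"):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        if rnn_type == "lstm":
            self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        else:
            self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) indices of an entire tokenized sentence
        embeds = self.embedding(token_ids)
        outputs, _ = self.rnn(embeds)   # hidden state at every position
        return self.fc(outputs)         # next-word scores at each position
```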
Analysis of the two corpora showed very different sentence-length distributions: Ulysses contains long sentences far more frequently, on the order of 10³ to 10⁴ tokens, than Pride and Prejudice.
For computational tractability, only sentences with fewer than 100 tokens were used when training the RNNs and LSTMs.
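Assuming `tokenized_sentences` is a list of token lists (one per sentence), that filter amounts to a single comprehension:

```python
# Keep only sentences with fewer than 100 tokens for the recurrent models.
short_sentences = [s for s in tokenized_sentences if len(s) < 100]
```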
```
.
├── generator_clean.py   # Main script for training and prediction
├── pretrained_models/   # Directory to store pre-trained models
└── README.md            # Documentation
```
Pre-trained models are available here.
Input: "It is a truth universally"
Output:
1. acknowledged (0.75)
2. known (0.12)
3. accepted (0.05)
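Ranked candidates like these come from taking the top-k entries of a softmax over the model's output scores. A sketch, assuming a trained model and vocabulary mappings (`word_to_id`, `id_to_word` are illustrative names, not the repository's API):

```python
# Sketch of producing ranked next-word candidates with probabilities.
# `model`, `word_to_id`, and `id_to_word` are assumed to exist from training.
import torch

def predict_next_words(model, context_tokens, word_to_id, id_to_word, k=3):
    ids = torch.tensor([[word_to_id[w] for w in context_tokens]])
    with torch.no_grad():
        logits = model(ids)                       # (1, vocab_size) for the FFNN sketch
        probs = torch.softmax(logits[0], dim=-1)
    top = torch.topk(probs, k)
    return [(id_to_word[i.item()], round(p.item(), 2))
            for p, i in zip(top.values, top.indices)]
```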
For hyperparameter tuning, use the Python notebooks provided for each of the three architectures to experiment with different settings; logging, model saving, and related utilities are already implemented in them.
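A sweep of the kind those notebooks enable might look like the following, reusing the `FFNNLanguageModel` sketch from above; the grid values, `vocab_size`, and the loop structure are assumptions rather than the notebooks' actual code.

```python
# Illustrative hyperparameter grid for the FFNN sketch defined earlier.
for embed_dim in (50, 100, 200):
    for hidden_dim in (128, 256):
        model = FFNNLanguageModel(vocab_size, embed_dim, hidden_dim, n=3)
        # ... train, log validation perplexity, and save the best checkpoint
```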
- Support for more complex architectures like Transformer-based models.
- Addition of more diverse text corpora.
- Optimization for larger datasets.

