
Next Word Prediction

This repository contains a language modeling project using Feedforward Neural Networks (FFNN), Recurrent Neural Networks (RNN), and Long Short-Term Memory (LSTM) models for text prediction. The models are trained on classic literature corpora and can predict the next word in a given sequence.

Features

  • Corpus Cleaning: Preprocessing text from "Pride and Prejudice" and "Ulysses."
  • Tokenization: Converts text into structured tokens for model training.
  • Dataset Preparation: Supports both sequence-based and N-gram-based training (see the sketch after this list).
  • Multiple Language Models: Implements FFNN, RNN, and LSTM for comparison.
  • Next-Word Prediction: Predicts the most probable next words based on input.
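
As an illustration of the N-gram-based preparation, a minimal sketch that turns a token list into (context, target) pairs; the function and variable names here are hypothetical and not the repository's actual code:

    # Sketch of N-gram dataset preparation: each training example pairs the
    # N preceding tokens with the token that follows them.
    def build_ngram_pairs(tokens, n):
        pairs = []
        for i in range(n, len(tokens)):
            context = tokens[i - n:i]   # the N previous words
            target = tokens[i]          # the word to predict
            pairs.append((context, target))
        return pairs

    tokens = "it is a truth universally acknowledged".split()
    print(build_ngram_pairs(tokens, 3))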

Installation

  1. Clone the repository:

    git clone https://github.com/gojira69/Next-Word-Prediction.git
    cd Next-Word-Prediction
  2. Install dependencies:

    pip install torch numpy

Usage

Run the script interactively:

python generator.py

Follow the prompts:

  • Enter the path to the text corpus.
  • Choose the language model type (f for FFNN, r for RNN, l for LSTM).
  • Provide the number of next-word candidates.
  • Input a sentence to get predictions.

Model Details

Feedforward Neural Network (FFNN)

  • Uses word embeddings and dense layers.
  • Predicts the next word based on the N previous words (N = 3 or 5).
  • Implemented in the FFNN class (a minimal sketch follows).
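
A minimal PyTorch sketch of such a model, with assumed layer sizes; the repository's FFNN class may be structured differently:

    import torch
    import torch.nn as nn

    # Illustrative FFNN next-word model: embed the N context words, concatenate
    # their embeddings, and score every word in the vocabulary.
    class FFNN(nn.Module):
        def __init__(self, vocab_size, embed_dim=100, hidden_dim=256, n=3):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim)
            self.fc1 = nn.Linear(n * embed_dim, hidden_dim)
            self.fc2 = nn.Linear(hidden_dim, vocab_size)

        def forward(self, context):            # context: (batch, n) word indices
            emb = self.embedding(context)      # (batch, n, embed_dim)
            flat = emb.view(emb.size(0), -1)   # concatenate the N embeddings
            return self.fc2(torch.relu(self.fc1(flat)))  # next-word scores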

Recurrent Neural Network (RNN)

  • Uses recurrent layers to capture sequential patterns.
  • Trained using entire tokenized sentences.
  • Implemented in the RNN class.

Long Short-Term Memory (LSTM)

  • Improves RNN by handling long-range dependencies.
  • More effective for structured text.
  • Implemented in the LSTM class (a minimal sketch follows).
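
An illustrative PyTorch sketch of the recurrent models, again with assumed layer sizes (the repository's RNN and LSTM classes may differ); the plain RNN variant is the same with nn.RNN in place of nn.LSTM:

    import torch.nn as nn

    # Illustrative LSTM language model: embed the token sequence, run it through
    # an LSTM, and predict the next word at every position.
    class LSTM(nn.Module):
        def __init__(self, vocab_size, embed_dim=100, hidden_dim=256):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.fc = nn.Linear(hidden_dim, vocab_size)

        def forward(self, sequence):        # sequence: (batch, seq_len) indices
            emb = self.embedding(sequence)  # (batch, seq_len, embed_dim)
            out, _ = self.lstm(emb)         # hidden state at each time step
            return self.fc(out)             # next-word scores per position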

An analysis of the corpora themselves showed very different sentence-length distributions: the Ulysses corpus contained far more frequent and longer sentences, on the order of 10³ and 10⁴ tokens, than the Pride and Prejudice corpus.

(Figures: sentence-length distributions for the Pride and Prejudice and Ulysses corpora.)

Thus, for computational reasons, only sentences with fewer than 100 tokens were used for the RNN and LSTM models.
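
A filter of this kind might look like the following sketch (the exact cutoff handling in the repository may differ):

    MAX_TOKENS = 100  # cutoff used for the recurrent models

    # Keep only tokenized sentences shorter than the cutoff; `sentences` is a
    # list of token lists produced by the tokenization step.
    def filter_sentences(sentences, max_tokens=MAX_TOKENS):
        return [s for s in sentences if len(s) < max_tokens]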


File Structure

.
├── generator_clean.py   # Main script for training and prediction
├── pretrained_models/   # Directory to store pre-trained models
└── README.md            # Documentation

Pre-trained models are available here.


Example Prediction

Input: "It is a truth universally"
Output:
1. acknowledged (0.75)
2. known (0.12)
3. accepted (0.05)
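
Candidate lists like the one above are typically produced by taking a softmax over the model's output scores and keeping the k most probable words; a minimal sketch, with illustrative variable names:

    import torch

    # Illustrative top-k selection from a model's next-word scores (logits).
    # `logits` has shape (vocab_size,); `index_to_word` maps indices to strings.
    def top_k_predictions(logits, index_to_word, k=3):
        probs = torch.softmax(logits, dim=-1)
        values, indices = torch.topk(probs, k)
        return [(index_to_word[i.item()], p.item()) for i, p in zip(indices, values)]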

Hyperparameter Tuning

For hyperparameter tuning, it is advisable to use the Python notebooks for the three neural architectures to try different hyperparameters. Logging, model saving, etc. are all implemented.

Future Work

  • Support for more complex architectures like Transformer-based models.
  • Addition of more diverse text corpora.
  • Optimization for larger datasets.
