
Next Word Prediction

This repository contains a language modeling project using Feedforward Neural Networks (FFNN), Recurrent Neural Networks (RNN), and Long Short-Term Memory (LSTM) models for text prediction. The models are trained on classic literature corpora and can predict the next word in a given sequence.

Features

  • Corpus Cleaning: Preprocessing text from "Pride and Prejudice" and "Ulysses."
  • Tokenization: Converts text into structured tokens for model training.
  • Dataset Preparation: Supports both sequence-based and N-gram-based training (see the sketch after this list).
  • Multiple Language Models: Implements FFNN, RNN, and LSTM for comparison.
  • Next-Word Prediction: Predicts the most probable next words based on input.
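
As an illustration of the N-gram-based preparation, a minimal sketch that turns a token list into (context, target) pairs; the function and variable names here are hypothetical and not the repository's actual code:

    # Sketch of N-gram dataset preparation: each training example pairs the
    # N preceding tokens with the token that follows them.
    def build_ngram_pairs(tokens, n):
        pairs = []
        for i in range(n, len(tokens)):
            context = tokens[i - n:i]   # the N previous words
            target = tokens[i]          # the word to predict
            pairs.append((context, target))
        return pairs

    tokens = "it is a truth universally acknowledged".split()
    print(build_ngram_pairs(tokens, 3))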

Installation

  1. Clone the repository:

    git clone https://github.com/gojira69/Next-Word-Prediction.git
    cd Next-Word-Prediction
  2. Install dependencies:

    pip install torch numpy

Usage

Run the script interactively:

python generator.py

Follow the prompts:

  • Enter the path to the text corpus.
  • Choose the language model type (f for FFNN, r for RNN, l for LSTM).
  • Provide the number of next-word candidates.
  • Input a sentence to get predictions.

Model Details

Feedforward Neural Network (FFNN)

  • Uses word embeddings and dense layers.
  • Predicts the next word based on the N previous words (N = 3 or 5).
  • Implemented in the FFNN class (a minimal sketch follows).
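
A minimal PyTorch sketch of such a model, with assumed layer sizes; the repository's FFNN class may be structured differently:

    import torch
    import torch.nn as nn

    # Illustrative FFNN next-word model: embed the N context words, concatenate
    # their embeddings, and score every word in the vocabulary.
    class FFNN(nn.Module):
        def __init__(self, vocab_size, embed_dim=100, hidden_dim=256, n=3):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim)
            self.fc1 = nn.Linear(n * embed_dim, hidden_dim)
            self.fc2 = nn.Linear(hidden_dim, vocab_size)

        def forward(self, context):            # context: (batch, n) word indices
            emb = self.embedding(context)      # (batch, n, embed_dim)
            flat = emb.view(emb.size(0), -1)   # concatenate the N embeddings
            return self.fc2(torch.relu(self.fc1(flat)))  # next-word scores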

Recurrent Neural Network (RNN)

  • Uses recurrent layers to capture sequential patterns.
  • Trained using entire tokenized sentences.
  • Implemented in the RNN class.

Long Short-Term Memory (LSTM)

  • Improves RNN by handling long-range dependencies.
  • More effective for structured text.
  • Implemented in the LSTM class (a minimal sketch follows).
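
An illustrative PyTorch sketch of the recurrent models, again with assumed layer sizes (the repository's RNN and LSTM classes may differ); the plain RNN variant is the same with nn.RNN in place of nn.LSTM:

    import torch.nn as nn

    # Illustrative LSTM language model: embed the token sequence, run it through
    # an LSTM, and predict the next word at every position.
    class LSTM(nn.Module):
        def __init__(self, vocab_size, embed_dim=100, hidden_dim=256):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.fc = nn.Linear(hidden_dim, vocab_size)

        def forward(self, sequence):        # sequence: (batch, seq_len) indices
            emb = self.embedding(sequence)  # (batch, seq_len, embed_dim)
            out, _ = self.lstm(emb)         # hidden state at each time step
            return self.fc(out)             # next-word scores per position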

An analysis of the corpora themselves showed very different sentence-length distributions: the Ulysses corpus contained far more frequent and longer sentences, on the order of 10³ and 10⁴ tokens, than the Pride and Prejudice corpus.

(Figures: sentence-length distributions for the Pride and Prejudice and Ulysses corpora.)

Thus, for computational reasons, only sentences with fewer than 100 tokens were used for the RNN and LSTM models.
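
A filter of this kind might look like the following sketch (the exact cutoff handling in the repository may differ):

    MAX_TOKENS = 100  # cutoff used for the recurrent models

    # Keep only tokenized sentences shorter than the cutoff; `sentences` is a
    # list of token lists produced by the tokenization step.
    def filter_sentences(sentences, max_tokens=MAX_TOKENS):
        return [s for s in sentences if len(s) < max_tokens]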


File Structure

.
├── generator_clean.py   # Main script for training and prediction
├── pretrained_models/   # Directory to store pre-trained models
└── README.md            # Documentation

Pre-trained models are available here.


Example Prediction

Input: "It is a truth universally"
Output:
1. acknowledged (0.75)
2. known (0.12)
3. accepted (0.05)
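
Candidate lists like the one above are typically produced by taking a softmax over the model's output scores and keeping the k most probable words; a minimal sketch, with illustrative variable names:

    import torch

    # Illustrative top-k selection from a model's next-word scores (logits).
    # `logits` has shape (vocab_size,); `index_to_word` maps indices to strings.
    def top_k_predictions(logits, index_to_word, k=3):
        probs = torch.softmax(logits, dim=-1)
        values, indices = torch.topk(probs, k)
        return [(index_to_word[i.item()], p.item()) for i, p in zip(indices, values)]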

Hyperparameter Tuning

For hyperparameter tuning, it is advisable to use the Python notebooks for the three neural architectures to try different hyperparameters. Logging, model saving, etc. are all implemented.

Future Work

  • Support for more complex architectures like Transformer-based models.
  • Addition of more diverse text corpora.
  • Optimization for larger datasets.
