This project implements a neural network model that predicts the next words in a sequence, enabling it to generate text that continues an input seed text.
This neural network (NN) predicts the words that follow an incomplete sentence. It accepts a phrase and continues it for as long as needed, inserting appropriate punctuation. The purposes of this NN are:
- To test whether an NN can easily fool software designed to detect AI-generated text.
- To serve as a simple, illustrative example of a natural language processing NN.
- To entertain: the NN produces funny stories that make you wonder whether they are real.
Here is an example of an NN-generated story:
In the early twentieth century, it was suggested that to develop a consistent understanding of the fundamental concepts of mathematics, it was sufficient to study observation. For example, a single electron in an unexcited atom is classically depicted as a particle moving in a circular path around the atomic nucleus, whereas in quantum mechanics it is described by a static wave function surrounding the nucleus. For example, the electron wave function for an unexcited hydrogen atom is a spherically symmetric function known as the s-orbital (Fig.
The model is trained on a text corpus (generated by `Extract_wiki_text_content.py`), tokenized, and converted into numerical sequences for learning. The architecture uses embeddings, LSTMs, and feed-forward layers. Note that the NN uses its own tokenizer instead of the `nltk` package, so potential users can inspect its machinery.
- Corpus Loading: The dataset created by `Extract_wiki_text_content.py` is loaded using Python's `pickle` module.
- Tokenizer:
  - A custom tokenizer preprocesses text by adding spaces around punctuation and mapping words to unique indices.
  - The `Tokenizer` class includes methods to preprocess text, fit the tokenizer on a corpus, and convert text to sequences of indices.
- TextDataset:
  - Converts the tokenized corpus into input-output pairs for training. For each sequence, n-gram subsequences are created in which a portion of the sequence serves as input and the subsequent tokens are the prediction targets.
  - The dataset supports multi-word prediction through a `predict_steps` parameter.
- DataLoader:
  - Handles batching, shuffling, and padding of sequences so that batches can be processed efficiently by the model. A custom `collate_fn` function is used for padding. A combined sketch of the tokenizer, dataset, and data loader follows this list.
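The exact implementation lives in the project scripts; below is a minimal sketch of how the custom tokenizer, the `TextDataset`, and the padding `collate_fn` could fit together. The class and parameter names (`Tokenizer`, `TextDataset`, `predict_steps`, `collate_fn`) follow the description above, but the method names, the punctuation set, and the padding convention (index 0) are illustrative assumptions rather than the project's actual code.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset, DataLoader


class Tokenizer:
    """Minimal word-level tokenizer; index 0 is reserved for padding (assumption)."""

    def __init__(self):
        self.word_index = {}
        self.index_word = {}

    def preprocess(self, text):
        # Add spaces around punctuation so each mark becomes its own token.
        for p in ".,!?;:()":
            text = text.replace(p, f" {p} ")
        return text.lower().split()

    def fit(self, corpus):
        # Map every unique word in the corpus to an integer index.
        for text in corpus:
            for word in self.preprocess(text):
                if word not in self.word_index:
                    idx = len(self.word_index) + 1
                    self.word_index[word] = idx
                    self.index_word[idx] = word

    def texts_to_sequences(self, corpus):
        return [[self.word_index[w] for w in self.preprocess(t) if w in self.word_index]
                for t in corpus]


class TextDataset(Dataset):
    """Builds n-gram input/target pairs; the last `predict_steps` tokens are the targets."""

    def __init__(self, sequences, predict_steps=2):
        self.samples = []
        for seq in sequences:
            for i in range(1, len(seq) - predict_steps + 1):
                self.samples.append((seq[:i], seq[i:i + predict_steps]))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        inp, target = self.samples[idx]
        return torch.tensor(inp, dtype=torch.long), torch.tensor(target, dtype=torch.long)


def collate_fn(batch):
    """Zero-pad the variable-length inputs so each batch forms a rectangular tensor."""
    inputs, targets = zip(*batch)
    inputs = pad_sequence(inputs, batch_first=True, padding_value=0)
    targets = torch.stack(targets)
    return inputs, targets


# Hypothetical usage:
# tokenizer = Tokenizer(); tokenizer.fit(corpus)
# dataset = TextDataset(tokenizer.texts_to_sequences(corpus), predict_steps=2)
# loader = DataLoader(dataset, batch_size=64, shuffle=True, collate_fn=collate_fn)
```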
The `NextWordPredictor` model is designed to handle multi-word predictions and consists of the following components (a sketch of the architecture follows the list):
- Embedding Layer:
  - Converts input tokens into dense vector representations of size `embed_size`.
- LSTM:
  - A two-layer LSTM processes the input embeddings, capturing temporal dependencies in the sequence.
- Layer Normalization:
  - Normalizes the LSTM's final hidden state for improved stability.
- Feed-Forward Layers:
  - A series of fully connected layers (optionally with BatchNorm) processes the hidden state to generate predictions.
- Final Linear Layer:
  - Outputs a tensor of shape `(batch_size, predict_steps, vocab_size)` containing predictions for multiple words.
- Custom Weight Initialization:
  - Xavier initialization is used for the weights, and biases are initialized to zero for better convergence.
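As a rough illustration, a model with these components could look like the sketch below. The hidden sizes, the feed-forward depth, and the choice to fold the per-step outputs into a single linear layer are assumptions, not the project's actual hyperparameters.

```python
import torch
import torch.nn as nn


class NextWordPredictor(nn.Module):
    """Sketch: embedding -> two-layer LSTM -> LayerNorm -> feed-forward -> per-step logits."""

    def __init__(self, vocab_size, embed_size=128, hidden_size=256,
                 ff_size=256, predict_steps=2):
        super().__init__()
        self.predict_steps = predict_steps
        self.vocab_size = vocab_size
        self.embedding = nn.Embedding(vocab_size, embed_size, padding_idx=0)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers=2, batch_first=True)
        self.norm = nn.LayerNorm(hidden_size)
        self.ff = nn.Sequential(nn.Linear(hidden_size, ff_size), nn.ReLU())
        # One block of logits per predicted step, folded into a single linear layer.
        self.out = nn.Linear(ff_size, predict_steps * vocab_size)
        self.apply(self._init_weights)

    @staticmethod
    def _init_weights(module):
        # Xavier initialization for linear weights, zeros for biases.
        if isinstance(module, nn.Linear):
            nn.init.xavier_uniform_(module.weight)
            nn.init.zeros_(module.bias)

    def forward(self, x):
        emb = self.embedding(x)               # (batch, seq_len, embed_size)
        _, (h_n, _) = self.lstm(emb)          # h_n: (num_layers, batch, hidden_size)
        hidden = self.norm(h_n[-1])           # last layer's final hidden state
        logits = self.out(self.ff(hidden))    # (batch, predict_steps * vocab_size)
        return logits.view(-1, self.predict_steps, self.vocab_size)
```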
A custom loss function (`multi_word_loss`) computes the average cross-entropy loss across the predicted steps; a sketch is shown below.
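The name `multi_word_loss` comes from the description above; a straightforward way to average cross-entropy over the steps (an assumed implementation, not necessarily the author's exact reduction) is:

```python
import torch.nn.functional as F


def multi_word_loss(predictions, targets):
    """Average cross-entropy over the predicted steps.

    predictions: (batch, predict_steps, vocab_size) logits
    targets:     (batch, predict_steps) token indices
    """
    steps = predictions.size(1)
    losses = [F.cross_entropy(predictions[:, i, :], targets[:, i]) for i in range(steps)]
    return sum(losses) / steps
```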
- The model generates text by recursively predicting the next tokens for a given seed text.
- Predictions are translated back into words using the tokenizer's `index_word` dictionary (see the sketch after this list).
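The `predict_sequence` function mentioned later in this README could be sketched as follows. It reuses the `Tokenizer` attributes from the sketch above (`preprocess`, `word_index`, `index_word`); greedy decoding, the `num_words` limit, and the `<unk>` fallback are illustrative assumptions.

```python
import torch


def predict_sequence(model, tokenizer, seed_text, num_words=50, device="cpu"):
    """Recursively predict the next tokens for a seed text and map them back to words."""
    model.eval()
    tokens = [tokenizer.word_index[w] for w in tokenizer.preprocess(seed_text)
              if w in tokenizer.word_index]
    words = seed_text.split()
    with torch.no_grad():
        while len(words) < num_words:
            inp = torch.tensor([tokens], dtype=torch.long, device=device)
            logits = model(inp)                           # (1, predict_steps, vocab_size)
            next_ids = logits.argmax(dim=-1)[0].tolist()  # greedy choice for each step
            for idx in next_ids:
                tokens.append(idx)
                words.append(tokenizer.index_word.get(idx, "<unk>"))
    return " ".join(words)
```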
- The training loop uses the Adam optimizer with a learning rate of `0.0001` and trains the model over multiple epochs.
- The average loss is logged for each epoch.
- Periodically, the model's performance is tested by generating text from sample seed phrases (a sketch of the loop follows this list).
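A minimal version of such a loop, assuming the `model`, `loader`, `tokenizer`, `multi_word_loss`, and `predict_sequence` objects from the sketches above (the epoch count, sampling interval, seed phrase, and checkpoint file name are placeholders), might look like this:

```python
import torch

num_epochs = 20  # placeholder
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)

for epoch in range(num_epochs):
    model.train()
    total_loss = 0.0
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = multi_word_loss(model(inputs), targets)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"epoch {epoch + 1}: average loss {total_loss / len(loader):.4f}")

    # Every few epochs, sample a continuation to check the model qualitatively
    # and save a checkpoint.
    if (epoch + 1) % 5 == 0:
        print(predict_sequence(model, tokenizer, "In the early twentieth century"))
        torch.save(model.state_dict(), f"checkpoint_epoch_{epoch + 1}.pt")
```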
Run the provided script to:
- Load the dataset.
- Train the model on the dataset.
- Periodically save checkpoints and generate text predictions.
After training, you can generate new stories by providing a seed text to the `predict_sequence` function.
- Handles multi-word predictions.
- Customizable architecture: embedding size, LSTM size, feed-forward layers, and more can be adjusted.
- Flexible custom tokenizer with built-in text preprocessing.
- Trains efficiently using `DataLoader` with padding support.
- Python 3.8+
- PyTorch
- tqdm
- scikit-learn
- numpy
- Anaconda (recommended for managing the environment)
Save the following YAML file as `environment.yml`, then run `conda env create -f environment.yml` to create the environment:
```yaml
name: story_gen
channels:
  - pytorch
  - defaults
  - conda-forge
dependencies:
  - python=3.8
  - pytorch=1.10
  - torchvision
  - torchaudio
  - cudatoolkit=11.3  # CUDA runtime that pairs with pytorch 1.10; drop for a CPU-only setup
  - tqdm
  - scikit-learn
  - numpy
  - pyyaml
  - pip
  - pip:
      - wikipedia-api
```
Then activate the environment:

```bash
conda activate story_gen
```
Once the environment is activated, you can run the script:
```bash
python story_telling_nn.py
```