This project implements a neural network model that predicts the next words in a sequence, enabling it to generate text that continues an input seed text.
This neural network (NN) predicts the words that follow an incomplete sentence. It accepts a phrase and continues it for as long as needed, inserting appropriate punctuation. The purposes of this NN are:
- To test whether an NN can easily fool software designed to detect AI-generated text.
- To serve as a simple, illustrative example of a natural language processing NN.
- To entertain: the NN produces funny stories that make you wonder whether they are real.
Here is an example of an NN-generated story:
In the early twentieth century, it was suggested that to develop a consistent understanding of the fundamental concepts of mathematics, it was sufficient to study observation. For example, a single electron in an unexcited atom is classically depicted as a particle moving in a circular path around the atomic nucleus, whereas in quantum mechanics it is described by a static wave function surrounding the nucleus. For example, the electron wave function for an unexcited hydrogen atom is a spherically symmetric function known as the s-orbital (Fig.
The model is trained on a text corpus (generated by `Extract_wiki_text_content.py`), tokenized, and converted into numerical sequences for learning. The architecture uses embeddings, LSTMs, and feed-forward layers. Note that the NN uses its own tokenizer instead of the `nltk` package, so potential users can inspect its machinery.
- Corpus Loading: The dataset created by `Extract_wiki_text_content.py` is loaded using Python's `pickle` module.
- Tokenizer:
  - A custom tokenizer preprocesses text by adding spaces around punctuation and mapping words to unique indices.
  - The `Tokenizer` class includes methods to preprocess text, fit the tokenizer on a corpus, and convert text to sequences of indices.
- TextDataset:
  - Converts the tokenized corpus into input-output pairs for training. For each sequence, n-gram subsequences are created in which a portion of the sequence serves as input and the subsequent tokens are the prediction targets.
  - The dataset supports multi-word prediction through a `predict_steps` parameter.
- DataLoader:
  - Handles batching, shuffling, and padding of sequences so that batches can be processed efficiently by the model. A custom `collate_fn` function is used for padding. A combined sketch of the tokenizer, dataset, and data loader follows this list.
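The exact implementation lives in the project scripts; below is a minimal sketch of how the custom tokenizer, the `TextDataset`, and the padding `collate_fn` could fit together. The class and parameter names (`Tokenizer`, `TextDataset`, `predict_steps`, `collate_fn`) follow the description above, but the method names, the punctuation set, and the padding convention (index 0) are illustrative assumptions rather than the project's actual code.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset, DataLoader


class Tokenizer:
    """Minimal word-level tokenizer; index 0 is reserved for padding (assumption)."""

    def __init__(self):
        self.word_index = {}
        self.index_word = {}

    def preprocess(self, text):
        # Add spaces around punctuation so each mark becomes its own token.
        for p in ".,!?;:()":
            text = text.replace(p, f" {p} ")
        return text.lower().split()

    def fit(self, corpus):
        # Map every unique word in the corpus to an integer index.
        for text in corpus:
            for word in self.preprocess(text):
                if word not in self.word_index:
                    idx = len(self.word_index) + 1
                    self.word_index[word] = idx
                    self.index_word[idx] = word

    def texts_to_sequences(self, corpus):
        return [[self.word_index[w] for w in self.preprocess(t) if w in self.word_index]
                for t in corpus]


class TextDataset(Dataset):
    """Builds n-gram input/target pairs; the last `predict_steps` tokens are the targets."""

    def __init__(self, sequences, predict_steps=2):
        self.samples = []
        for seq in sequences:
            for i in range(1, len(seq) - predict_steps + 1):
                self.samples.append((seq[:i], seq[i:i + predict_steps]))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        inp, target = self.samples[idx]
        return torch.tensor(inp, dtype=torch.long), torch.tensor(target, dtype=torch.long)


def collate_fn(batch):
    """Zero-pad the variable-length inputs so each batch forms a rectangular tensor."""
    inputs, targets = zip(*batch)
    inputs = pad_sequence(inputs, batch_first=True, padding_value=0)
    targets = torch.stack(targets)
    return inputs, targets


# Hypothetical usage:
# tokenizer = Tokenizer(); tokenizer.fit(corpus)
# dataset = TextDataset(tokenizer.texts_to_sequences(corpus), predict_steps=2)
# loader = DataLoader(dataset, batch_size=64, shuffle=True, collate_fn=collate_fn)
```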
The `NextWordPredictor` model is designed to handle multi-word predictions and consists of the following components (a sketch of the architecture follows the list):
- Embedding Layer:
  - Converts input tokens into dense vector representations of size `embed_size`.
- LSTM:
  - A two-layer LSTM processes the input embeddings, capturing temporal dependencies in the sequence.
- Layer Normalization:
  - Normalizes the LSTM's final hidden state for improved stability.
- Feed-Forward Layers:
  - A series of fully connected layers (optionally with BatchNorm) processes the hidden state to generate predictions.
- Final Linear Layer:
  - Outputs a tensor of shape `(batch_size, predict_steps, vocab_size)` containing predictions for multiple words.
- Custom Weight Initialization:
  - Xavier initialization is used for the weights, and biases are initialized to zero for better convergence.
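As a rough illustration, a model with these components could look like the sketch below. The hidden sizes, the feed-forward depth, and the choice to fold the per-step outputs into a single linear layer are assumptions, not the project's actual hyperparameters.

```python
import torch
import torch.nn as nn


class NextWordPredictor(nn.Module):
    """Sketch: embedding -> two-layer LSTM -> LayerNorm -> feed-forward -> per-step logits."""

    def __init__(self, vocab_size, embed_size=128, hidden_size=256,
                 ff_size=256, predict_steps=2):
        super().__init__()
        self.predict_steps = predict_steps
        self.vocab_size = vocab_size
        self.embedding = nn.Embedding(vocab_size, embed_size, padding_idx=0)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers=2, batch_first=True)
        self.norm = nn.LayerNorm(hidden_size)
        self.ff = nn.Sequential(nn.Linear(hidden_size, ff_size), nn.ReLU())
        # One block of logits per predicted step, folded into a single linear layer.
        self.out = nn.Linear(ff_size, predict_steps * vocab_size)
        self.apply(self._init_weights)

    @staticmethod
    def _init_weights(module):
        # Xavier initialization for linear weights, zeros for biases.
        if isinstance(module, nn.Linear):
            nn.init.xavier_uniform_(module.weight)
            nn.init.zeros_(module.bias)

    def forward(self, x):
        emb = self.embedding(x)               # (batch, seq_len, embed_size)
        _, (h_n, _) = self.lstm(emb)          # h_n: (num_layers, batch, hidden_size)
        hidden = self.norm(h_n[-1])           # last layer's final hidden state
        logits = self.out(self.ff(hidden))    # (batch, predict_steps * vocab_size)
        return logits.view(-1, self.predict_steps, self.vocab_size)
```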
A custom loss function (`multi_word_loss`) computes the average cross-entropy loss across the predicted steps; a sketch is shown below.
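The name `multi_word_loss` comes from the description above; a straightforward way to average cross-entropy over the steps (an assumed implementation, not necessarily the author's exact reduction) is:

```python
import torch.nn.functional as F


def multi_word_loss(predictions, targets):
    """Average cross-entropy over the predicted steps.

    predictions: (batch, predict_steps, vocab_size) logits
    targets:     (batch, predict_steps) token indices
    """
    steps = predictions.size(1)
    losses = [F.cross_entropy(predictions[:, i, :], targets[:, i]) for i in range(steps)]
    return sum(losses) / steps
```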
- The model generates text by recursively predicting the next tokens for a given seed text.
- Predictions are translated back into words using the tokenizer's `index_word` dictionary (see the sketch after this list).
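The `predict_sequence` function mentioned later in this README could be sketched as follows. It reuses the `Tokenizer` attributes from the sketch above (`preprocess`, `word_index`, `index_word`); greedy decoding, the `num_words` limit, and the `<unk>` fallback are illustrative assumptions.

```python
import torch


def predict_sequence(model, tokenizer, seed_text, num_words=50, device="cpu"):
    """Recursively predict the next tokens for a seed text and map them back to words."""
    model.eval()
    tokens = [tokenizer.word_index[w] for w in tokenizer.preprocess(seed_text)
              if w in tokenizer.word_index]
    words = seed_text.split()
    with torch.no_grad():
        while len(words) < num_words:
            inp = torch.tensor([tokens], dtype=torch.long, device=device)
            logits = model(inp)                           # (1, predict_steps, vocab_size)
            next_ids = logits.argmax(dim=-1)[0].tolist()  # greedy choice for each step
            for idx in next_ids:
                tokens.append(idx)
                words.append(tokenizer.index_word.get(idx, "<unk>"))
    return " ".join(words)
```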
- The training loop uses the Adam optimizer with a learning rate of `0.0001` and trains the model over multiple epochs.
- The average loss is logged for each epoch.
- Periodically, the model's performance is tested by generating text from sample seed phrases (a sketch of the loop follows this list).
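A minimal version of such a loop, assuming the `model`, `loader`, `tokenizer`, `multi_word_loss`, and `predict_sequence` objects from the sketches above (the epoch count, sampling interval, seed phrase, and checkpoint file name are placeholders), might look like this:

```python
import torch

num_epochs = 20  # placeholder
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)

for epoch in range(num_epochs):
    model.train()
    total_loss = 0.0
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = multi_word_loss(model(inputs), targets)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"epoch {epoch + 1}: average loss {total_loss / len(loader):.4f}")

    # Every few epochs, sample a continuation to check the model qualitatively
    # and save a checkpoint.
    if (epoch + 1) % 5 == 0:
        print(predict_sequence(model, tokenizer, "In the early twentieth century"))
        torch.save(model.state_dict(), f"checkpoint_epoch_{epoch + 1}.pt")
```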
Run the provided script to:
- Load the dataset.
- Train the model on the dataset.
- Periodically save checkpoints and generate text predictions.
After training, you can generate new stories by providing a seed text to the `predict_sequence` function.
- Handles multi-word predictions.
- Customizable architecture: embedding size, LSTM size, feed-forward layers, and more can be adjusted.
- Flexible custom tokenizer with built-in text preprocessing.
- Trains efficiently using `DataLoader` with padding support.
- Python 3.8+
- PyTorch
- tqdm
- scikit-learn
- numpy
- Anaconda (recommended for managing the environment)
Save the following YAML file as `environment.yml`, then run `conda env create -f environment.yml` to create the environment:
```yaml
name: story_gen
channels:
  - pytorch
  - defaults
  - conda-forge
dependencies:
  - python=3.8
  - pytorch=1.10
  - torchvision
  - torchaudio
  - cudatoolkit=11.3  # CUDA runtime that pairs with pytorch 1.10; drop for a CPU-only setup
  - tqdm
  - scikit-learn
  - numpy
  - pyyaml
  - pip
  - pip:
      - wikipedia-api
```
Then activate the environment:

```bash
conda activate story_gen
```
Once the environment is activated, you can run the script:
```bash
python story_telling_nn.py
```