A project that reads a given file and uses a neural network to generate text that looks like it came from the book.
The built-in models are an MLP, a WaveNet-inspired hierarchical MLP, and a GPT network, all implemented from scratch in pure PyTorch, along with a batch normalization layer and Kaiming initialization.
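For reference, Kaiming initialization and a from-scratch batch normalization layer in PyTorch usually look something like the sketch below. This is a simplified illustration, not the exact classes defined in this repository's layers.py:

```python
import torch

class Linear:
    """Minimal linear layer with Kaiming (He) initialization."""
    def __init__(self, fan_in, fan_out, bias=True):
        # Kaiming init: scale the weights by 1/sqrt(fan_in) so activations
        # keep roughly unit variance as they flow through the network.
        self.weight = torch.randn(fan_in, fan_out) / fan_in**0.5
        self.bias = torch.zeros(fan_out) if bias else None

    def __call__(self, x):
        self.out = x @ self.weight
        if self.bias is not None:
            self.out = self.out + self.bias
        return self.out


class BatchNorm1d:
    """Minimal batch normalization layer (training-time batch statistics only)."""
    def __init__(self, dim, eps=1e-5):
        self.eps = eps
        self.gamma = torch.ones(dim)  # learnable scale
        self.beta = torch.zeros(dim)  # learnable shift

    def __call__(self, x):
        mean = x.mean(0, keepdim=True)
        var = x.var(0, keepdim=True)
        xhat = (x - mean) / torch.sqrt(var + self.eps)  # normalize each feature
        self.out = self.gamma * xhat + self.beta
        return self.out
```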
Thanks to Andrej Karpathy for his great course on deep learning.
Supported file types at the moment:
- TXT
You can try the project out by cloning the git repository:
git clone https://github.com/alperiox/bookbot.git
Then install the Poetry environment and move on to the next steps.
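Assuming the repository ships a standard pyproject.toml, the usual Poetry workflow is roughly:

```bash
git clone https://github.com/alperiox/bookbot.git
cd bookbot
poetry install   # creates the virtual environment and installs the dependencies
# either activate the environment (e.g. `poetry shell`) or prefix commands with `poetry run`:
poetry run python main.py --file=romeo-and-juliet.txt --model gpt --max_steps 100
```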
Simply run main.py with the arguments described below.
You can start training with a command like the following:
python main.py --file=romeo-and-juliet.txt --model gpt --max_steps 100
Or, if you want more control over the whole training, use a more detailed configuration based on the arguments in the table below (an example command follows the table):
| Argument | Default Value | Description |
|---|---|---|
| train_ratio | 0.8 | Ratio of the input data used for training |
| file | - | Path to the PDF/TXT file |
| n_embed | 15 | Dimension of the embedding vectors |
| n_hidden | 400 | Hidden layer dimension (the hidden layers are defined as n_hidden x n_hidden) |
| block_size | 10 | Block size used to build the dataset; this is the context window in this project |
| batch_size | 32 | Number of samples processed in one step |
| epochs | 10 | Number of epochs to train the model |
| lr | 0.001 | Learning rate used to update the weights |
| generate | False | Enables generation mode, which generates text with a pre-trained model, so you should train a model first |
| max_new_tokens | 100 | Number of tokens to generate when the generate flag is active |
| model | gpt | Model to train: hierarchical MLP (hmlp), MLP (mlp), or GPT (gpt) |
| n_consecutive | 2 | Number of consecutive tokens to concatenate in the hierarchical model |
| n_layers | 4 | Number of processor blocks in the model; see the models in layers.py for more details on how it is used |
| num_heads | 3 | Number of self-attention heads in the multi-head self-attention layer of the GPT implementation |
| num_blocks | 2 | Number of layer blocks for the chosen model: sequential linear blocks for the MLP and hierarchical MLP, DecoderTransformerBlocks for GPT |
| context | None | The context for text generation; try to use a context longer than block_size (required if generate is True) |
| device | cpu | The device to train the models on; available values are mps, cpu and cuda |
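For example, a more detailed run might look like the command below. The values are purely illustrative, and it's assumed that each argument in the table maps to a `--flag` of the same name, as in the examples above:

```bash
python main.py --file=romeo-and-juliet.txt --model gpt \
    --n_embed 64 --block_size 32 --batch_size 64 \
    --epochs 20 --lr 0.001 --num_heads 4 --num_blocks 4 --device cuda
```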
Training produces several artifacts and saves them in the artifacts directory. The saved artifacts include the model, the data loaders, the losses recorded during training, and the tokenizer that holds the constructed character-level vocabulary.
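As a rough illustration of how the saved artifacts could be reused afterwards (the file names below are hypothetical; check the artifacts directory for the actual names main.py writes):

```python
import torch
import matplotlib.pyplot as plt

# Hypothetical paths -- the real file names are whatever the training run
# saved into the artifacts/ directory.
losses = torch.load("artifacts/losses.pt", map_location="cpu")

plt.plot(losses)           # how the training loss evolved over the steps
plt.xlabel("step")
plt.ylabel("loss")
plt.show()
```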
You can generate text only after training a model, because the generation pipeline makes use of the saved artifacts. To start generation, pass the generate flag:
python main.py --generate --context="Juliet," --max_new_tokens=100
>>> juliet, and have know lie thee why!
Generation runs until the requested length (max_new_tokens characters, since the vocabulary is character-level) is reached.
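Under the hood, character-level generation of this kind typically follows a standard autoregressive sampling loop like the sketch below. This is a generic illustration rather than the project's exact code, and it assumes the model returns logits of shape (batch, time, vocab_size):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, block_size, max_new_tokens):
    """Sample max_new_tokens token indices, feeding the model at most
    block_size tokens of context at every step."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]            # crop to the context window
        logits = model(idx_cond)                   # (B, T, vocab_size), assumed output shape
        logits = logits[:, -1, :]                  # keep only the last time step
        probs = F.softmax(logits, dim=-1)          # logits -> probabilities
        next_idx = torch.multinomial(probs, 1)     # sample the next token
        idx = torch.cat([idx, next_idx], dim=1)    # append and continue
    return idx
```

The tokenizer saved in the artifacts would then decode the sampled indices back into characters.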
- Implement debugging tools to analyze the neural network's training performance (mostly useful graphs and statistics):
  - Graphs of the layer outputs' distributions (with mean, std, and the distribution plot)
  - Graphs to check the gradient flow:
    - Layer gradient means
    - Layer gradient stds
    - Update-to-data ratio: the learning rate times the layer gradient's std, divided by the parameters' std. The ratio grows when the gradient std is large (the gradients vary a lot around their mean) and the parameters are comparatively small (see the sketch after this list).
    - Layer gradient distributions (with mean, std, and the distribution plot)
    - Ratio of a layer's gradient to its input: if this ratio is too high, the gradients are too large with respect to the input, whereas we want steady but smaller updates throughout the network so we don't miss any local minima.
    - Ratio of the amount of change vs. the weights; the stats should be saved in L7 here
  - A summary of the training run
- More modeling options such as LSTMs, RNNs, and Transformer-based architectures:
  - WaveNet? (the hierarchical architecture is implemented)
  - GPT
  - GPT-2
- A GPT tokenizer implementation to further improve the generation quality.
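The update-to-data ratio mentioned in the debugging list above takes only a few lines to track inside a training loop; a rough sketch (assuming a plain PyTorch model whose gradients have just been computed, not the project's existing code) could look like this:

```python
import torch

def update_to_data_ratios(model, lr):
    """Return (lr * grad.std()) / param.std() for every weight matrix.
    A common rule of thumb is that values around 1e-3 indicate reasonably
    sized updates; much larger or smaller values hint at a bad learning rate."""
    ratios = {}
    for name, p in model.named_parameters():
        if p.grad is None or p.ndim < 2:   # track weight matrices only
            continue
        ratios[name] = (lr * p.grad.std() / (p.std() + 1e-8)).item()
    return ratios
```

Calling this after loss.backward() at every step and plotting the values over time gives the kind of graph described above.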
While I'm open to new feature ideas, please let me do the coding part, since I'm trying to improve my overall understanding; that said, I'd love to receive feature requests. You can reach me on Discord (@alperiox) or by e-mail (alper_balbay@hacettepe.edu.tr).