Gradient Checkpointing, Serialized Tokenizers, and an Initial Implementation of Schema Generation
0.4.0 is a big release! The release of transformers 4.3.0 caused a breaking issue that required a more prompt release than planned; a 0.4.1 bug-fix release is likely, and new documentation is in progress as well. I have demos of the new features planned too!
Update transformers and pytorch-lightning
The minimum version of `transformers` has been bumped to 4.3.0, which includes a number of performance improvements such as faster GPT-2 training and generation. Fast tokenizers are now the default package-wide as well. `pytorch-lightning` was bumped to 1.2.0, though that is less exciting.
`tokenizers` was removed as a dependency since `transformers` pins its own. Speaking of...
Serialized custom tokenizers
By default, `train_tokenizer()` will create a serialized, one-file tokenizer (e.g. `aitextgen.tokenizer.json`). This file also correctly supports the `added_tokens` parameter.
You can load the tokenizer when loading an aitextgen model with the `tokenizer_file` param. (You can still use the `merges_file` and `vocab_file` params if you have them.)
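A minimal sketch of the workflow, assuming a plain-text training file and illustrative values for `vocab_size` and the added token (the `GPT2ConfigCPU` helper is used here only to build a small model around the custom vocabulary):

```python
from aitextgen import aitextgen
from aitextgen.tokenizers import train_tokenizer
from aitextgen.utils import GPT2ConfigCPU

# Train a custom tokenizer on a text file; by default this now writes a single
# serialized file (aitextgen.tokenizer.json) to the current directory.
# The vocab_size and added_tokens values are illustrative.
train_tokenizer("input.txt", vocab_size=5000, added_tokens=["<|field|>"])

# Load a small, CPU-friendly model that uses the serialized tokenizer.
ai = aitextgen(
    tokenizer_file="aitextgen.tokenizer.json",
    config=GPT2ConfigCPU(),
)
```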
Gradient checkpointing + Layer freezing
Gradient checkpointing is now supported for GPT-2 models, allowing finetuning of larger models such as the 355M and 774M GPT-2 variants! In theory, this also enables the 1558M model to be finetuned. I also added the ability to freeze layers so the model can be trained within VRAM constraints, but the results are mixed; more analysis will be done.
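A hedged sketch of how this might look; the `gradient_checkpointing` and `num_layers_freeze` parameter names are assumptions for illustration, so check the documentation for the exact API:

```python
from aitextgen import aitextgen

# Load the 355M GPT-2 model with gradient checkpointing enabled, trading extra
# compute for a smaller activation-memory footprint during training.
# (The gradient_checkpointing parameter name is an assumption.)
ai = aitextgen(tf_gpt2="355M", gradient_checkpointing=True)

# Finetune while freezing some of the lower layers to stay within VRAM limits.
# (num_layers_freeze is an assumed parameter name; the value is illustrative.)
ai.train("input.txt", num_steps=5000, num_layers_freeze=12)
```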
Schema-based generation
A draft implementation of schema-based generation (leveraging the new custom tokenizers) is included.
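The API for this is still a draft, but the general idea is that schema fields map to special tokens in the custom tokenizer. A hypothetical sketch, where the field tokens themselves are purely illustrative:

```python
from aitextgen.tokenizers import train_tokenizer

# Hypothetical schema field tokens (names are illustrative, not part of the API).
schema_tokens = ["<|title|>", "<|body|>"]

# Adding the schema tokens to the custom serialized tokenizer ensures each
# field marker is encoded as a single token during training and generation.
train_tokenizer("input.txt", vocab_size=5000, added_tokens=schema_tokens)
```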
Misc bug fixes
- Fixed the TensorFlow weights URL
- Fixed an issue where the too-long-prompt assert checked the prompt's character length instead of its token length (#90)
- Worked around a breaking issue in transformers 4.3.0 by moving special-token stripping into aitextgen instead of the tokenizer (#90)
- Added an `lstrip` param to generation, which strips all whitespace at the beginning of generated text (related to the point above; see the sketch after this list)
- The training refresh rate is now every 20 steps by default (for better performance in Jupyter/Colab)
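A minimal sketch of the `lstrip` behavior during generation; the prompt and other parameter values here are illustrative:

```python
from aitextgen import aitextgen

# Load the default GPT-2 model.
ai = aitextgen()

# With lstrip=True, any whitespace at the start of the generated text is
# stripped before the result is returned/printed.
ai.generate(n=1, prompt="The meaning of life is", max_length=50, lstrip=True)
```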