Gradient Checkpointing, Serialized Tokenizers, and an Initial Implementation of Schema Generation
0.4.0 is a big release! The release of transformers 4.3.0 caused a breaking issue that required a more prompt release than planned; a 0.4.1 bug-fix release is likely, and new documentation is in progress as well. I have demos of the new features planned too!
Update transformers and pytorch-lightning
The minimum version of `transformers` has been bumped to 4.3.0, which includes a number of performance improvements such as faster GPT-2 training and generation. Fast tokenizers are now the default package-wide as well. `pytorch-lightning` was bumped to 1.2.0, though that is less exciting.
`tokenizers` was removed as a dependency since `transformers` pins its own. Speaking of...
Serialized custom tokenizers
By default, `train_tokenizer()` will create a serialized, one-file tokenizer (e.g. `aitextgen.tokenizer.json`). This file also correctly supports the `added_tokens` parameter.
You can load the tokenizer when loading an aitextgen model with the `tokenizer_file` param. (You can still use the `merges_file` and `vocab_file` params if you have them.)
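A minimal sketch of the workflow, assuming a plain-text training file and illustrative values for `vocab_size` and the added token (the `GPT2ConfigCPU` helper is used here only to build a small model around the custom vocabulary):

```python
from aitextgen import aitextgen
from aitextgen.tokenizers import train_tokenizer
from aitextgen.utils import GPT2ConfigCPU

# Train a custom tokenizer on a text file; by default this now writes a single
# serialized file (aitextgen.tokenizer.json) to the current directory.
# The vocab_size and added_tokens values are illustrative.
train_tokenizer("input.txt", vocab_size=5000, added_tokens=["<|field|>"])

# Load a small, CPU-friendly model that uses the serialized tokenizer.
ai = aitextgen(
    tokenizer_file="aitextgen.tokenizer.json",
    config=GPT2ConfigCPU(),
)
```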
Gradient checkpointing + Layer freezing
Gradient checkpointing is now supported for GPT-2 models, allowing finetuning of larger models such as the 355M and 774M GPT-2 variants! In theory, this also enables the 1558M model to be finetuned. I also added the ability to freeze layers so the model can be trained within VRAM constraints, but the results are mixed; more analysis will be done.
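A hedged sketch of how this might look; the `gradient_checkpointing` and `num_layers_freeze` parameter names are assumptions for illustration, so check the documentation for the exact API:

```python
from aitextgen import aitextgen

# Load the 355M GPT-2 model with gradient checkpointing enabled, trading extra
# compute for a smaller activation-memory footprint during training.
# (The gradient_checkpointing parameter name is an assumption.)
ai = aitextgen(tf_gpt2="355M", gradient_checkpointing=True)

# Finetune while freezing some of the lower layers to stay within VRAM limits.
# (num_layers_freeze is an assumed parameter name; the value is illustrative.)
ai.train("input.txt", num_steps=5000, num_layers_freeze=12)
```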
Schema-based generation
A draft implementation of schema-based generation (leveraging the new custom tokenizers) is included.
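The API for this is still a draft, but the general idea is that schema fields map to special tokens in the custom tokenizer. A hypothetical sketch, where the field tokens themselves are purely illustrative:

```python
from aitextgen.tokenizers import train_tokenizer

# Hypothetical schema field tokens (names are illustrative, not part of the API).
schema_tokens = ["<|title|>", "<|body|>"]

# Adding the schema tokens to the custom serialized tokenizer ensures each
# field marker is encoded as a single token during training and generation.
train_tokenizer("input.txt", vocab_size=5000, added_tokens=schema_tokens)
```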
Misc bug fixes
- Fixed the TensorFlow weights URL
- Fixed an issue where the too-long-prompt assert checked the prompt's character length instead of its token length (#90)
- Worked around a breaking issue in transformers 4.3.0 by moving special-token stripping into aitextgen instead of the tokenizer (#90)
- Added an `lstrip` param to generation, which strips all whitespace at the beginning of generated text (related to the point above; see the sketch after this list)
- The training refresh rate is now every 20 steps by default (for better performance in Jupyter/Colab)
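A minimal sketch of the `lstrip` behavior during generation; the prompt and other parameter values here are illustrative:

```python
from aitextgen import aitextgen

# Load the default GPT-2 model.
ai = aitextgen()

# With lstrip=True, any whitespace at the start of the generated text is
# stripped before the result is returned/printed.
ai.generate(n=1, prompt="The meaning of life is", max_length=50, lstrip=True)
```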