Releases: minimaxir/aitextgen
v0.6.0: Fix pytorch-lightning version
Unfortunately, I've been keeping my head down working on a new version of the trainer, and I missed a deprecation in pytorch-lightning.
- Merged #191
- Bumped minimum version of pytorch-lightning to 1.7.0
Again, I aim to move to a HF-based trainer to avoid these deprecations.
v0.5.2: Fix dependency
pytorch-lightning deprecated a feature, which broke training; this is now fixed.
- Preliminary TPU support + more correct pytorch-lightning training (#105)
- Bumped minimum pytorch-lightning version to 1.3.1
- Fixes for schema generation
v0.5.1: Short-Form Generation Fixes
v0.5.1 of aitextgen fixes a long-standing generation bug for short-form content that had been inadvertently broken.
Short Form Generation
- When training, a new field is automatically written to the config: `line_by_line`, which indicates whether the source `TokenDataset` used was ingested `line_by_line` (e.g. a CSV file).
- When generating, if the loaded model config has `line_by_line=True`, then the generation will automatically prepend the text with the `bos_token` so the generation knows it's at the start of the text. This results in substantially better text generation quality.
If you have an older model trained on a `line_by_line` dataset, you can still use this workflow by making one of the following changes (a short sketch follows this list):
- Manually add `"line_by_line": true` to the `config.json` for the model.
- When the model is loaded, call `setattr(ai.model.config, "line_by_line", True)`.
- Set the new `prepend_bos` parameter of `generate()` to `True`.
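A minimal sketch of the second and third options, assuming a model saved locally with the default `aitextgen.tokenizer.json` tokenizer (the folder name and generation arguments are illustrative):

```python
from aitextgen import aitextgen

# Load an older model trained on a line_by_line dataset
# (folder and tokenizer file names are illustrative).
ai = aitextgen(model_folder="trained_model",
               tokenizer_file="aitextgen.tokenizer.json")

# Option 2: flag the loaded config so every generation prepends the bos_token.
setattr(ai.model.config, "line_by_line", True)
ai.generate(n=3, max_length=100)

# Option 3: leave the config untouched and prepend the bos_token per call.
ai.generate(n=3, max_length=100, prepend_bos=True)
```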
Misc fixes
- Improvements to generation w/ a schema so it works more correctly.
- Loading a tokenizer via `tokenizer_file` now uses the `PreTrainedTokenizerFast` class, which handles special tokens more correctly.
- Added a `skip_special_tokens` param to force printing of special tokens: good for debugging generation w/ a schema (see the sketch below).
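A hedged sketch of that debugging workflow, assuming `skip_special_tokens` is passed through `generate()` and defaults to `True` (the model folder and tokenizer names are illustrative):

```python
from aitextgen import aitextgen

ai = aitextgen(model_folder="trained_model",
               tokenizer_file="aitextgen.tokenizer.json")

# Keep special tokens (bos/eos, schema delimiters) in the printed output
# so you can inspect exactly what the model produced.
ai.generate(n=1, max_length=100, skip_special_tokens=False)
```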
v0.5.0: GPT Neo + misc fixes
aitextgen has been updated to support GPT Neo and to fix a few outstanding generation issues! However, the process introduced a few breaking changes.
Breaking Changes
Loading Models
While making model loading architecture-agnostic for GPT Neo support, it turned out aitextgen was loading models in an unofficial way, so this has now been addressed. The user must now specify the `model_folder` where the `pytorch_model.bin` and `config.json` are located (with those exact filenames).
Assuming the model is located in `trained_model`:
Old:
ai2 = aitextgen(model="trained_model/pytorch_model.bin",
tokenizer_file="aitextgen.tokenizer.json",
config="trained_model/config.json")
New:
ai2 = aitextgen(model_folder="trained_model",
tokenizer_file="aitextgen.tokenizer.json")
All notebooks and documentation have been updated with this new workflow, and an assert will be raised if the old behavior is used.
Incorrect tokenization for Colab-trained GPT-2 tokenizers.
There was an underlying issue due to a recent change in `tokenizers` which broke the implementation of the default GPT-2 tokenizer by preventing it from tokenizing `<|endoftext|>` tokens correctly. As a result, this broke truncation of generated texts at the `<|endoftext|>` token.
Only the case where the Colab GPT-2 Notebook was used for training line-by-line texts was affected by this; unfortunately, the only fix now is to retrain the model with v0.5.0.
Other Major Changes/Fixes
GPT Neo support
GPT Neo is now supported! The Colab Notebook was updated to indicate how to finetune the smaller versions of the model.
Out of the box, all variants of GPT Neo have a 2048-token context window (versus GPT-2’s 1024-token context length), allowing double the generation length, and the pretrained models are trained on much more recent data. Finetuning a GPT Neo model takes about 2x as long per step as a GPT-2 model, which is notable because increasing the context window normally causes training to scale quadratically rather than linearly; GPT Neo also appears to converge faster.
However, in terms of text-generation quality it’s currently unclear whether GPT Neo is “better”, especially on short-form content. Future releases of aitextgen will analyze this more closely.
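A minimal finetuning sketch, assuming the 125M GPT Neo checkpoint on the Hugging Face Hub; the training file name and hyperparameters are illustrative:

```python
from aitextgen import aitextgen

# Load the smallest pretrained GPT Neo variant from the Hugging Face Hub.
ai = aitextgen(model="EleutherAI/gpt-neo-125M", to_gpu=True)

# Finetune on a plain-text file; file name and step count are illustrative.
ai.train("input.txt", num_steps=3000, batch_size=1)

ai.generate(max_length=256)
```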
DeepSpeed support [BETA] (#103)
Thanks to the team at pytorch-lightning, DeepSpeed support has been added to aitextgen, allowing training of larger models (>1.5B params) across multiple GPUs. However, this isn’t fully tested, so more documentation is pending!
Misc changes
- Added a `nonempty_output` param to `generate()`, default `True`: if the output is empty (possible on short-form content), skip it if generating multiple texts, or try again if it's a single text. If `min_length` is specified, the same behavior occurs for texts below the minimum length after processing. (See the sketch after this list.)
- Bumped minimum versions of `transformers` and `pytorch-lightning`.
- Completed another pass of notebooks and documentation.
- Forced single-GPU training on Windows to avoid bugs (#116).
- Calling the aitextgen instance will now print the model type and number of params to the console, helpful for debugging.
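A hedged sketch of the `nonempty_output` behavior described in the first bullet, assuming an already-trained model in an illustrative local folder:

```python
from aitextgen import aitextgen

ai = aitextgen(model_folder="trained_model",
               tokenizer_file="aitextgen.tokenizer.json")

# With nonempty_output=True (the default), empty generations are skipped
# when generating multiple texts, or retried when generating a single text.
texts = ai.generate(n=5, max_length=30, return_as_list=True)

# Disable the behavior to see the raw outputs, empty strings included.
raw = ai.generate(n=5, max_length=30, return_as_list=True,
                  nonempty_output=False)
```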
v0.4.1: Misc Bug Fixes
- Fix CSV loading issue (#95)
- Fix regex for stripping whitespace starting a generated text. (#92)
- Fix an issue where the logger, when a custom tokenizer was used, reported that the default tokenizer was being used.
- Added a `special_tokens` param to allow the user to specify a list of token IDs to strip from the generated output (default: the `bos_token_id` and `eos_token_id`).
Gradient Checkpointing, Serialized Tokenizers, initial implementation of schema generation
0.4.0 is a big release! The release of `transformers` 4.3.0 caused a breaking issue which required a more prompt release; a bug-fix 0.4.1 release is likely, and new documentation is in progress as well. I also have demos of new features planned!
Update transformers and pytorch-lightning
The minimum version of transformers has been bumped to 4.3.0, which has a number of performance improvements such as faster GPT-2 training and generation. Fast tokenizers are now the default package-wide as well. pytorch-lightning was bumped to 1.2.0, albeit that is less exciting.
`tokenizers` was removed as a dependency since `transformers` pins its own. Speaking of...
Serialized custom tokenizers.
By default, `train_tokenizer()` will create a serialized, one-file tokenizer (e.g. `aitextgen.tokenizer.json`). This file also correctly supports the `added_tokens` parameter.
You can load the tokenizer when loading an aitextgen model with the `tokenizer_file` param. (You can still use the `merges_file` and `vocab_file` params if you have them.)
Gradient checkpointing + Layer freezing
Gradient checkpointing is now supported for GPT-2 models, allowing finetuning of larger models such as the 355M and 774M GPT-2 models!
This also enables the 1558M model to be finetuned, in theory. I also added the ability to freeze layers to allow the model to be trained within VRAM constraints, but the results are mixed; more analysis will be done.
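A hedged sketch of finetuning a larger GPT-2 with gradient checkpointing, assuming the `tf_gpt2` loading shortcut and a `gradient_checkpointing` constructor flag; the training file and step count are illustrative:

```python
from aitextgen import aitextgen

# Load the 355M GPT-2 with gradient checkpointing, which recomputes
# activations during the backward pass to trade compute for VRAM.
ai = aitextgen(tf_gpt2="355M", to_gpu=True, gradient_checkpointing=True)

ai.train("input.txt", num_steps=2000, batch_size=1)
```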
Schema-based generation
A draft implementation of schema-based generation (leveraging the new custom tokenizers) has been added.
Misc bug fixes
- Fix the TensorFlow weights URL
- Fixed an issue where prompt character length was used to check the too-long assert instead of prompt token length (#90)
- Worked around a breaking issue in transformers 4.3.0 by moving special token stripping into aitextgen instead of the tokenizer (#90)
- Added an `lstrip` param to generation, which strips all whitespace at the beginning of generated text (related to the point above)
- The refresh rate of training is now every 20 steps by default (for better performance in Jupyter/Colab)
Transformers 4.0.0 and pytorch-lightning 1.0.0 support
A release to fix breaking issues from both packages, with minor tweaks done in the meantime.
- Minimum versions are now `transformers>=4.0.0`, `pytorch-lightning>=1.0.8`, and `torch>=1.6.0`, with fixes for breaking issues in all those major versions.
- Tweaked generation to be more canonical with the newest implementation in transformers 4.
- Set the default refresh rate for training to 20 to make pytorch-lightning happy.
- Set the default learning rate for training to `1e-3`, since I forgot why it was `1e-4`.
- Set both the default vocab size for tokenizers and the CPU config vocab size to `1000` tokens from `5000`, since this allows much easier/faster training in the demo.
- Confirmed that setting `fp16=True` for GPU training with supported GPUs now works (see the sketch after this list).
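A hedged sketch of the `fp16` training path confirmed above, assuming `fp16` is a `train()` argument; the model size, file name, and step count are illustrative:

```python
from aitextgen import aitextgen

ai = aitextgen(tf_gpt2="124M", to_gpu=True)

# fp16=True enables mixed-precision training on GPUs that support it.
ai.train("input.txt", num_steps=1000, batch_size=1, fp16=True)
```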
Future releases will add more explicit features. There may be extra console output in the meantime; will see what I can do to remove those.
Remove optimizer_step() override
This fixes an issue (#44) causing training to fail due to a change in pytorch-lightning 0.8.4.
The override was only for testing; removing it is necessary for the upcoming native AMP in PyTorch 1.6 regardless.
Somehow, after this change, the model decreases in loss much faster; I may need to investigate whether the scheduler no longer works.
Cap transformers version
`transformers` 3.0.0 introduced some breaking changes, so the version is capped at less than that for now.
Fix deprecated training parameter
v0.2.1: Remove `disable_validation` param