Releases: minimaxir/aitextgen
v0.6.0: Fix pytorch-lightning version
Unfortunately, I've been keeping my head down working on a new version of the trainer, and I missed a deprecation in pytorch-lightning.
- Merged #191
- Bumped minimum version of pytorch-lightning to 1.7.0
Again, I aim to move to a HF-based trainer to avoid these deprecations.
v0.5.2: Fix dependency
pytorch-lightning deprecated a feature, which broke training; this is now fixed.
- Preliminary TPU support + more correct pytorch-lightning training (#105)
- Bumped minimum pytorch-lightning version to 1.3.1
- Fixes for schema generation
v0.5.1: Short-Form Generation Fixes
v0.5.1 of aitextgen fixes a long-standing generation bug for short-form content that had been inadvertently broken.
Short Form Generation
- When training, a new field is automatically written to the config: `line_by_line`, which indicates whether the source `TokenDataset` used was ingested `line_by_line` (e.g. a CSV file).
- When generating, if the loaded model config has `line_by_line=True`, then the generation will automatically prepend the text with the `bos_token` so the generation knows it's at the start of the text. This results in substantially better text generation quality.
If you have an older model trained on a `line_by_line` dataset, you can still use this workflow by making one of the following changes (a short sketch follows this list):
- Manually add `"line_by_line": true` to the `config.json` for the model.
- When the model is loaded, call `setattr(ai.model.config, "line_by_line", True)`.
- Set the new `prepend_bos` parameter of `generate()` to `True`.
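A minimal sketch of the second and third options, assuming a model saved locally with the default `aitextgen.tokenizer.json` tokenizer (the folder name and generation arguments are illustrative):

```python
from aitextgen import aitextgen

# Load an older model trained on a line_by_line dataset
# (folder and tokenizer file names are illustrative).
ai = aitextgen(model_folder="trained_model",
               tokenizer_file="aitextgen.tokenizer.json")

# Option 2: flag the loaded config so every generation prepends the bos_token.
setattr(ai.model.config, "line_by_line", True)
ai.generate(n=3, max_length=100)

# Option 3: leave the config untouched and prepend the bos_token per call.
ai.generate(n=3, max_length=100, prepend_bos=True)
```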
Misc fixes
- Improvements to generation w/ a schema so it works more correctly.
- Loading a tokenizer via `tokenizer_file` now uses the `PreTrainedTokenizerFast` class, which handles special tokens more correctly.
- Added a `skip_special_tokens` param to force printing of special tokens: good for debugging generation w/ a schema (see the sketch below).
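A hedged sketch of that debugging workflow, assuming `skip_special_tokens` is passed through `generate()` and defaults to `True` (the model folder and tokenizer names are illustrative):

```python
from aitextgen import aitextgen

ai = aitextgen(model_folder="trained_model",
               tokenizer_file="aitextgen.tokenizer.json")

# Keep special tokens (bos/eos, schema delimiters) in the printed output
# so you can inspect exactly what the model produced.
ai.generate(n=1, max_length=100, skip_special_tokens=False)
```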
v0.5.0: GPT Neo + misc fixes
aitextgen has been updated to support GPT Neo and to fix a few outstanding generation issues! However, the process introduced a few breaking changes.
Breaking Changes
Loading Models
While making model loading architecture-agnostic for GPT Neo support, it turned out aitextgen was loading models in an unofficial way, so this has now been addressed. The user must now specify the `model_folder` where the `pytorch_model.bin` and `config.json` are located (with those exact filenames).
Assuming the model is located in `trained_model`:
Old:
ai2 = aitextgen(model="trained_model/pytorch_model.bin",
tokenizer_file="aitextgen.tokenizer.json",
config="trained_model/config.json")
New:
ai2 = aitextgen(model_folder="trained_model",
tokenizer_file="aitextgen.tokenizer.json")
All notebooks and documentation have been updated with this new workflow, and an assert will be raised if the old behavior is used.
Incorrect tokenization for Colab-trained GPT-2 tokenizers.
There was an underlying issue due to a recent change in `tokenizers` which broke the implementation of the default GPT-2 tokenizer by preventing it from tokenizing `<|endoftext|>` tokens correctly. As a result, this broke truncation of generated texts at the `<|endoftext|>` token.
Only the case where the Colab GPT-2 Notebook was used for training line-by-line texts was affected by this; unfortunately, the only fix now is to retrain the model with v0.5.0.
Other Major Changes/Fixes
GPT Neo support
GPT Neo is now supported! The Colab Notebook was updated to indicate how to finetune the smaller versions of the model.
Out of the box, all variants of GPT Neo have a 2048-token context window (versus GPT-2’s 1024-token context length), allowing double the generation length, and the pretrained models are trained on much more recent data. Finetuning a GPT Neo model takes about 2x as long per step as a GPT-2 model, which is notable because increasing the context window normally causes training to scale quadratically rather than linearly; GPT Neo also appears to converge faster.
However, in terms of text-generation quality it’s currently unclear whether GPT Neo is “better”, especially on short-form content. Future releases of aitextgen will analyze this more closely.
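A minimal finetuning sketch, assuming the 125M GPT Neo checkpoint on the Hugging Face Hub; the training file name and hyperparameters are illustrative:

```python
from aitextgen import aitextgen

# Load the smallest pretrained GPT Neo variant from the Hugging Face Hub.
ai = aitextgen(model="EleutherAI/gpt-neo-125M", to_gpu=True)

# Finetune on a plain-text file; file name and step count are illustrative.
ai.train("input.txt", num_steps=3000, batch_size=1)

ai.generate(max_length=256)
```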
DeepSpeed support [BETA] (#103)
Thanks to the team at pytorch-lightning, DeepSpeed support has been added to aitextgen, allowing training of larger models (>1.5B params) across multiple GPUs. However, this isn’t fully tested, so more documentation is pending!
Misc changes
- Added a `nonempty_output` param to `generate()`, default `True`: if the output is empty (possible on short-form content), skip it if generating multiple texts, or try again if it's a single text. If `min_length` is specified, the same behavior occurs for texts below the minimum length after processing. (See the sketch after this list.)
- Bumped minimum versions of `transformers` and `pytorch-lightning`.
- Completed another pass of notebooks and documentation.
- Forced single-GPU training on Windows to avoid bugs (#116).
- Calling the aitextgen instance will now print the model type and number of params to the console, helpful for debugging.
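A hedged sketch of the `nonempty_output` behavior described in the first bullet, assuming an already-trained model in an illustrative local folder:

```python
from aitextgen import aitextgen

ai = aitextgen(model_folder="trained_model",
               tokenizer_file="aitextgen.tokenizer.json")

# With nonempty_output=True (the default), empty generations are skipped
# when generating multiple texts, or retried when generating a single text.
texts = ai.generate(n=5, max_length=30, return_as_list=True)

# Disable the behavior to see the raw outputs, empty strings included.
raw = ai.generate(n=5, max_length=30, return_as_list=True,
                  nonempty_output=False)
```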
v0.4.1: Misc Bug Fixes
- Fix CSV loading issue (#95)
- Fix regex for stripping whitespace starting a generated text. (#92)
- Fix an issue where the logger, when a custom tokenizer was used, reported that the default tokenizer was being used.
- Added a `special_tokens` param to allow the user to specify a list of token IDs to strip from the generated output (default: the `bos_token_id` and `eos_token_id`).
Gradient Checkpointing, Serialized Tokenizers, initial implementation of schema generation
0.4.0 is a big release! The release of `transformers` 4.3.0 caused a breaking issue which required a more prompt release; a bug-fix 0.4.1 release is likely, and new documentation is in progress as well. I also have demos of new features planned!
Update transformers and pytorch-lightning
The minimum version of transformers has been bumped to 4.3.0, which has a number of performance improvements such as faster GPT-2 training and generation. Fast tokenizers are now the default package-wide as well. pytorch-lightning was bumped to 1.2.0, albeit that is less exciting.
`tokenizers` was removed as a dependency since `transformers` pins its own. Speaking of...
Serialized custom tokenizers.
By default, `train_tokenizer()` will create a serialized, one-file tokenizer (e.g. `aitextgen.tokenizer.json`). This file also correctly supports the `added_tokens` parameter.
You can load the tokenizer when loading an aitextgen model with the `tokenizer_file` param. (You can still use the `merges_file` and `vocab_file` params if you have them.)
Gradient checkpointing + Layer freezing
Gradient checkpointing is now supported for GPT-2 models, allowing finetuning of larger models such as the 355M and 774M GPT-2 models!
This also enables the 1558M model to be finetuned, in theory. I also added the ability to freeze layers to allow the model to be trained within VRAM constraints, but the results are mixed; more analysis will be done.
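A hedged sketch of finetuning a larger GPT-2 with gradient checkpointing, assuming the `tf_gpt2` loading shortcut and a `gradient_checkpointing` constructor flag; the training file and step count are illustrative:

```python
from aitextgen import aitextgen

# Load the 355M GPT-2 with gradient checkpointing, which recomputes
# activations during the backward pass to trade compute for VRAM.
ai = aitextgen(tf_gpt2="355M", to_gpu=True, gradient_checkpointing=True)

ai.train("input.txt", num_steps=2000, batch_size=1)
```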
Schema-based generation
A draft implementation of schema-based generation (leveraging the new custom tokenizers) has been added.
Misc bug fixes
- Fix the TensorFlow weights URL
- Fixed an issue where prompt character length was used to check the too-long assert instead of prompt token length (#90)
- Worked around a breaking issue in transformers 4.3.0 by moving special token stripping into aitextgen instead of the tokenizer (#90)
- Added an `lstrip` param to generation, which strips all whitespace at the beginning of generated text (related to the point above)
- The refresh rate of training is now every 20 steps by default (for better performance in Jupyter/Colab)
Transformers 4.0.0 and pytorch-lightning 1.0.0 support
A release to fix breaking issues from both packages, with minor tweaks done in the meantime.
- Minimum versions are now `transformers>=4.0.0`, `pytorch-lightning>=1.0.8`, and `torch>=1.6.0`, with fixes for breaking issues in all those major versions.
- Tweaked generation to be more canonical with the newest implementation in transformers 4.
- Set the default refresh rate for training to 20 to make pytorch-lightning happy.
- Set the default learning rate for training to `1e-3`, since I forgot why it was `1e-4`.
- Set both the default vocab size for tokenizers and the CPU config vocab size to `1000` tokens from `5000`, since this allows much easier/faster training in the demo.
- Confirmed that setting `fp16=True` for GPU training with supported GPUs now works (see the sketch after this list).
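A hedged sketch of the `fp16` training path confirmed above, assuming `fp16` is a `train()` argument; the model size, file name, and step count are illustrative:

```python
from aitextgen import aitextgen

ai = aitextgen(tf_gpt2="124M", to_gpu=True)

# fp16=True enables mixed-precision training on GPUs that support it.
ai.train("input.txt", num_steps=1000, batch_size=1, fp16=True)
```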
Future releases will add more explicit features. There may be extra console output in the meantime; will see what I can do to remove those.
Remove optimizer_step() override
This fixes an issue (#44) causing training to fail due to a change in pytorch-lightning 0.8.4.
The override was only for testing; removing it is necessary for the upcoming native AMP in PyTorch 1.6 regardless.
Somehow, after this change, the model decreases in loss much faster; I may need to investigate whether the scheduler no longer works.
Cap transformers version
`transformers` 3.0.0 introduced some breaking changes, so the version is capped at less than that for now.
Fix deprecated training parameter
v0.2.1: Remove `disable_validation` param