Releases: allenai/allennlp
v1.1.0rc1
This is the first pre-release candidate for version 1.1. There will probably be at least one more candidate before the final 1.1 release.
What's new since v1.0.0
Fixed
- Reduced the amount of log messages produced by `allennlp.common.file_utils`.
- Fixed a bug where `PretrainedTransformerEmbedder` parameters appeared to be trainable in the log output even when `train_parameters` was set to `False`.
- Fixed a bug with the sharded dataset reader where it would only read a fraction of the instances in distributed training.
- Fixed checking equality of `ArrayField`s.
- Fixed a bug where `NamespaceSwappingField` did not work correctly with `.empty_field()`.
- Put more sensible defaults on the `huggingface_adamw` optimizer.
- Simplified logging so that all logging output always goes to one file.
- Fixed interaction with the Python command line debugger.
- Log the grad norm properly even when we're not clipping it.
- Fixed a bug where `PretrainedModelInitializer` fails to initialize a model with a 0-dim tensor.
- Fixed a bug with the layer unfreezing schedule of the `SlantedTriangular` learning rate scheduler.
- Fixed a regression with logging in the distributed setting. Only the main worker should write log output to the terminal.
- Pinned the version of boto3 for package managers (e.g. poetry).
- Fixed issue #4330 by updating the `tokenizers` dependency.
- Fixed a bug in `TextClassificationPredictor` so that it passes tokenized inputs to the `DatasetReader` in case it does not have a tokenizer.
- `reg_loss` is now only returned for models that have some regularization penalty configured.
- Fixed a bug that prevented `cached_path` from downloading assets from GitHub releases.
- Fixed a bug that erroneously increased the last label's false positive count when calculating fbeta metrics.
- `Tqdm` output now looks much better when the output is being piped or redirected.
- Small improvements to how the API documentation is rendered.
Added
- A method to `ModelTestCase` for running basic model tests when you aren't using config files.
- Added some convenience methods for reading files.
- Added an option to `file_utils.cached_path` to automatically extract archives.
- Added the ability to pass an archive file instead of a local directory to `Vocab.from_files`.
- Added the ability to pass an archive file instead of a glob to `ShardedDatasetReader`.
- Added a new `"linear_with_warmup"` learning rate scheduler.
- Added a check in `ShardedDatasetReader` that ensures the base reader doesn't implement manual distributed sharding itself.
- Added an option to `PretrainedTransformerEmbedder` and `PretrainedTransformerMismatchedEmbedder` to use a scalar mix of all hidden layers from the transformer model instead of just the last layer. To utilize this, just set `last_layer_only` to `False` (see the config sketch after this list).
- `cached_path()` can now read files inside of archives.
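As a quick illustration of the new scalar-mix option, here is a minimal config sketch. The surrounding embedder keys and the `bert-base-uncased` model name are illustrative; only `last_layer_only` is the new parameter described above:

```jsonnet
{
    "model": {
        "text_field_embedder": {
            "token_embedders": {
                "tokens": {
                    "type": "pretrained_transformer",
                    "model_name": "bert-base-uncased",
                    // Use a scalar mix of all hidden layers instead of only the last one.
                    "last_layer_only": false
                }
            }
        }
    }
}
```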
Changed
- Not specifying a `cuda_device` now automatically determines whether to use a GPU or not.
- Discovered plugins are logged so you can see what was loaded.
- `allennlp.data.DataLoader` is now an abstract registrable class. The default implementation remains the same, but was renamed to `allennlp.data.PyTorchDataLoader`.
- `BertPooler` can now unwrap and re-wrap extra dimensions if necessary.
- New `transformers` dependency. Only version >=3.0 now supported.
Commits
4eb9795 Prepare for release v1.1.0rc1
f195440 update 'Models' links in README (#4475)
9c801a3 add CHANGELOG to API docs, point to license on GitHub, improve API doc formatting (#4472)
69d2f03 Clean up Tqdm bars when output is being piped or redirected (#4470)
7b188c9 fixed bug that erronously increased last label's false positive count (#4473)
64db027 Skip ETag check if OSError (#4469)
b9d011e More BART changes (#4468)
7a563a8 add option to use scalar mix of all transformer layers (#4460)
d00ad66 Minor tqdm and logging clean up (#4448)
6acf205 Fix regloss logging (#4449)
8c32ddf Fixing bug in TextClassificationPredictor so that it passes tokenized inputs to the DatasetReader (#4456)
b9a9164 Update transformers requirement from <2.12,>=2.10 to >=2.10,<3.1 (#4446)
181ef5d pin boto3 to resolve some dependency issues (#4453)
c75a1eb ensure base reader of ShardedDatasetReader doesn't implement sharding itself (#4454)
8a05ad4 Update CONTRIBUTING.md (#4447)
5b988d6 ensure only rank 0 worker writes to terminal (#4445)
8482f02 fix bug with SlantedTriangular LR scheduler (#4443)
e46a578 Update transformers requirement from <2.11,>=2.10 to >=2.10,<2.12 (#4411)
8229aca Fix pretrained model initialization (#4439)
60deece Fix type hint in text_field.py (#4434)
23e549e More multiple-choice changes (#4415)
6d0a4fd generalize DataLoader (#4416)
acd9995 Automatic file-friendly logging (#4383)
637dbb1 fix README, pin mkdocs, update mkdocs-material (#4412)
9c4dfa5 small fix to pretrained transformer tokenizer (#4417)
84988b8 Log plugins discovered and filter out transformers "PyTorch version ... available" log message (#4414)
54c41fc Adds the ability to automatically detect whether we have a GPU (#4400)
96ff585 Changes from my multiple-choice work (#4368)
eee15ca Assign an empty mapping array to empty fields of NamespaceSwappingField (#4403)
aa2943e Bump mkdocs-material from 5.3.2 to 5.3.3 (#4398)
7fa7531 fix eq method of ArrayField (#4401)
e104e44 Add test to ensure data loader yields all instances when batches_per_epoch is set (#4394)
b6fd697 fix sharded dataset reader (#4396)
30e5dbf Bump mypy from 0.781 to 0.782 (#4395)
b0ba2d4 update version
1d07cc7 Bump mkdocs-material from 5.3.0 to 5.3.2 (#4389)
ffc5184 ensure Vocab.from_files and ShardedDatasetReader can handle archives (#4371)
20afe6c Add Optuna integrated badge to README.md (#4361)
ba79f14 Bump mypy from 0.780 to 0.781 (#4390)
85e531c Update README.md (#4385)
c2ecb7a Add a method to ModelTestCase for use without config files (#4381)
6852def pin some doc building requirements (#4386)
bf422d5 Add github template for using your own python run script (#4380)
ebde6e8 Bump overrides from 3.0.0 to 3.1.0 (#4375)
e52b751 ensure transformer params are frozen at initialization when train_parameters is false (#4377)
3e8a9ef Add link to new template repo for config file development (#4372)
4f70bc9 tick version for nightly releases
63a5e15 Update spacy requirement from <2.3,>=2.1.0 to >=2.1.0,<2.4 (#4370)
ef7c75b reduce amount of log messages produced by file_utils (#4366)
v1.0.0
The 1.0 version of AllenNLP is the culmination of more than 500 commits over the course of several months of work from our engineering team. The AllenNLP library has had wide-reaching appeal so far in its lifetime, and this 1.0 release represents an important maturity milestone. While we will continue to move fast to keep up with the ever-changing state of the art, we will be increasingly conscious of the effect future API changes have on our existing user base.
This release touches almost every aspect of the library, ranging from improving documentation to adding new natural-language processing components, to adjusting our APIs so they serve the community for the long haul. While we cannot summarize everything in these release notes, here are some of the main milestones for the 1.0 release.
- We are releasing several new models, such as:
  a. TransformerQA, a reading comprehension model (paper, demo)
  b. An improved coreference model, with a 17% absolute improvement (architecture paper/embedder paper, demo)
  c. The NMN reading comprehension model (paper, demo)
  d. The RoBERTa models for textual entailment, or NLI (paper, demo)
- We have new introductory material in the form of an interactive guide, showing how to use library components and our experiment framework. The guide's goal is to provide a comprehensive introduction to AllenNLP for people with a good understanding of machine learning, Python, and some PyTorch.
- We have improved performance across the library.
  a. Switching to native PyTorch data loading, which is not only much faster but also allows the three main parts of the library (data, model, and training) to interoperate with any native PyTorch code.
  b. Enabled support for 16-bit floating point through Apex.
  c. Multi-GPU training now utilizes a separate Python process for each GPU. These workers communicate using PyTorch's `distributed` module. This is more efficient than the old system, which used a single Python process and was therefore limited by the GIL.
- We separated our models into a model repository (allennlp-models), so we have a lean core library with fewer dependencies.
- We dramatically simplified how AllenNLP code corresponds to AllenNLP configuration files, which also makes the library easy to use from raw Python.
But changes are not limited to these. Some other highlights are that we have:
- Support for gradient accumulation.
- Improved configurability of the trainer so you can inject your own call on each batch.
- Seamless support for using word-piece tokenization on pre-tokenized text.
- A sampler that creates batches with roughly equal numbers of tokens.
- Unified support for Huggingface's transformer library.
- Support for token type IDs throughout the library.
- Nightly releases of the library to pip.
- BLEU and ROUGE metrics.
Updates since v1.0.0rc6
Fixed
- Lazy dataset readers now work correctly with multi-process data loading.
- Fixed race conditions that could occur when using a dataset cache.
- Fixed a bug where all datasets would be loaded for vocab creation even if not needed.
Added
- A parameter to the `DatasetReader` class: `manual_multi_process_sharding`. This is similar to the `manual_distributed_sharding` parameter, but applies when using a multi-process `DataLoader`.
Commits
29f3b6c Prepare for release v1.0.0
a8b840d fix some formatting issues in README (#4365)
d3ed619 fix Makefile
c554910 quick doc fixes (#4364)
b764bef simplify dataset classes, fix multi-process lazy loading (#4344)
884a614 Bump mkdocs-material from 5.2.3 to 5.3.0 (#4359)
6a124d8 ensure 'from_files' vocab doesn't load instances (#4356)
87c23e4 Fix handling of "datasets_for_vocab_creation" param (#4350)
c3755d1 update CHANGELOG
Upgrade guide from v0.9.0
There are too many changes to be exhaustive, but here is a list of the most common issues:
- You can continue to use the `allennlp` command line, but if you want to invoke it through Python, use `python -m allennlp <command>` instead of `python -m allennlp.run <command>`.
- `"bert_adam"` is now `"adamw"`.
- We no longer support the `"gradient_accumulation_batch_size"` parameter to the trainer. Use `"num_gradient_accumulation_steps"` instead (see the config sketch below).
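For example, a trainer fragment using the new parameter might look like the sketch below; the surrounding values are placeholders, and only the parameter name comes from the item above:

```jsonnet
{
    "trainer": {
        // Accumulate gradients over 4 batches before each optimizer step,
        // replacing the old "gradient_accumulation_batch_size" setting.
        "num_gradient_accumulation_steps": 4,
        "num_epochs": 20
    }
}
```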
Using the `transformers` library
AllenNLP 1.0 replaces the mish-mash of transformer libraries and dependencies that we had in v0.9.0 with one implementation that uses https://github.com/huggingface/transformers under the hood. For cases where you can work directly with the word pieces that are used by the transformers, use `"pretrained_transformer"` for tokenizers, indexers, and embedders. If you want to use tokens from pre-tokenized text, use `"pretrained_transformer_mismatched"`. The latter turns the text into word pieces, embeds them with the transformer, and then combines word pieces to produce an embedding for the original tokens.
The parameters `requires_grad` and `top_layer_only` are no longer supported. If you are converting an old model that used to use `"bert-pretrained"`, this is important! `requires_grad` used to be `False` by default, so it would not train the transformer itself. This saves memory and time at the cost of performance. The new code does not support this setting, and will always train the transformer. You can prevent this by setting `requires_grad` to `False` in a parameter group when setting up the optimizer; see the sketch below.
You no longer need to specify `do_lowercase`, as this is handled automatically now.
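A minimal sketch of what this can look like in a config. The exact nesting of the `parameter_groups` entry, the regex, the model name, and the optimizer choice here are illustrative assumptions; the notes above only say that `requires_grad` can be set to `False` in a parameter group:

```jsonnet
{
    "model": {
        "text_field_embedder": {
            "token_embedders": {
                "tokens": {
                    "type": "pretrained_transformer_mismatched",
                    "model_name": "bert-base-uncased"
                }
            }
        }
    },
    "trainer": {
        "optimizer": {
            "type": "huggingface_adamw",
            // Freeze every parameter whose name matches the (illustrative) regex
            // by setting requires_grad to False in that parameter group.
            "parameter_groups": [
                [["text_field_embedder.*transformer.*"], {"requires_grad": false}]
            ]
        }
    }
}
```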
Config file changes
In 1.0, we simplified how `FromParams` works. As a result, some things in the config files need to change to work with 1.0:
- The way Vocabulary options are specified in config files has changed. See #3550. If you want to load a vocabulary from files, you should specify `"type": "from_files"`, and use the key `"directory"` instead of `"directory_path"`.
- When instantiating a `BasicTextFieldEmbedder` `from_params`, you used to be able to have embedder names be top-level keys in the config file (e.g., `"embedder": {"elmo": ELMO_PARAMS, "tokens": TOKEN_PARAMS}`). We changed this a long time ago to prefer wrapping them in a `"token_embedders"` key, and this is now required (e.g., `"embedder": {"token_embedders": {"elmo": ELMO_PARAMS, "tokens": TOKEN_PARAMS}}`).
- The `TokenCharactersEncoder` now requires you to specify the `vocab_namespace` for the underlying embedder. It used to default to `"token_characters"`, matching the `TokenCharactersIndexer` default, but making that work required some custom magic that wasn't worth the complexity. So instead of `"token_characters": {"type": "character_encoding", "embedding": {"embedding_dim": 25}, "encoder": {...}}`, you need to change this to `"token_characters": {"type": "character_encoding", "embedding": {"embedding_dim": 25, "vocab_namespace": "token_characters"}, "encoder": {...}}`.
- Regularization now needs another key in a config file. Instead of specifying regularization as `"regularizer": [[regex1, regularizer_params], [regex2, regularizer_params]]`, it now must be specified as `"regularizer": {"regexes": [[regex1, regularizer_params], [regex2, regularizer_params]]}`.
- We changed initialization in a similar way to regularization. Instead of specifying initialization as `"initializer": [[regex1, initializer_params], [regex2, initializer_params]]`, it now must be specified as `"initializer": {"regexes": [[regex1, initializer_params], [regex2, initializer_params]]}`. Also, you used to be able to have `initializer_params` be `"prevent"`, to prevent initialization of matching parameters. This is now done with a separate key passed to the initializer: `"initializer": {"regexes": [..], "prevent_regexes": [regex1, regex2]}`.
- `num_serialized_models_to_keep` and `keep_serialized_model_every_num_seconds` used to be able to be passed as top-level parameters to the `trainer`, but now they must always be passed to the `checkpointer` instead. For example, if you had `"trainer": {"num_serialized_models_to_keep": 1}`, it now needs to be `"trainer": {"checkpointer": {"num_serialized_models_to_keep": 1}}`. Also, the default for that setting is now `2`, so AllenNLP will no longer fill up your hard drive!
- Tokenizer specification changed because of #3361. Instead of something like `"tokenizer": {"word_splitter": {"type": "spacy"}}`, you now just do `"tokenizer": {"type": "spacy"}` (more technically: the `WordTokenizer` has now been removed, with the things we used to call `WordSplitter`s now just moved up to be top-level `Tokenizer`s themselves).
- The `namespace_to_cache` argument to `ElmoTokenEmbedder` has been removed as a config file option. You can still pass `vocab_to_cache` to the constructor of this class, but this functionality is no longer available from a config file. If you used this and are really put out by this change, let us know, and we'll see what we can do.
Iterators ➔ DataLoaders
AllenNLP now uses PyTorch's API for data iteration, rather than our own custom one. This means that the `train_data`, `validation_data`, `iterator` and `validation_iterator` arguments to the `Trainer` have been removed and replaced with `data_loader` and `validation_dataloader`.
Previous config files which looked like:

```jsonnet
{
    "iterator": {
        "type": "bucket",
        ...
```
v1.0.0rc6
Fixed
- A bug where `TextField`s could not be duplicated since some tokenizers cannot be deep-copied. See #4270.
- Our caching mechanism had the potential to introduce race conditions if multiple processes were attempting to cache the same file at once. This was fixed by using a lock file tied to each cached file.
- `get_text_field_mask()` now supports padding indices that are not `0`.
- A bug where `predictor.get_gradients()` would return an empty dictionary if an embedding layer had `trainable` set to `False`.
- Fixes `PretrainedTransformerMismatchedIndexer` in the case where a token consists of zero word pieces.
- Fixes a bug when using a lazy dataset reader that results in a `UserWarning` from PyTorch being printed at every iteration during training.
- Predictor names were inconsistently switching between dashes and underscores. Now they all use underscores.
- `Predictor.from_path` now automatically loads plugins (unless you specify `load_plugins=False`) so that you don't have to manually import a bunch of modules when instantiating predictors from an archive path.
- `allennlp-server` automatically found as a plugin once again.
Added
- A `duplicate()` method on `Instance`s and `Field`s, to be used instead of `copy.deepcopy()`.
- A batch sampler that makes sure each batch contains approximately the same number of tokens (`MaxTokensBatchSampler`); see the config sketch after this list.
- Functions to turn a sequence of token indices back into tokens.
- The ability to use Huggingface encoder/decoder models as token embedders.
- Improvements to beam search.
- ROUGE metric.
- Polynomial decay learning rate scheduler.
- A `BatchCallback` for logging CPU and GPU memory usage to tensorboard. This is mainly for debugging because using it can cause a significant slowdown in training.
- Ability to run pretrained transformers as an embedder without training the weights.
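A rough sketch of how the new batch sampler can be configured. The registered name `"max_tokens_sampler"` and the `max_tokens` parameter are assumptions based on the class name above, not something stated in these notes:

```jsonnet
{
    "data_loader": {
        "batch_sampler": {
            // "max_tokens_sampler" and "max_tokens" are assumed names for
            // MaxTokensBatchSampler, not confirmed by these release notes.
            "type": "max_tokens_sampler",
            // Approximate cap on the number of tokens in each batch.
            "max_tokens": 1024
        }
    }
}
```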
Changed
- Similar to our caching mechanism, we introduced a lock file to the vocab to avoid race conditions when saving/loading the vocab from/to the same serialization directory in different processes.
- Changed the `Token`, `Instance`, and `Batch` classes along with all `Field` classes to "slots" classes. This dramatically reduces the size in memory of instances.
- SimpleTagger will no longer calculate span-based F1 metric when `calculate_span_f1` is `False`.
- CPU memory for every worker is now reported in the logs and the metrics. Previously this was only reporting the CPU memory of the master process, and so it was only correct in the non-distributed setting.
- To be consistent with PyTorch `IterableDataset`, `AllennlpLazyDataset` no longer implements `__len__()`. Previously it would always return 1.
- Removed old tutorials, in favor of the new AllenNLP Guide.
- Changed the vocabulary loading to consider new lines for Windows/Linux and Mac.
Commits
d98d13b add 'allennlp_server' to default plugins (#4348)
33d0cd8 fix file utils test (#4349)
f4d330a Update vocabulary load to a system-agnostic newline (#4342)
2012fea remove links to tutorials in API docs (#4346)
3d8ce44 Fixes spelling in changelog
73289bc Consistently use underscores in Predictor names (#4340)
2d03c41 Allow using pretrained transformers without fine-tuning them (#4338)
8f68d69 load plugins from Predictor.from_path (#4333)
5c6cc3a Bump mkdocs-material from 5.2.2 to 5.2.3 (#4341)
7ab7551 Removing old tutorials, pointing to the new guide in the README (#4334)
902d36a Fix bug with lazy data loading, un-implement len on AllennlpLazyDataset (#4328)
11b5799 log metrics in alphabetical order (#4327)
7d66b3e report CPU memory usage for each worker (#4323)
06bac68 make Instance, Batch, and all field classes "slots" classes (#4313)
2b2d141 Bump mypy from 0.770 to 0.780 (#4316)
a038c01 Update transformers requirement from <2.11,>=2.9 to >=2.9,<2.12 (#4315)
345459e Stop calculating span-based F1 metric when calculate_span_f1 is False. (#4302)
fc47bf6 Deals with the case where a word doesn't have any word pieces assigned (#4301)
11a08ae Making Token class a "slots" class (#4312)
32bccfb Fix a bug where predictor.get_gradients() would return an empty... (#4305)
33a4945 ensure CUDA available in GPU checks workflow (#4310)
d51ffa1 Update transformers requirement from <2.10,>=2.9 to >=2.9,<2.11 (#4282)
75c07ab Merge branch 'master' of github.com:allenai/allennlp
8c9421d fix Makefile
77b432f Update README.md (#4309)
720ad43 A few small fixes in the README.md (#4307)
a7265c0 move tensorboard memory logging to BatchCallback (#4306)
91d0fa1 remove setup.cfg (#4300)
5ad7a33 Support for bart in allennlp-models (#4169)
25134f2 add lock file within caching and vocab saving/loading mechanisms (#4299)
58dc84e add 'Feature request' label to template
9526f00 Update issue templates (#4293)
79999ec Adds a "duplicate()" method on instances and fields (#4294)
8ff47d3 Set version to rc6
v1.0.0rc5
Fixed
- Fix bug where `PretrainedTransformerTokenizer` crashed with some transformers (#4267).
- Make `cached_path` work offline.
- Tons of docstring inconsistencies resolved.
- Nightly builds no longer run on forks.
- Distributed training now automatically figures out which worker should see which instances.
- A race condition bug in distributed training caused by saving the vocab to file from the master process while other processes might be reading those files.
- Unused dependencies in `setup.py` removed.
Added
- Additional CI checks to ensure docstrings are consistently formatted.
- Ability to train on CPU with multiple processes by setting `cuda_devices` to a list of negative integers in your training config. For example: `"distributed": {"cuda_devices": [-1, -1]}`. This is mainly to make it easier to test and debug distributed training code (see the sketch after this list).
- Documentation for when parameters don't need config file entries.
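Putting that together, a minimal config fragment for two CPU-only workers might look like the sketch below; the trainer keys are placeholders, and only the `"distributed"` block comes from the item above:

```jsonnet
{
    "distributed": {
        // -1 means "no GPU" for that worker, so this runs two CPU processes.
        "cuda_devices": [-1, -1]
    },
    "trainer": {
        "num_epochs": 1
    }
}
```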
Changed
- The `allennlp test-install` command now just ensures the core submodules can be imported successfully, and prints out some other useful information such as the version, PyTorch version, and the number of GPU devices available.
- All of the tests moved from `allennlp/tests` to `tests` at the root level, and `allennlp/tests/fixtures` moved to `test_fixtures` at the root level. The PyPI source and wheel distributions will no longer include tests and fixtures.
Commits
7dcc60b Update version for release v1.0.0rc5
f421e91 clean up dependencies (#4290)
a9be961 Bump mkdocs-material from 5.2.0 to 5.2.2 (#4288)
69fc5b4 Update saliency_interpreter.py (#4286)
e52fea2 Makes the EpochCallback work the same way as the BatchCallback (#4277)
6574823 Make special token inference logic more robust (#4267)
24617c0 Bump overrides from 2.8.0 to 3.0.0 (#4249)
f7d9673 Bump mkdocs-material from 5.1.6 to 5.2.0 (#4257)
5198a5c Document when parameters do not need an entry in a config file (#4275)
4ee2735 update contribution guidelines (#4271)
dacbb75 wait for non-master workers to finish reading vocab before master worker saves it (#4274)
f27475a Enable multi-process training on CPU (#4272)
7e683dd Workers in the distributed scenario need to see different instances (#4241)
9c51d6c move test and fixtures to root level and simplify test-install command (#4264)
65a146d Clean up the command to create a commit list. (#4263)
88683d4 switch to tokenless codecov upload (#4261)
b41d448 Add a CHANGELOG (#4260)
7d71398 make 'cached_path' work offline (#4253)
fc81067 move py2md back to scripts (#4251)
4de68a4 Improves API docs and docstring consistency (#4244)
1b0d231 tick version to rc5
Version 1.0.0 Release Candidate 4
Commits
3f19336 gitignore evalb binary (#4235)
189624d Make some arguments to evaluate() optional, add docstring (#4237)
5227420 remove pre-commit (#4236)
6701e59 Bump mkdocs-material from 5.1.5 to 5.1.6 (#4221)
e3d72fc Use the new tokenizers (#3868)
592c653 Add a more informative exception when there's no GPU available. (#4230)
114751f Attach nltk and HF caches to docker containers in CI (#4232)
7d9b72c remove unused path in test image (#4229)
89238d2 Remove unused path. (#4226)
b461f3f fix new linting errors (#4227)
0bcab36 consolidate testing decorators (#4213)
72061b1 ensure Docker images get the right name and tag (#4214)
edf91ac improve Docker workflows (#4210)
b916720 Add wordpiece_mask to default to bool tensor (#4206)
82bf58a Switch to pytest style test classes, use plain asserts (#4204)
743d2d8 Separate linting from formatting in CI, always run all steps of workflow (#4202)
4e47894 Add linke to allennlp-models in README (#4196)
1895743 Bump mkdocs-material from 5.1.4 to 5.1.5 (#4195)
b6e0ba9 attach allennlp cache to Docker images and fix apex test (#4197)
c458303 Fix Bug in evaluate script (#4199)
967660a remove namespace plugin mechanism (#4188)
c09833c Remove allennlp sparse_clip_grad and replace with torch clip_grad_norm_. (#4159)
42a4e63 Adds links in readme to stable and latest docs (#4186)
d67e721 Fix heuristic in util.get_token_ids_from_text_field_tensors (#4184)
2602c8f improve error message for Vocab.get_token_index (#4185)
31616de tweak torch requirement (#4166)
e99de85 Improve Docker-based workflows (#4183)
ab42189 update GPU checks CI (#4182)
e56992b Add failing from_archive test (#4156)
cbe7458 Add self-hosted runner for GPU checks (#4180)
2544e59 Find the right embedding layer for mismatched cases (#4179)
74c8404 Bump mkdocs-material from 5.1.3 to 5.1.4 (#4174)
0f8346d Fix XLNet token type number (#4125)
ca9118f ensure docs can build in PR workflow (#4178)
6038fd1 remove verbose mode of linters (#4176)
7cbeb6c Display activation functions as modules. (#4045)
be53f07 add other missing param to ReduceOnPlateau LR Scheduler (#4177)
706bf52 DOCS: Clean up the docs for commands (#4145)
82cae1b add missing param threshold_mode (#4173)
52ae792 Use new env var in the allennlp-models build (#4172)
08874e9 remove elmo command (#4168)
4a6023b Fix logging (#4164)
26e313b tick version to rc4 for nightly releases
11f4307 Reduce number of warnings seen after running tests (#4153)
af890b2 add test for version convention (#4157)
d9dd503 ensure typing backport uninstalled first (#4162)
9d8862a Move default predictors (#4154)
b0c7ac7 Uninstall typing before running pip again (#4161)
0fe2839 Fixes an error with pip (#4158)
69e7511 Bump mkdocs-material from 5.1.1 to 5.1.3 (#4150)
dfe8b1c clean up scripts dir (#4152)
27e374a Fix from_params when the class has no registered impl (#4090)
bc2435d Update RELEASE_PROCESS.md (#4151)
Version 1.0.0 Release Candidate 3
Commits
4720199 docs workflow quick fix
24e11a8 add shortcuts for 'stable' and 'latest' to docs (#4138)
b04d317 hard-code version (#4142)
c2e1be9 Fix broken link pointing to GitHub Actions (#4144)
d709e59 modified the make_output_human_readable method in basic_classifier for allennlp-demo (#4038)
6ea6c59 test logging errors work-around (#4139)
4b8edc5 Some tokenizers don't have padding tokens (#4131)
5e8b2ba Update torch requirement from <=1.4.0,>1.3.1 to >1.3.1,<1.6.0 (#4118)
2a063d0 Remove unused Dockerfiles (#4137)
1384a70 document release process (#4133)
b3fcf60 Push 'latest' tag versions of Docker images (#4134)
Version 1.0.0 Release Candidate 2
Commits
8f8288b Fix version suffix in release job and make deps caching more robust (#4130)
66651d1 Bump pre-commit from 2.2.0 to 2.3.0 (#4127)
fc82450 Fix typo in InitializerApplicator docstring (#4100)
42b9387 Simplify GitHub Actions job graph (#4129)
7b51c22 Fix cron schedule (#4128)
583ea41 pypi release workflow HOT FIX
5d58993 add nightly release schedule (#4123)
f5a4f4f Update README with new CI links / badge (#4116)
8be8e85 Ensure all deps are up-to-date in CI (#4120)
98a2953 Simplify pull request workflow (#4117)
0056b9c fix docs output directory (#4113)
b4c7d76 Adds release workflow, fixes logging test and docs workflow (#4112)
0638cbb Fix param name in GradientDescentTrainer doc (#4091)
ece5c59 Bump mkdocs-material from 5.1.0 to 5.1.1 (#4103)
78325af Update README.md (#4110)
15e6b03 Fix docs workflow (#4107)
10fdf27 More fixes to GitHub Actions (#4106)
c9b7cbd Fixes for GitHub Actions (#4105)
4eed5d4 Add custom 'logging.Logger' that has warn_once(), debug_once(), etc m… (#4062)
24de9b1 Add main CI workflow to GH Actions (#4097)
6cb944c Correctly computes the number of training steps (#4099)
b4e10dd Make sure the nightly build doesn't produce bogus version numbers
1beddcd a few more doc fixes (#4078)
51ef5b5 improve sorting_key error message (#4083)
039522a Merge branch 'release-v1.0.0.rc1'
3ead054 Merge branch 'master' of https://github.com/allenai/allennlp
3f5eabb Bump version numbers to v1.0.0.rc2-unreleased
25ba3a2 Fix error message when top level config key missing (#4081)
Version 1.0.0 Release Candidate 1
Breaking changes:
- WordSplitter has been replaced by Tokenizer. See #3361. (This also requires a config file change; see below.)
- Models are now available at https://github.com/allenai/allennlp-models.
- `Model.decode()` was renamed to `Model.make_output_human_readable()`.
- `trainer.cuda_device` no longer takes a list; use `distributed.cuda_devices`. See #3529.
- TODO: As of #3529 dataset readers used in the distributed setting need to shard instances to separate processes internally. We should fix this before releasing a final version.
- Dataset caching is now handled entirely with a parameter passed to the dataset reader, not with command-line arguments. If you used the caching arguments to `allennlp train`, instead just add a `"cache_directory"` key to your dataset reader parameters.
- Sorting keys in the bucket iterator have changed; now you just give a field name that you want to sort by, and that's it. We also implemented a heuristic to figure out what the right sorting key is, so in almost all cases you can just remove the `sorting_keys` parameter from your config file and our code will do the right thing. Some cases where our heuristic might get it wrong are if you have a `ListField[TextField]` and you want the size of the list to be the padding key, or if you have an `ArrayField` with a long but constant dimension that shouldn't be considered in sorting.
- The argument order to `Embedding()` changed (because we made an argument optional and had to move it; pretty unfortunate, sorry). It used to be `Embedding(num_embeddings, embedding_dim)`. Now it's `Embedding(embedding_dim, num_embeddings)`; see the sketch after this list.
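One way to avoid depending on the positional order at all is to pass both arguments by keyword. A minimal sketch, assuming the usual import path and with placeholder sizes:

```python
from allennlp.modules.token_embedders import Embedding

# Keyword arguments sidestep the positional-order change described above
# (the dimension and vocabulary size here are placeholders).
embedding = Embedding(embedding_dim=300, num_embeddings=25000)
```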
Config file changes:
In 1.0, we made some significant simplifications to how `FromParams` works, so you basically never see it in your code and don't ever have to think about adding custom `from_params` methods. We just follow argument annotations and have the config file always exactly match the argument names in the constructors that are called.
This meant that we needed to remove cases where we previously allowed mismatches between argument names and keys that show up in config files. There are a number of places that were affected by this:
- The way Vocabulary options are specified in config files has changed. See #3550. If you want to load a vocabulary from files, you should specify `"type": "from_files"`, and use the key `"directory"` instead of `"directory_path"`.
- When instantiating a `BasicTextFieldEmbedder` `from_params`, you used to be able to have embedder names be top-level keys in the config file (e.g., `"embedder": {"elmo": ELMO_PARAMS, "tokens": TOKEN_PARAMS}`). We changed this a long time ago to prefer wrapping them in a `"token_embedders"` key, and this is now required (e.g., `"embedder": {"token_embedders": {"elmo": ELMO_PARAMS, "tokens": TOKEN_PARAMS}}`).
- The `TokenCharactersEncoder` now requires you to specify the `vocab_namespace` for the underlying embedder. It used to default to `"token_characters"`, matching the `TokenCharactersIndexer` default, but making that work required some custom magic that wasn't worth the complexity. So instead of `"token_characters": {"type": "character_encoding", "embedding": {"embedding_dim": 25}, "encoder": {...}}`, you need to change this to `"token_characters": {"type": "character_encoding", "embedding": {"embedding_dim": 25, "vocab_namespace": "token_characters"}, "encoder": {...}}`.
- Regularization now needs another key in a config file. Instead of specifying regularization as `"regularizer": [[regex1, regularizer_params], [regex2, regularizer_params]]`, it now must be specified as `"regularizer": {"regexes": [[regex1, regularizer_params], [regex2, regularizer_params]]}`.
- Initialization was changed in a similar way to regularization. Instead of specifying initialization as `"initializer": [[regex1, initializer_params], [regex2, initializer_params]]`, it now must be specified as `"initializer": {"regexes": [[regex1, initializer_params], [regex2, initializer_params]]}`. Also, you used to be able to have `initializer_params` be `"prevent"`, to prevent initialization of matching parameters. This is now done with a separate key passed to the initializer: `"initializer": {"regexes": [..], "prevent_regexes": [regex1, regex2]}`.
- `num_serialized_models_to_keep` and `keep_serialized_model_every_num_seconds` used to be able to be passed as top-level parameters to the `trainer`, but now they must always be passed to the `checkpointer` instead. For example, if you had `"trainer": {"num_serialized_models_to_keep": 1}`, it now needs to be `"trainer": {"checkpointer": {"num_serialized_models_to_keep": 1}}`.
- Tokenizer specification changed because of #3361. Instead of something like `"tokenizer": {"word_splitter": {"type": "spacy"}}`, you now just do `"tokenizer": {"type": "spacy"}` (more technically: the `WordTokenizer` has now been removed, with the things we used to call `WordSplitter`s now just moved up to be top-level `Tokenizer`s themselves).
- The `namespace_to_cache` argument to `ElmoTokenEmbedder` has been removed as a config file option. You can still pass `vocab_to_cache` to the constructor of this class, but this functionality is no longer available from a config file. If you used this and are really put out by this change, let us know, and we'll see what we can do.
Changes:
- `allennlp make-vocab` and `allennlp dry-run` are deprecated, replaced with a `--dry-run` flag which can be passed to `allennlp train`. We did this so that people might find it easier to actually use the dry run features, as they don't require any config changes etc.
- When constructing objects using our `FromParams` pipeline, we now inspect superclass constructors when the concrete class has a `**kwargs` argument, adding arguments from the superclass. This means, e.g., that we can add parameters to the base `DatasetReader` class that are immediately accessible to all subclasses, without code changes. All that this requires is that you take a `**kwargs` argument in your constructor, and you call `super().__init__(**kwargs)`. See #3633 and the sketch after this list.
- The order of the `num_embeddings` and `embedding_dim` constructor arguments for `Embedding` has changed. Additionally, passing a `pretrained_file` now initializes the embedding regardless of whether it was constructed from a config file, or directly instantiated in code. See #3763.
- Our default logging behavior is now much less verbose.
Notable bug fixes
There are far too many small fixes to be listed here, but these are some notable fixes:
- NER interpretations have never actually reproduced the result in the original AllenNLP Interpret paper. This version finds and fixes that problem, which was that the loss masking code to make the model compute correct gradients for a single span prediction was not checked in. See #3971.
Upgrading Guide
Iterators -> DataLoaders
AllenNLP now uses PyTorch's API for data iteration, rather than our own custom one. This means that the `train_data`, `validation_data`, `iterator` and `validation_iterator` arguments to the `Trainer` are now deprecated, having been replaced with `data_loader` and `validation_dataloader`.
Previous config files which looked like:

```jsonnet
{
    "iterator": {
        "type": "bucket",
        "sorting_keys": [["tokens"], ["num_tokens"]],
        "padding_noise": 0.1
        ...
    }
}
```

Now become:

```jsonnet
{
    "data_loader": {
        "batch_sampler": {
            "type": "bucket",
            // sorting keys are no longer required! They can be inferred automatically.
            "padding_noise": 0.1
        }
    }
}
```
Multi-GPU
AllenNLP now uses `DistributedDataParallel` for parallel training, rather than `DataParallel`. With `DistributedDataParallel`, each worker (GPU) runs in its own process. As such, each process also has its own `Trainer`, which now takes a single GPU ID only.
Previous config files which looked like:

```jsonnet
{
    "trainer": {
        "cuda_device": [0, 1, 2, 3],
        "num_epochs": 20,
        ...
    }
}
```

Now become:

```jsonnet
{
    "distributed": {
        "cuda_devices": [0, 1, 2, 3],
    },
    "trainer": {
        "num_epochs": 20,
        ...
    }
}
```
In addition, if it is important that your dataset is correctly sharded such that one epoch strictly corresponds to one pass over the data, your dataset reader should contain the following logic to read instances on a per-worker basis:
```python
rank = torch.distributed.get_rank()
world_size = torch.distributed.get_world_size()

for idx, inputs in enumerate(data_file):
    if idx % world_size == rank:
        yield self.text_to_instance(inputs)
```
v0.9.0
Main features
- AllenNLP Interpret. This lets you interpret the predictions of any AllenNLP model, using gradient-based visualization and attack techniques. You can (1) explore existing interpretations for models that we have implemented at demo.allennlp.org; (2) easily add interpretations for your own model, either programmatically or in a live demo; and (3) easily add new interpretation methods that can be used with any AllenNLP model.
- Compatibility with `pytorch-transformers`, so you can use RoBERTa or whatever else as your base encoder.
Also of note
- A new, more flexible seq2seq abstraction is available (though, honestly, I think we all agree that fairseq or OpenNMT are still better for seq2seq models).
- When specifying types for registrable items, you can now use a fully-qualified path, like `"my_package.models.my_new_fancy_classifier"`, instead of needing to pass `--include-package` everywhere (see the sketch below).
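For example, a model section of a config can reference the class directly; the key layout is illustrative, and the type string comes from the item above:

```jsonnet
{
    "model": {
        // Fully-qualified class path, so --include-package is not needed.
        "type": "my_package.models.my_new_fancy_classifier"
    }
}
```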
Complete commit list
052353e (tag: v0.9.0) bump version number to v0.9.0
ff0d44a (origin/master, origin/HEAD) reversing NER for interpet UI (#3283)
3b22011 Composed Sequence to Sequence Abstraction (#2913)
b85f29c Fix F1Measure returning true positives, false positives, et al. only for the first class (#3279)
64143c4 upgrade to latest pylint (#3266)
d09042e Fix crash when hotflip gets OOV input (#3277)
2a95022 Revert batching for input reduction (#3276)
052e8d3 Reduce number of samples in smoothgrad (#3273)
76d248f Reduce hotflip vocab size, batch input reduction beam search (#3270)
9a67546 fix empty sequence bug (#3271)
87fb294 Update question.md (#3267)
daed835 Fix wrong partition to types in DROP evaluation (#3263)
41a4776 Unidirectional LM doesn't return backward loss. (#3256)
3e0bad4 Minor fixes for interpret code (#3260)
05be16a allow implicit package imports (#3253)
48de866 Assorted fixes for run_with_beaker.py (#3248)
c732cbf Add additive attention & unittest (#3238)
07364c6 Make Instance in charge of when to re-index (#3239)
7b50b69 Replace staticmethods with classmethods (#3229)
7cfaab4 Add ERROR callback event (#2983)
ce50407 Revert "Use an NVIDIA base image. (#3177)" (#3222)
b1caa9e Use an NVIDIA base image. (#3177)
4625a9d Improve check_links.py CI script (#3141)
5e2206d Add a reference to Joe Barrow's blog
27ebcf6 added infer_type_and_cast flags (#3209)
bbaf1fc Benchmark iterator, avoid redundant queue, remove managers. (#3119)
78ee3d8 Targeted hotflip attacks and beam search for input reduction (#3206)
f2824fd Predictors for demo LMs, update for coref predictor (#3202)
d78ac70 Language model classes for making predictions (both masked LM and next token LM) (#3201)
8c06c4b Adding a LanguageModelHead abstraction (#3200)
370d512 Dataset readers for masked language modeling and next-token-language-modeling (#3147)
1eaa1ff Link to Discourse in README
030e28c Revert "Revert "Merge branch 'matt-gardner-transformer-embedder'""
6e1e371 Revert "Merge branch 'matt-gardner-transformer-embedder'"
4c7fa73 Merge branch 'matt-gardner-transformer-embedder'
07bdc4a Merge branch 'transformer-embedder' of https://github.com/matt-gardner/allennlp into matt-gardner-transformer-embedder
993034f Minor fixes so PretrainedTransformerIndexer works with roberta (#3203)
70e92e8 doc
ed93e52 pylint
195bf0c override method
6ec74aa Added a TokenEmbedder for use with pytorch-transformers
fb9a971 code for mixed bert embedding layers (#3199)
0e872a0 Clarify that scalar_mix_parameters takes unnormalized weights (#3198)
23efadd upgrade to pytorch 1.2 (#3182)
155a94e Add DropEmAndF1 metric to init.py (#3191)
7738cb5 Add exist_ok parameter to registrable.register decorator. (#3190)
ce6dc72 Add example of initializing weights from pretrained model to doc (#3188)
817814b Update documentation for bert_pooler.py (#3181)
112d8d0 Bump version numbers to v0.9.0-unreleased
v0.8.5
This is (almost certainly) the last release to support versions of PyTorch earlier than 1.2.0. PyTorch 1.2.0 introduces some breaking changes that require changes to the allennlp library (as well as some new features that will allow us to make allennlp better), and so you should expect that the next release will require `torch>=1.2.0`.
This introduces some changes to the `CallbackTrainer`, which moves much of the ancillary training behavior into configurable Callbacks. Its API should still be considered somewhat experimental; in particular, we are open to feedback on its design decisions. There are no current plans to get rid of the classic `Trainer`, although it is likely to get more and more unwieldy as we add new training functionality to the library.
This also includes the "srl_bert" model as featured in the allennlp demo, as well as many other fixes.
Full list of commits:
c8d7327 (tag: v0.8.5) bump version number to v0.8.5
9d8d36a quoref metric and evaluator (#3153)
7bacfad Set pytorch-transformer to 1.1.0 (#3171)
18daa29 Fix pearson correlation.py (#3101)
e641543 Add a test for the subcommands docstring help outputs (#3172)
b6af6eb bug fix for default tokenization of knowledge graph entities (#3170)
0f6b3b8 Make SrlBert model use SrlEvalMetric (#3168)
adad1bc Switch SemanticRoleLabeler metric to SrlEvalScorer. (#3164)
8fffd83 Add missing train command cache options (#3160)
770791a switch for DataIterators whether smaller batches should be skipped (#3140)
111db19 Create method to save instances to cache file. (#3131)
dac486e Fixing NaN warning with single element tensors (#3158)
6bbf82ef Move matplotlib import into function (#3157)
3ef43c9 Upgrade minimum spacy to 2.1.0 (#3152)
1d6e166 Fixing Conda download and install link on Readme (#3151)
f111d8a Pretrained transformer indexer (#3146)
fa1ff67 Add support for running preemptible workloads on beaker (#3143)
0bd3319 Add regularization parameter to Models (#3120)
bf968c6 add keep_as_dict option to Params.pop, use in Vocab and automat… (#3075)
217022f Adding a PretrainedTransformerTokenizer (#3145)
f9e2029 Update HTTP links to HTTPS where possible (#3142)
9093f47 Add dropout option for BERT Pooler (#3109)
23a089c pin pytorch away from 1.2 until we fix the tests (#3128)
0be4b48 fix UpdateMovingAverage.from_params (#3126)
9db0042 Upgrade conllu from 0.11 to 1.3.1 (#3115)
6746d12 pass cache_directory and cache_prefix to non-default trainers (#3077)
031bbf9 allenNLP broken link (#3086)
0caf364 Adding cached_path to input file of the predictor (#3098)
417a757 Adding ability to choose validation DatasetReader with "predict" (#3033)
30c4271 Close tensorboard's event files properly at the end of the training (#3085)
428c151 fix MetadataField.batch_tensors (#3084)
88a61e1 [Embedding] Forward given padding_index param to embedding() (#2504)
9ed9e2c remove executable permission for submods (#3080)
a1476c0 add equality check for index field; allennlp interpret (#3073)
5014d02 remove deprecated function call in hotflip (#3074)
014fe31 Improve dict missing key code (#3071)
9166c18 AllenNLP Interpret Basic Version (#3032)
7728b12 Minor WTQ ERM model and dataset reader fixes for demo (#3068)
ec30c90 remove dropout from test fixtures (#2889)
1cd2193 Revert "Lr scheduler bug" (#3065)
083f343 Lr scheduler bug (#2905)
5c64f9d warn about truncation only once (#3052)
ebe9113 Add options for focal loss (#3036)
c22ed57 Spacy token indexer (#3040)
a33436d Multilabel bug (#3021)
dd3476f simplify callback trainer (#3029)
0663e0b Fixes to ERM decoding script (#3041)
64d16ac fix shape comments (#3025)
354a19b Update documents to sentence_splitter.py (#3023)
427996d Fix forward in EndpointSpanExtractor (#3042)
38a1073 Update README.md
715422c Fix error in BooleanAccuracy when total count is 0 (#2991)
9e52e0f Remove separate start type prediction in state machines (#3030)
c2c4b64 Remove awscli from dependencies (#3024)
ae72049 Make per-batch logging quieter. (#3020)
57870f5 removing unnecessary data iteration (#3027)
e71618d Allow option to only reset some states in _EncoderBase (#2967)
15a9cbe PassThroughIterator (#3015)
70fa4aa Typo in initializers.py (#3016)
c85dcfc fix behavior when num_serialized_models_to_keep is 0 (#2880)
2cce412 fix type in vocab config (#2977)
7943b2f Expose the spacy language model for the word splitter in the se… (#3008)
9b929b2 Fixes inconsistent resetting of metrics with Validate and TrackMetrics callbacks (issue #3001) (#3002)
0a26739 Modified ActionSpaceWalker to use DomainLanguage (#3006)
9a13ab5 (vikigenius/master) do not evaluate after training if non-default trainer (#2997)
30ffaa5 Removed target_token_indexers documentation (#2990)
7b465c6 Clarify docs for CosineWithRestarts (#2953)
fb87872 Add never_split signature (#2463)
b6a2abb correct the wrong parameter note(target_namespace) (#2987)
cf247c6 Add model parameters / modules inspection helper. (#2466)
0fbd1ca WTQ dataset reader and models moved from iterative-search-semparse (#2764)
7e08298 Fix wordpiece indexer truncation (#2931)
03aa838 fix incorrect logging in viterbi decode (#2982)
51b74b1 use starting offsets in the srl model so output is wellformed (#2972)
bcd1070 Add cuda_device to Predictor.from_path (#2974)
459633c rename callbacks (#2966)
0f1bc3b DeprecationWarning removed for op-level token_embedders (#2955)
a655ad5 experimental: add backoff (#2968)
2a59be3 (epwalsh/master) Change registered names of scheduler callbacks (#2964)
2a88450 update sphinx version (#2959)
165b282 Bert srl model (#2961)
3dc99a7 Add missing param in CallbackTrainer.init docstring (#2960)
9e6e0af callback based trainer (#2817)
1b656dd Allowing for bulk adding of tokens to vocab (#2948)
c9eb2d0 Replace current default stopword list with spaCy's. (#2940)
6a3d3a8 ensure regularizers are only computed for parameters requiring gradients (#2887)
acfbb8c Replace s3 path style to virtual host style (#2873)
ac72c88 Fix Model.load fail if model_params is str (#2805)
e7b0013 Linear assignment depricated fix (#2950)
01602c8 Link to Wikitables and ATIS data. (#2947)
eaebe02 Update issue templates to request full stacktrace (#2876)
da16ad1 Multilingual parser and Cross-lingual ELMo (#2628)
92ee421 Fix for cyclic import problems (issue #2935) (#2938)
5e3c4cd token_type_ids fix for window reshaping (#2942)
44ba490 Use unsigned s3 requests when missing credentials (#2939)
c0f44f7 Fixed minor error while calculating span accuracy (#2923)
8e180cb Pad coreference model input to 5 (#2933)
9a0e01f fix trainable and requires_grad kwargs (#2932)
5f37783 Bert srl (#2854)
5b2066b Fix Invalid Index Reference for labels in Vocabulary (#2926)
c629093 CopyNet: replace in-place tensor operation with out-of-place equivalent (#2925)
89700de Change image to docker_image (#2918)
5393882 Change blueprint to image in run_with_beaker (#2903)
6afba9a Fix TextField padding when there are no tokens (#2843)
954a02f (vikigenius/release-v0.8.4, upstream/release-v0.8.4) Bump version numbers to v0.8.5-unreleased