This repository holds the code and data (Wiki-Convert) for our EMNLP 2021 short paper. We show that magnitude-aware number encoders help language models predict words better, and the results transfer to non-numeric contexts as well. Here are some links to better understand our work:
Anthology | PDF | Slides | Video | Poster | Twitter thread | ACL21 Reviews
Please reach out to me at thawani@usc.edu
in case you face any issues or just to chat!
Wiki-Convert: A novel dataset of Wikipedia sentences annotated with numbers. The easiest way to get the data is via Huggingface Datasets library. Simply install the datasets library and run:
from datasets import load_dataset
ds = load_dataset("usc-isi/WikiConvert")
Example:
id | comment | offset | length | number |
---|---|---|---|---|
0 | With a total of 1500 miles of inland waterways, Alabama has among the most of any state. | 16 | 4 | 1500 |
Here, the Wikipedia sentence is provided under the key comment
and the annotated number
is provided via its character offset
and length
, i.e., comment[offset:offset+length] = number
. You will find additional keys UNIQUE_STORY_INDEX
and magnitude
which are irrelevant and were simply added for consistency with the format of the Numeracy600K dataset.
Note that when loading from the Datasets library, numbers larger than sys.maxsize
will be capped to avoid an overflow in PyArrow. For the uncapped version, you may download the json files directly for the train, dev, and test splits.
The dataset sizes are as follows:
Train | Dev | Test | |
---|---|---|---|
# examples | 739583 | 92447 | 92449 |
file size (MBs) | 169 | 20.9 | 20.5 |
If you prefer the NUM
and UNIT
annotations as described in the paper, here is a 233 MB json file. You may also retrieve a larger, unprocessed version of the data at this link.
train.py: model description and training
nice python train.py --batch-size 256 --gpus 0, --tsamples 100_000 --dsamples 10_000 --max_epochs 10 --enc exp --hidden 200 --accumulate_grad_batches 4 --seed 0 --dataset WC
eval.py: reports perplexity and hit@k scores
nice python eval.py --limit 10_000 --ckpt checkpoints/read-WC-def-adj-noun/epoch=9.ckpt --maxtoks 150 --batch-size 128 --device 0
dataset.py: tokenized dataset description
valids/common...txt: list of sentence indices to evaluate
Here's how to cite us for the results or the Wiki-Convert dataset:
@inproceedings{thawani-etal-2021-numeracy,
title = "Numeracy enhances the Literacy of Language Models",
author = "Thawani, Avijit and
Pujara, Jay and
Ilievski, Filip",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2021",
address = "Online and Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.emnlp-main.557",
pages = "6960--6967",
abstract = "Specialized number representations in NLP have shown improvements on numerical reasoning tasks like arithmetic word problems and masked number prediction. But humans also use numeracy to make better sense of world concepts, e.g., you can seat 5 people in your {`}room{'} but not 500. Does a better grasp of numbers improve a model{'}s understanding of other concepts and words? This paper studies the effect of using six different number encoders on the task of masked word prediction (MWP), as a proxy for evaluating literacy. To support this investigation, we develop Wiki-Convert, a 900,000 sentence dataset annotated with numbers and units, to avoid conflating nominal and ordinal number occurrences. We find a significant improvement in MWP for sentences containing numbers, that exponent embeddings are the best number encoders, yielding over 2 points jump in prediction accuracy over a BERT baseline, and that these enhanced literacy skills also generalize to contexts without annotated numbers. We release all code at https://git.io/JuZXn.",
}