BERTweet is the first public large-scale pre-trained language model for English Tweets. It outperforms state-of-the-art models on three Tweet NLP tasks: part-of-speech (POS) tagging, named-entity recognition (NER) and text classification.
BERTweet is trained on an 80GB corpus of 850M English Tweets, using the RoBERTa pre-training procedure with the same model configuration as BERT-base.
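As a concrete usage sketch, the snippet below loads the model through the Hugging Face transformers library; the "vinai/bertweet-base" checkpoint identifier and the example Tweet are assumptions here, so substitute the released checkpoint name if it differs.

```python
# A minimal sketch, assuming the checkpoint is published as "vinai/bertweet-base";
# the configuration mirrors BERT-base (12 layers, hidden size 768, 12 heads).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
model = AutoModel.from_pretrained("vinai/bertweet-base")

# Encode one Tweet and extract its contextual embeddings.
inputs = tokenizer("SC has first two presumptive cases of coronavirus",
                   return_tensors="pt")
with torch.no_grad():
    features = model(**inputs).last_hidden_state  # shape: (1, seq_len, 768)
```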
Preprocessing details are given in the 'pre-training data' section: Tweets are tokenized with the TweetTokenizer from the NLTK toolkit, emotion icons are handled with the emoji package, and the Tweets are normalized. We then use fastBPE to segment the Tweets into subword units, with a vocabulary of 64K subword types. On average, there are 25 subword tokens per Tweet.
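The sketch below illustrates these preprocessing steps; the @USER and HTTPURL placeholder names and the regular expressions are assumptions rather than the authors' exact script, and the fastBPE segmentation step is not shown.

```python
# A minimal normalization sketch in the spirit of the steps above (assumed
# placeholder names @USER / HTTPURL; fastBPE subword segmentation omitted).
import re

from emoji import demojize
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()

def normalize_tweet(tweet: str) -> str:
    # Replace user mentions and URLs with placeholder tokens.
    tweet = re.sub(r"@\w+", "@USER", tweet)
    tweet = re.sub(r"https?://\S+|www\.\S+", "HTTPURL", tweet)
    # Tokenize with NLTK's TweetTokenizer and convert emoji to text strings.
    tokens = tokenizer.tokenize(tweet)
    tokens = [demojize(tok) for tok in tokens]
    return " ".join(tokens)

print(normalize_tweet("@jack check this out https://example.com 🔥"))
# -> "@USER check this out HTTPURL :fire:"
```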
Our results comparing the “soft” and “hard” normalization strategies applied to the pre-trained language models confirm the previous view that lexical normalization of Tweets is a lossy translation task.
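To see why such normalization is lossy, consider the toy example below; it is not the paper's normalization pipeline, and the dictionary entries are hypothetical.

```python
# Toy illustration of "hard" lexical normalization: mapping non-standard
# spellings onto canonical forms discards stylistic cues. Hypothetical dictionary.
NORM_DICT = {"2morrow": "tomorrow", "luv": "love", "gr8": "great"}

def hard_normalize(tokens):
    """Replace non-standard spellings with canonical forms where known."""
    return [NORM_DICT.get(tok.lower(), tok) for tok in tokens]

print(hard_normalize(["luv", "u", ",", "see", "u", "2morrow", "!!!"]))
# -> ['love', 'u', ',', 'see', 'u', 'tomorrow', '!!!']
# The normalized text no longer distinguishes "luv" from "love", a signal a
# Tweet-specific model can otherwise exploit.
```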
Although RoBERTa is pre-trained on about 2x more data than BERTweet, and XLM-R on about 3.75x more, BERTweet still outperforms both. This confirms the effectiveness of a large-scale, domain-specific pre-trained language model for English Tweets.
In future work, we will release a “large” version of BERTweet.