This is the repository accompanying our paper AraT5: Text-to-Text Transformers for Arabic Language Understanding and Generation. In this repository we:

- Introduce AraT5MSA, AraT5Tweet, and AraT5: three powerful Arabic-specific text-to-text Transformer-based models;
- Introduce ARGEN: a new benchmark for Arabic language generation and evaluation covering seven Arabic NLP tasks, namely, machine translation, summarization, news title generation, question generation, paraphrasing, transliteration, and code-switched translation;
- Evaluate AraT5 models on ARGEN and compare them against available language models.

Our models establish new state-of-the-art (SOTA) results on several publicly available datasets. Our language models are publicly available for research (see below).
The rest of this repository provides more information about our new language models, benchmark, and experiments.
We are excited to announce the next version of AraT5, AraT5v2:

- More Data. AraT5v2 is trained on larger and more diverse Arabic data.
- Larger Sequence Length. We increase the maximum sequence length from 512 to 1024 in this version.
- Faster Convergence. During fine-tuning, AraT5v2 converges ~10x faster than the previous version (AraT5-base).
- Extra IDs. AraT5v2 supports 100 sentinel tokens (a.k.a. unique mask tokens).

🤗 Hugging Face: https://huggingface.co/UBC-NLP/AraT5v2-base-1024
We recommend using AraT5v2-base-1024 instead of the previous version (AraT5-base).
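As a quick illustration (our own sketch, not part of the official release notes), AraT5v2 loads with the standard transformers API, and its sentinel tokens can be used for T5-style span infilling; the example sentence below is invented:

```python
# Minimal sketch: loading AraT5v2 and using one of its 100 sentinel tokens
# (<extra_id_0> ... <extra_id_99>) for T5-style span infilling.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/AraT5v2-base-1024")
model = AutoModelForSeq2SeqLM.from_pretrained("UBC-NLP/AraT5v2-base-1024")

# "The capital of <extra_id_0> is Rabat." -- the sentinel marks the masked span.
text = "عاصمة <extra_id_0> هي الرباط."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```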
- 1. Our Language Models
- 2. ARGEN Benchmark and AraT5 Evaluation
- 3. How to use AraT5 model
- 4. Ethics
- 5. AraT5 Models Checkpoints
- 6. Citation
- 7. Acknowledgments
- MSA Training Data: We use 70GB of MSA text (7.1B tokens) from the following sources: AraNews, El-Khair, Gigaword, OSCAR, OSIAN, Arabic Wikipedia, and Hindawi Books.
- Twitter Training Data: We randomly sample 1.5B Arabic tweets from a large in-house dataset of about 10B tweets. We use string matching to include only tweets with at least 3 Arabic words, regardless of whether they also contain non-Arabic strings (see the sketch below). The resulting dataset comprises 178GB of text (21B tokens).
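For concreteness, a filter like the one described above can be approximated with a simple Arabic-script regular expression. This is our own illustration, not the authors' exact string-matching pipeline:

```python
# Rough sketch of the "at least 3 Arabic words" tweet filter described above.
# We count maximal runs of Arabic-script characters as words; the authors'
# exact matching rules may differ.
import re

ARABIC_WORD = re.compile(r"[\u0600-\u06FF]+")

def keep_tweet(text: str, min_arabic_words: int = 3) -> bool:
    return len(ARABIC_WORD.findall(text)) >= min_arabic_words

print(keep_tweet("مرحبا بكم في المدونة"))  # True: four Arabic words
print(keep_tweet("good morning!"))          # False: no Arabic words
```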
To train our AraT5 models, we use the same architectures as T5-base and T5-small (Raffel et al., 2019); in the base configuration, both the encoder and the decoder have 12 layers, each with 12 attention heads and 768 hidden units.
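These hyperparameters can be checked directly from a released checkpoint; a small sketch using the standard Hugging Face T5Config attribute names:

```python
# Sketch: inspecting the T5-base-style architecture of the released model.
# num_layers / num_heads / d_model are the standard Hugging Face T5Config fields.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("UBC-NLP/AraT5-base")
print(config.num_layers, config.num_heads, config.d_model)  # expected: 12 12 768
```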
We pre-train three powerful variants of the text-to-text transformer (T5) model dedicated to Modern Standard Arabic (MSA) and Arabic dialects. AraT5 comes in three flavors:

- AraT5MSA: trained exclusively on MSA data;
- AraT5Tweet: trained on Twitter data (a mix of MSA and dialectal Arabic);
- AraT5: trained on both Twitter and MSA data.
To evaluate our models, we also introduce ARGEN, a new benchmark for Arabic language generation and evaluation. ARGEN is composed of seven tasks, namely, machine translation, summarization, news title generation, question generation, paraphrasing, transliteration, and code-switched translation. ARGEN is collected from a total of ten datasets, including two new large datasets proposed in this work.
Machine translation (Arabic → English) results:

Dataset | Test Split | mT5 | AraT5Tweet | AraT5MSA | AraT5 |
---|---|---|---|---|---|
Bible II Sajjad et al. (2020) | Test 1 | 15.58 | 13.04 | 16.38 | 15.71 |
Bible II Sajjad et al. (2020) | Test 2 | 12.1 | 9.2 | 12.53 | 11.64 |
MADAR Bouamor et al. (2018) | MSA-EN | 11.84 | 11.11 | 11.42 | 10.57 |
IWSLT Cettolo et al. (2016) | TED15 | 29.39 | 28.2 | 30.37 | 30.45 |
IWSLT Cettolo et al. (2016) | TED16 | 28.39 | 27.03 | 29.37 | 29.18 |
IWSLT Cettolo et al. (2016) | QED16 | 21.09 | 18.55 | 20.98 | 19.11 |
UN Ziemski et al. (2016) | AR-EN | 52.38 | 51.48 | 53.29 | 52.96 |
Metric is BLEU. MADAR Bouamor et al. (2018) results (25 datasets) are shown in Table 6 of the paper.
Machine translation (dialectal Arabic → English) results:

Dataset | Test Split | mT5 | AraT5Tweet | AraT5MSA | AraT5 |
---|---|---|---|---|---|
ADPT Zbib et al. (2012) | Lev | 8.33 | 8.32 | 8.52 | 8.42 |
ADPT Zbib et al. (2012) | Egy | 12.57 | 11.25 | 12.38 | 12.92 |
Bible I Sajjad et al. (2020) | Tun | 8.08 | 5.86 | 8.52 | 7.94 |
Bible I Sajjad et al. (2020) | Mor | 7.21 | 4.69 | 7.83 | 6.82 |
QAraCy Sajjad et al. (2020) | Qat | 11.84 | 11.11 | 11.42 | 10.57 |
Metric is BLEU.
Machine translation (X → MSA) results:

Split | mT5 | AraT5MSA |
---|---|---|
EN → MSA | 17.80 | 18.58 |
DE → MSA | 11.92 | 12.80 |
FR → MSA | 18.61 | 18.99 |
RU → MSA | 26.63 | 28.01 |
Metric is BLEU. All splits are from the UN corpus (Ziemski et al., 2016).
Text summarization results:

Dataset | Metric | mT5 | AraT5Tweet | AraT5MSA | AraT5 |
---|---|---|---|---|---|
EASC El-Haj et al. (2010) | ROUGE-1 | 62.98 | 60.74 | 59.54 | 54.61 |
EASC El-Haj et al. (2010) | ROUGE-2 | 51.93 | 48.89 | 47.37 | 43.58 |
EASC El-Haj et al. (2010) | ROUGE-L | 62.98 | 60.73 | 59.55 | 54.55 |
WikiLin Alami et al. (2021) | ROUGE-1 | 71.63 | 74.61 | 72.64 | 73.48 |
WikiLin Alami et al. (2021) | ROUGE-2 | 63.60 | 67.00 | 64.21 | 65.09 |
WikiLin Alami et al. (2021) | ROUGE-L | 71.56 | 74.52 | 72.57 | 73.37 |
News title generation and question generation results:

Dataset | Metric | mT5 | AraT5Tweet | AraT5MSA | AraT5 |
---|---|---|---|---|---|
ARGENNTG Nagoudi et al. (2020) | BLEU | 19.49 | 20.00 | 20.61 | 20.51 |
ARGENQG Nagoudi et al. (2021) | BLEU | 15.29 | 12.06 | 14.18 | 16.99 |
Paraphrasing and transliteration results:

Dataset | Metric | mT5 | AraT5Tweet | AraT5MSA | AraT5 |
---|---|---|---|---|---|
ARGENPPH I Cer et al. (2017) | BLEU | 19.32 | 18.17 | 19.38 | 19.03 |
ARGENPPH II Alian et al. (2021) | BLEU | 19.25 | 17.34 | 19.43 | 18.42 |
ARGENTR Song et al. (2014) | BLEU | 60.81 | 59.55 | 65.88 | 62.51 |
Code-switched translation results:

Dataset | Type | mT5 | AraT5Tweet | AraT5MSA | AraT5 |
---|---|---|---|---|---|
ALG-FR → FR | Natural | 23.83 | 28.19 | 26.27 | 26.17 |
JOR-EN → EN | Natural | 23.06 | 21.60 | 21.58 | 20.45 |
MSA-FR → FR | Synthetic | 11.06 | 8.99 | 11.53 | 11.42 |
MSA-EN → EN | Synthetic | 19.25 | 17.34 | 19.43 | 18.42 |
MSA-FR → MSA | Synthetic | 12.93 | 12.14 | 14.39 | 13.92 |
MSA-EN → MSA | Synthetic | 19.82 | 18.43 | 23.89 | 24.37 |
Metric is BLEU. All ARGENCS datasets are from Nagoudi et al. (2021).
Below is an example of fine-tuning AraT5-base for news title generation on the AraNews dataset:

```bash
!python run_trainier_seq2seq_huggingface.py \
    --learning_rate 5e-5 \
    --max_target_length 128 --max_source_length 128 \
    --per_device_train_batch_size 8 --per_device_eval_batch_size 8 \
    --model_name_or_path "UBC-NLP/AraT5-base" \
    --output_dir "/content/AraT5_FT_title_generation" --overwrite_output_dir \
    --num_train_epochs 3 \
    --train_file "/content/ARGEn_title_genration_sample_train.tsv" \
    --validation_file "/content/ARGEn_title_genration_sample_valid.tsv" \
    --task "title_generation" --text_column "document" --summary_column "title" \
    --load_best_model_at_end --metric_for_best_model "eval_bleu" --greater_is_better True \
    --evaluation_strategy epoch --logging_strategy epoch --predict_with_generate \
    --do_train --do_eval
```
For more details about this fine-tuning example, please see this notebook.
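The `--train_file` and `--validation_file` flags point to tab-separated files whose column names match `--text_column` and `--summary_column`. As a rough illustration (the rows are invented placeholders, and the assumption of a header row is ours), such a file could be produced like this:

```python
# Hypothetical sample TSV matching the --text_column "document" and
# --summary_column "title" flags above; the rows are invented placeholders.
import csv

rows = [
    {"document": "نص الخبر الأول ...", "title": "عنوان الخبر الأول"},
    {"document": "نص الخبر الثاني ...", "title": "عنوان الخبر الثاني"},
]
with open("ARGEn_title_genration_sample_train.tsv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["document", "title"], delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
```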
In addition, we release the fine-tuned checkpoint for News Title Generation (NTG) described in the paper. The model is available on Hugging Face (UBC-NLP/AraT5-base-title-generation):
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/AraT5-base-title-generation")
model = AutoModelForSeq2SeqLM.from_pretrained("UBC-NLP/AraT5-base-title-generation")

Document = "تحت رعاية صاحب السمو الملكي الأمير سعود بن نايف بن عبدالعزيز أمير المنطقة الشرقية اختتمت غرفة الشرقية مؤخرا، الثاني من مبادرتها لتأهيل وتدريب أبناء وبنات المملكة ضمن مبادرتها المجانية للعام 2019 حيث قدمت 6 برامج تدريبية نوعية. وثمن رئيس مجلس إدارة الغرفة، عبدالحكيم العمار الخالدي، رعاية سمو أمير المنطقة الشرقية للمبادرة، مؤكدا أن دعم سموه لجميع أنشطة ."

# Tokenize the document; padding="max_length" replaces the deprecated
# pad_to_max_length=True argument.
encoding = tokenizer.encode_plus(Document, padding="max_length", return_tensors="pt")
input_ids, attention_masks = encoding["input_ids"], encoding["attention_mask"]

# Sample five candidate titles with top-k/top-p decoding.
outputs = model.generate(
    input_ids=input_ids,
    attention_mask=attention_masks,
    max_length=256,
    do_sample=True,
    top_k=120,
    top_p=0.95,
    num_return_sequences=5,
)

for i, output in enumerate(outputs):  # "i" avoids shadowing the built-in id()
    title = tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print("title#" + str(i), title)
```
The generated titles for the input news document above:

```text
title#0 غرفة الشرقية تختتم المرحلة الثانية من مبادرتها لتأهيل وتدريب أبناء وبنات المملكة
title#1 غرفة الشرقية تختتم الثاني من مبادرة تأهيل وتأهيل أبناء وبناتنا
title#2 سعود بن نايف يختتم ثانى مبادراتها لتأهيل وتدريب أبناء وبنات المملكة
title#3 أمير الشرقية يرعى اختتام برنامج برنامج تدريب أبناء وبنات المملكة
title#4 سعود بن نايف يرعى اختتام مبادرة تأهيل وتدريب أبناء وبنات المملكة
```
Our models are developed using data from the public domain. We provide access to our models to accelerate scientific research with no liability on our part. Please use our models and benchmark only ethically. This includes, for example, respecting and protecting people's privacy. We encourage all researchers who decide to use our models to adhere to the highest standards. For example, if you apply our models to Twitter data, we encourage you to review Twitter's developer policy. Among others, Twitter provides the following policy around use of sensitive information:
You should be careful about using Twitter data to derive or infer potentially sensitive characteristics about Twitter users. Never derive or infer, or store derived or inferred, information about a Twitter user’s:
- Health (including pregnancy)
- Negative financial status or condition
- Political affiliation or beliefs
- Racial or ethnic origin
- Religious or philosophical affiliation or beliefs
- Sex life or sexual orientation
- Trade union membership
- Alleged or actual commission of a crime
- Aggregate analysis of Twitter content that does not store any personal data (for example, user IDs, usernames, and other identifiers) is permitted, provided that the analysis also complies with applicable laws and all parts of the Developer Agreement and Policy.
AraT5 PyTorch and TensorFlow checkpoints are available on the Hugging Face website for direct download and use, exclusively for research. For commercial use, please contact the authors via email (*muhammad.mageed[at]ubc[dot]ca*).
Model | Link |
---|---|
AraT5-base | https://huggingface.co/UBC-NLP/AraT5-base |
AraT5-msa-base | https://huggingface.co/UBC-NLP/AraT5-msa-base |
AraT5-tweet-base | https://huggingface.co/UBC-NLP/AraT5-tweet-base |
AraT5-msa-small | https://huggingface.co/UBC-NLP/AraT5-msa-small |
AraT5-tweet-small | https://huggingface.co/UBC-NLP/AraT5-tweet-small |
Title generation model | https://huggingface.co/UBC-NLP/AraT5-base-title-generation |
🔥AraT5v2-base-1024🔥 | https://huggingface.co/UBC-NLP/AraT5v2-base-1024 |
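Any checkpoint in the table can be loaded by name with the same two calls; a minimal sketch:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

name = "UBC-NLP/AraT5-msa-base"  # swap in any model name from the table above
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)
```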
If you use our AraT5 models for your scientific publication, or if you find the resources in this repository useful, please cite our papers as follows:
AraT5v1 Models
```bibtex
@inproceedings{nagoudi-etal-2022-arat5,
    title = "{A}ra{T}5: Text-to-Text Transformers for {A}rabic Language Generation",
    author = "Nagoudi, El Moatez Billah  and
      Elmadany, AbdelRahim  and
      Abdul-Mageed, Muhammad",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.47",
    pages = "628--647",
}
```
AraT5v2 Models
```bibtex
@inproceedings{elmadany-etal-2023-octopus,
    title = "Octopus: A Multitask Model and Toolkit for {A}rabic Natural Language Generation",
    author = "Elmadany, AbdelRahim  and
      Nagoudi, El Moatez Billah  and
      Abdul-Mageed, Muhammad",
    booktitle = "Proceedings of ArabicNLP 2023",
    month = dec,
    year = "2023",
    address = "Singapore (Hybrid)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.arabicnlp-1.20",
    doi = "10.18653/v1/2023.arabicnlp-1.20",
    pages = "232--243",
}
```
We gratefully acknowledge support from the Natural Sciences and Engineering Research Council of Canada, the Social Sciences and Humanities Research Council of Canada, the Canada Foundation for Innovation, Compute Canada, and UBC ARC-Sockeye. We also thank the Google TensorFlow Research Cloud (TFRC) program for providing us with free TPU access.