One of the most significant tasks in natural language processing (NLP) is text generation, which benefits from recent architectures built on large pre-trained transformer models, such as the Generative Pre-trained Transformer-2 (GPT-2) and GPT-3 developed by OpenAI, and Google's Bidirectional Encoder Representations from Transformers (BERT). This paper presents an NLG model based on the GPT-2 architecture that generates Romanian online news, using manually annotated texts. A small Romanian GPT-2 model named MCBGPT-2 was developed, tested and evaluated on roughly 24 thousand news items. Additionally, an existing Romanian GPT-2 model, called RoGPT-2, was included in the experiments. For evaluation, several automatic metrics such as BLEU, ROUGE, BLEURT and BERTScore are compared on news items generated from the test and validation datasets. Experimental results revealed that the MCBGPT-2 and RoGPT-2 models provide similar performance on the text generation task for Romanian, even though the MCBGPT-2 model was trained on less data.
Figure 1 presents the top news sources and the distribution of articles per source, for example agerpres.ro – 2,320, monitorulapararii.ro – 720, news.ro – 614 and evz.ro – 536 articles. The dataset was split into train (60%), validation (20%), and test (20%) subsets. The training process of the MCBGPT-2 model used 14,756 news items, while the test and validation datasets each contain 4,922 news items.
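As a minimal sketch of such a 60/20/20 split (not necessarily the exact procedure used for the corpus; news_items is a hypothetical placeholder for the loaded list of articles):

import random

def split_corpus(news_items, seed=42):
    # shuffle the corpus and slice it into 60% train, 20% validation, 20% test
    items = list(news_items)
    random.Random(seed).shuffle(items)
    n = len(items)
    return items[:int(0.6 * n)], items[int(0.6 * n):int(0.8 * n)], items[int(0.8 * n):]

# train_items, validation_items, test_items = split_corpus(news_items)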
Table 1 and Figure 2 present the distributions of the test, validation, and train datasets, revealing that the datasets have a similar average number of words per news item (~200 words) and that the vocabulary sizes (unique words from The Explanatory Dictionary of the Romanian Language – DEX) of the test and validation datasets are well balanced (~1M words each), while the training dataset is about three times larger (~3M words).
Table 1 Distribution of words across the test, validation and train datasets
Measure | Dataset | Value |
---|---|---|
Romanian unique words | Test | 957,787 |
Romanian unique words | Validation | 1,026,680 |
Romanian unique words | Train | 2,837,236 |
Average words per news item | Test | 194 |
Average words per news item | Validation | 208 |
Average words per news item | Train | 192 |
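As a rough sketch of how the per-dataset statistics in Table 1 could be computed (assuming each news item is available as a plain-text string; the DEX vocabulary filter is only indicated as a comment, since the dictionary word list is not part of this listing):

import re

def corpus_statistics(news_items):
    # total word count, vocabulary size and average words per news item for one dataset
    vocabulary = set()
    total_words = 0
    for text in news_items:
        words = re.findall(r"\w+", text.lower())
        total_words += len(words)
        vocabulary.update(words)
        # the paper additionally restricts the vocabulary to DEX entries (word list not shown here)
    return total_words, len(vocabulary), total_words / max(len(news_items), 1)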
The experiments were performed on a single server using tf.distribute.MirroredStrategy, a TensorFlow API that distributes training across multiple GPUs, together with several software packages such as CUDA (vers. 11.2), Python (vers. 3.9.7) and TensorFlow (vers. 2.4.1). The proposed MCBGPT-2 model was trained and validated on the aforementioned news corpus, which contains 18 fake and 24,582 true news items, as well as 368 news items with negative polarity and 2,135 with positive polarity, distributed across the train, test and validation datasets. For this research, the Adam optimizer was used with a small learning rate (3 × 10⁻⁵), sparse categorical cross-entropy as the loss function and a vocabulary size of 60,000 words. Several training statistics of the proposed model, including the hyperparameters and training time, are presented in Table 2.
Table 2 Training statistics of MCBGPT-2 model
Parameter name | Value |
---|---|
Number of parameters | 131M |
Number of epochs | 15 |
Duration of an epoch | 5 h |
Context size | 512 |
Batch size | 12 |
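As a hedged sketch of the distribution setup mentioned above (the model construction is not shown in the original listing; TFGPT2LMHeadModel and the configuration values below are assumptions based on the reported hyperparameters):

import tensorflow as tf
from transformers import GPT2Config, TFGPT2LMHeadModel

# replicate the model across the GPUs visible on the server
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # hypothetical configuration: 60,000-word vocabulary and a 512-token context
    config = GPT2Config(vocab_size=60000, n_positions=512)
    model = TFGPT2LMHeadModel(config)

In practice, the optimizer definition and the model.compile call shown below would typically also run inside the same strategy scope.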
# data pipeline and training hyperparameters
block_size = 512
BATCH_SIZE = 12
BUFFER_SIZE = 1000
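# NOTE (sketch): the `dataset` passed to model.fit below is assumed to be built roughly
# as follows from a flat sequence of GPT-2 token ids; `token_ids` is a hypothetical
# placeholder that is not defined in the original listing
examples = tf.data.Dataset.from_tensor_slices(token_ids)
# group the token stream into blocks of block_size + 1 tokens
blocks = examples.batch(block_size + 1, drop_remainder=True)

def split_input_target(block):
    # inputs are all tokens except the last one; labels are the block shifted by one token
    return block[:-1], block[1:]

dataset = (blocks.map(split_input_target)
           .shuffle(BUFFER_SIZE)
           .batch(BATCH_SIZE, drop_remainder=True))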
# defining our optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
# defining our loss function
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
# defining our metric which we want to observe
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
# compiling the model
model.compile(optimizer=optimizer, loss=[loss, *[None] * model.config.n_layer], metrics=[metric])
num_epoch = 15
history = model.fit(dataset, epochs=num_epoch)
import os
from transformers import WEIGHTS_NAME, CONFIG_NAME
output_dir = configuration.get_property('PYTHON_DIR') + "/gpt_models/mcb_model"
# creating the output directory if it is not present
if not os.path.exists(output_dir):
    os.mkdir(output_dir)
# unwrap the model if it has been wrapped (e.g., by a distribution wrapper)
model_to_save = model.module if hasattr(model, 'module') else model
output_model_file = os.path.join(output_dir, WEIGHTS_NAME)
output_config_file = os.path.join(output_dir, CONFIG_NAME)
# save model and model configs
model.save_pretrained(output_dir)
model_to_save.config.to_json_file(output_config_file)
# save tokenizer
tokenizer.save_pretrained(output_dir)
output_dir = configuration.get_property('PYTHON_DIR') + "/gpt_models/mcb_model/"
# make CUDA kernel launches synchronous to simplify debugging
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
# TensorFlow
from transformers import AutoTokenizer, TFAutoModelForCausalLM
# load the original news texts from the database (sqlI is a project-specific helper)
results = sqlI.select_all_rows_from_table(['columns'], 'table', 'database', None, 'primaryKey DESC')
tokenizer = AutoTokenizer.from_pretrained(output_dir)
model = TFAutoModelForCausalLM.from_pretrained(output_dir)
updated_texts = dict()
# clean each retrieved text and index it by its primary key
for result in results:
    updated_texts.update({result[0]: configuration.clean_text(result[1])})
for id_single, text in updated_texts.items():
    max_length = 4000
    min_length = 600
    # encode the prompt, truncated to the 512-token context of the model
    inputs = tokenizer.encode(text, return_tensors='tf', max_length=512, truncation=True)
    # beam-search generation without repeated bigrams
    text_predicted = model.generate(inputs, max_length=int(max_length), min_length=int(min_length),
                                    no_repeat_ngram_size=2,
                                    temperature=0.8, num_beams=5, num_return_sequences=5)
    new_text = tokenizer.decode(text_predicted[0], skip_special_tokens=True).replace('<|endoftext|>', '')
    new_text = new_text.replace("'", ' ')
    # store the generated text back into the database
    sqlI.update_field_for_all_rows_from_table(['column'], "'" + new_text + "'",
                                              'table',
                                              'primaryKey="' + str(id_single) + '" AND primaryKey > 0',
                                              'database')
For the experiments, the automatic metrics were configured with the following input parameters (a short computation sketch follows this list):
- Bilingual Evaluation Understudy Score (BLEU) – only one reference was used, the maximum n-gram length was set to four and no pre-processing techniques were applied.
- Recall-Oriented Understudy for Gisting Evaluation (ROUGE) – the best values were achieved by measuring the match rate of unigrams (ROUGE-1).
- Bilingual Evaluation Understudy with Representations from Transformers (BLEURT) – a BLEURT checkpoint was used, i.e., a self-contained folder containing a regression model that was tested on several languages but should work for the 100+ languages of multilingual C4 (a cleaned version of Common Crawl's web crawl corpus), including Romanian, with 45M training and 45K validation examples. Specifically, the BLEURT-20 checkpoint was used, a 32-layer pre-trained transformer model named RemBERT with 579M parameters, fine-tuned on human ratings and synthetic data (~590K sentence pairs) collected between 2015 and 2019 from the WMT Metrics Shared Task.
- BERTScore – "bert-base-multilingual-cased" was used as the model type, a 12-layer transformer with token embeddings of size 768, trained by Google on Wikipedia dumps from 104 languages, including Romanian, which was explicitly specified (lang="ro"); the selected number of layers was 9 (num_layers=9).
Additional metrics used in our experiments (see dataset):
- METEOR (Metric for Evaluation of Translation with Explicit ORdering)
- GLEU (General Language Evaluation Understanding)
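As a hedged sketch (not the exact evaluation scripts), the BLEU, ROUGE-1 and BLEURT scores with the parameters listed above could be obtained as follows; reference and candidate are hypothetical placeholders for an original news item and its generated counterpart, and the BLEURT-20 checkpoint is assumed to be downloaded locally:

from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer
from bleurt import score as bleurt_score

reference = "..."  # original news item (placeholder)
candidate = "..."  # generated news item (placeholder)

# BLEU with a single reference and a maximum n-gram length of four
bleu = sentence_bleu([reference.split()], candidate.split(), weights=(0.25, 0.25, 0.25, 0.25))

# ROUGE-1: match rate of unigrams between reference and candidate
rouge1 = rouge_scorer.RougeScorer(['rouge1']).score(reference, candidate)['rouge1'].fmeasure

# BLEURT with the BLEURT-20 (RemBERT) checkpoint
bleurt = bleurt_score.BleurtScorer('BLEURT-20').score(references=[reference], candidates=[candidate])[0]

BERTScore itself is computed with bert_score.score in the listing below, using the multilingual BERT checkpoint and num_layers=9.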
import os
from bert_score import score
from nltk.corpus import stopwords
from transformers import AutoTokenizer, AutoModelWithLMHead
output_dir = configuration.get_property('PYTHON_DIR') + "/gpt_models/bert-base-multilingual-cased/"
tokenizer = AutoTokenizer.from_pretrained(output_dir)
model = AutoModelWithLMHead.from_pretrained(output_dir)
# load the original and generated texts from the database (sqlI is a project-specific helper)
results = sqlI.select_all_rows_from_table(['sql columns'],
                                          'table',
                                          'database', None, 'primaryKey ASC')
# extend the NLTK Romanian stop-word list with a project-specific stop-word file
stop = stopwords.words('romanian')
if os.path.isfile(configuration.get_property('CONFIG_PATH') + os.sep + 'stop_words.txt'):
    with open(configuration.get_property('CONFIG_PATH') + os.sep + 'stop_words.txt', 'r') as opened_file:
        lines = opened_file.read().split("\n")
        for line in lines:
            if configuration.clean_text(line) not in stop:
                stop.append(configuration.clean_text(line))
for id_single, text, new_text in results:
    try:
        # BERTScore with the multilingual BERT checkpoint, using layer 9 representations
        P, R, F1 = score([text], [new_text], model_type=output_dir, num_layers=9, lang="ro", verbose=True)
        final = F1.mean()
        value = final.item()
        # store the F1 BERTScore back into the database
        sqlI.update_field_for_all_rows_from_table(['sql columns'], value,
                                                  'table',
                                                  'primaryKey="' + str(id_single) + '" AND primaryKey > 0',
                                                  'database')
    except Exception as error:
        # skip pairs that cannot be scored (the original exception handling is not shown)
        print(error)
Table 3 Scores of generated news items using the RoGPT-2 and MCBGPT-2 models for the test and validation datasets
Model name | Dataset | BLEU | ROUGE | BLEURT | BERTScore |
---|---|---|---|---|---|
RoGPT-2 | Test | 32.29 | 0.53 | 0.68 | 0.8106 |
RoGPT-2 | Validation | 28.70 | 0.50 | 0.52 | 0.8136 |
MCBGPT-2 | Test | 8.79 | 0.25 | 0.63 | 0.8124 |
MCBGPT-2 | Validation | 9.11 | 0.14 | 0.50 | 0.8277 |
This paper focused on the quantitative evaluation (Figure 4, Figure 5, Figure 6, Figure 7) using different automatic metrics, while future studies will focus on the qualitative evaluation.
In this research, a new GPT-2-based model has been tested and evaluated, generating news items from short Romanian text prompts. Using automatic metrics such as BLEU, ROUGE, BLEURT and BERTScore, the MCBGPT-2 and RoGPT-2 models were compared, thus providing another solution for Romanian text generation systems.
In terms of quantitative evaluation, the MCBGPT-2 model achieves slightly better BERTScore values than the RoGPT-2 model when longer sentences are evaluated (Table 4).
Table 4 Examples of Romanian generated sentences
State | Text |
---|---|
Original | Coordonatorul campaniei naționale de vaccinare, medicul militar Valeriu Gheorghiță a vorbit despre finalul pandemiei în România.Întrebat, într-o emisiune televizată, când crede că se va reveni la viața normală, doctorul Gheorghiță a spus Eu sper căt mai repede. Dacă mă întrebați de o lună anume, cred că undeva în iulie august ușor ușor vom intra într-o normalitate care va fi din ce în ce mai concretizată pe măsură ce nivelul de acoperire vaccinală va crește undeva peste a spus medicul Valeriu Gheorghiță... | |||
Link | https://www.ziarulprofit.ro/index.php/cand-se-va-termina-pandemia-in-romania-doctorul-valeriu-gheorghita-da-raspunsul-video/ | |||
Generated text | ||||
RoGPT-2 model | ….a spus medicul Valeriu Gheorghiță, care a mai spus că s-a întâlnit cu ministrul sănătății, Sorina Pintea, pentru a discuta despre starea de sănătate a populației și despre măsurile care se vor lua pentru ca să nu se mai ajungă la astfel de situații, iar ministrul a promis că va avea o discuție cu el și va vedea care sunt măsurile pe care ministerul le va lua, dar și ce măsuri vor fi luate de către minister, atât pentru prevenirea, cât și pentru tratarea celor care au fost depistați cu COVID-19 și care nu au putut fi vaccinați, precum și măsuri de combatere a epidemiei de coronavirus și a altor boli care pot fi transmise prin contact direct cu o persoană infectată cu virusul imunodeficienței umane (HIV). | |||
Metrics | BLEU | ROUGE | BLEURT | BERTScore |
Scores | 22.43 | 0.39 | 0.83 | 0.8487 |
MCBGPT-2 model | ….a spus medicul Valeriu Gheorghiță. În ceea ce privește vaccinarea cu cea de-a treia doză de vaccin, medicul a explicat că, într-adevăr, nu se pune problema de a face o formă gravă de boală, dar nu este un lucru foarte important pentru noi toți cei care se vaccinează cu prima schemă cu vaccinul de la compania Johnson Johnson, pentru că nu avem niciun fel de imunitate de grup peste 40%, iar pentru persoanele care au fost vaccinate cu schemă completă au un risc scăzut de cel puțin 20 de ori să se infecteze și cu o creștere a numărului de persoane care s-au vaccinat cu schema completă. De asemenea, el a mai spus că este important să avem o rată de pozitivare a testelor de maxim 20 la sută, a adăugat medicul. | |||
Metrics | BLEU | ROUGE | BLEURT | BERTScore |
Scores | 20.74 | 0.38 | 0.75 | 0.8588 |