
Persian irony detection: includes a Persian dataset, a pipeline for creating a dataset automatically, and finetuning of transformer-based language models for the task


fatemenajafi135/Irony-detection


Persian Irony Detection using transformer-based language models

  • input: a text in Persian
  • output: a label classifying the text as ironic or non-ironic

Dataset

Existing datasets:

Steps to create a new dataset (crawl Persian tweets from a Telegram channel and label them automatically):

  • crawling: crawling.py crawls messages from public Telegram channels using the Telegram API and saves them as JSON files in ./crawled_messages.

  • gathering: gathering.py concatenates the crawled files, collects the wanted attributes of each tweet in a Pandas DataFrame, and saves it as messages.csv.

  • cleaning: Applies basic cleaning to messages.csv and saves the result to messages_cleaned.csv.

  • labeling: Labels each tweet by its two most common reactions and splits the dataset into train and test sets. The files are saved in ../dataset/.

  • Run (this replaces the previous dataset):

cd creating_dataset/
pip install -r requirements.txt
python crawling.py
python gathering.py
python cleaning.py
python labeling.py
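
The labeling step can be illustrated with a minimal sketch. It is not the code in labeling.py: the mapping from reactions to labels (`IRONIC_REACTIONS`), the function names, the 80/20 split ratio, and the shape of the reaction data (a dict of emoji to count) are all assumptions made for illustration.

```python
import random
from collections import Counter

# Hypothetical mapping: reaction emojis treated as a signal of irony.
# The rule actually used by labeling.py is not described in this README.
IRONIC_REACTIONS = {"😂", "🤣"}


def top2_reactions(reactions):
    """Return the two most frequent reaction emojis of one message."""
    return [emoji for emoji, _ in Counter(reactions).most_common(2)]


def label_message(reactions):
    """Return 1 (ironic) if an assumed ironic reaction is in the top 2, else 0."""
    return int(any(r in IRONIC_REACTIONS for r in top2_reactions(reactions)))


def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_ratio)
    return rows[n_test:], rows[:n_test]


# Toy messages with made-up reaction counts.
messages = [
    {"text": "message a", "reactions": {"😂": 40, "👍": 5, "🔥": 2}},
    {"text": "message b", "reactions": {"👍": 30, "🔥": 12, "😂": 1}},
]
labeled = [{**m, "label": label_message(m["reactions"])} for m in messages]
train, test = train_test_split(labeled)
```

Here "message a" is labeled ironic because an assumed ironic reaction is among its two most common reactions, while "message b" is not, since that reaction only ranks third.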

Model

Finetuning an uncased language model on the Persian irony detection dataset

cd model/ 
pip install -r requirements.txt

Finetuning a transformer-based language model on irony detection dataset

python train.py  --datapath [path to dataset] --modelpath [path to transformer-based language model] --modelout [path to save finetuned model] --savemodel [path to save finetuned model] --maxlen [maximum sequence length] --batch [batch size] --epoch [epochs] --lr [learning rate]
# example
python train.py --datapath ../dataset/ --modelpath xlm-roberta-base --batch 16 --epoch 5
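
For orientation, the flags listed above could be declared with `argparse` roughly as follows. This is a sketch, not the repository's train.py: the flag names come from the usage line, but the types, default values, and which flags are required are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the flags in the usage line (defaults are assumed)."""
    parser = argparse.ArgumentParser(
        description="Finetune a transformer-based model for Persian irony detection"
    )
    parser.add_argument("--datapath", required=True, help="path to dataset")
    parser.add_argument("--modelpath", required=True,
                        help="path to transformer-based language model")
    parser.add_argument("--modelout", default=None,
                        help="path to save finetuned model")
    parser.add_argument("--savemodel", default=None,
                        help="path to save finetuned model")
    # Assumed defaults; the real script may use different values.
    parser.add_argument("--maxlen", type=int, default=128,
                        help="maximum sequence length")
    parser.add_argument("--batch", type=int, default=16, help="batch size")
    parser.add_argument("--epoch", type=int, default=5, help="epochs")
    parser.add_argument("--lr", type=float, default=2e-5, help="learning rate")
    return parser


# Parse the same arguments as the example command above.
args = build_parser().parse_args(
    ["--datapath", "../dataset/", "--modelpath", "xlm-roberta-base",
     "--batch", "16", "--epoch", "5"]
)
```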

Predict labels using the trained model

python predict.py --datapath [path to dataset] --modelpath [path to transformer-based language model] --predspath [path for predictions of test set] --maxlen [maximum sequence length] --batch [batch size] --epoch [epochs] --lr [learning rate]
# example
python predict.py --datapath ../dataset/ --modelpath xlm-roberta-base --predspath runs/preds

Results

Comparison of different finetuned language models on the Persian dataset

| Language Model | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| ParsBert vr3 | 81.3% | 81.4% | 81.3% | 81.3% |
| XLM-RoBERTa-Base | 82.6% | 82.8% | 82.6% | 82.5% |
| XLM-RoBERTa-Large | 84.7% | 84.7% | 84.6% | 84.6% |
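
The table does not state how recall, precision, and F1 are averaged over the two classes. As a sketch only, assuming macro averaging (an assumption, not confirmed by the repository), the four metrics could be computed from true and predicted labels like this; the function name `macro_metrics` is hypothetical.

```python
def macro_metrics(y_true, y_pred, labels=(0, 1)):
    """Compute accuracy and macro-averaged precision, recall, and F1."""
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precisions, recalls, f1s = [], [], []
    for label in labels:
        # Per-class counts, treating `label` as the positive class.
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy labels: 1 = ironic, 0 = non-ironic.
scores = macro_metrics([1, 1, 0, 0], [1, 1, 1, 0])
```

With macro averaging, each class contributes equally to the reported score regardless of how many examples it has; a weighted average would instead scale each class by its frequency.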
