Python interface for evaluation on ChatGPT models

AI-Initiative-KAUST/Taqyim

Taqyim تقييم

A library for evaluating Arabic NLP datasets on ChatGPT models.

Installation

pip install -e .

Example

import taqyim as tq

pipeline = tq.Pipeline(
    eval_name="ajgt-test",
    dataset_name="arbml/ajgt_ubc_split",
    task_class="classification",
    task_description="Sentiment Analysis",
    input_column_name="content",
    target_column_name="label",
    prompt="Predict the sentiment",
    api_key="<openai-key>",
    train_split="train",
    test_split="test",
    model_name="gpt-3.5-turbo-0301",
    max_samples=1,
)

# run the evaluation
pipeline.run()

# show the output data frame
pipeline.show_results()

# show the eval metrics
pipeline.get_final_report()
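
The example above passes the key as a string literal. A minimal sketch of reading it from the environment instead, so the key stays out of source files, might look like the following. The helper name `load_api_key` and the variable name `OPENAI_API_KEY` are assumptions made for illustration; neither is part of taqyim.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Hypothetical helper (not part of taqyim): read the API key from the environment."""
    key = os.environ.get(env_var)
    if not key:
        # Fail early with a clear message instead of sending an empty key to the API.
        raise RuntimeError(f"Set the {env_var} environment variable before running the evaluation")
    return key
```

The result can then be passed as `api_key=load_api_key()` when constructing `tq.Pipeline`.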

Run on custom dataset

custom_dataset.ipynb has a complete example of how to run an evaluation on a custom dataset.

Parameters

  • eval_name: a name for the evaluation run
  • task_class: class name, chosen from the supported class names
  • task_desc: short description of the task
  • dataset_name: name of the dataset to evaluate on
  • subset: subset name, if the dataset has subsets
  • train_split: name of the train split in the dataset
  • test_split: name of the test split in the dataset
  • input_column_name: name of the input column in the dataset
  • target_column_name: name of the target column in the dataset
  • prompt: the prompt fed to the model
  • task_description: short string explaining the task
  • api_key: OpenAI API key
  • preprocessing_fn: function used to process inputs and targets
  • threads: number of threads used to query the API
  • threads_timeout: thread timeout
  • max_samples: maximum number of samples from the dataset used for evaluation
  • model_name: either gpt-3.5-turbo-0301 or gpt-4-0314
  • temperature: temperature passed to the model, between 0 and 2; higher values give more random results
  • num_few_shot: number of few-shot samples used for evaluation
  • resume_from_record: if True, the run continues from the sample that has no results
  • seed: seed used to reproduce the results
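
Two of the parameters above have documented value constraints: `model_name` takes one of two model identifiers, and `temperature` lies between 0 and 2. A small sketch of checking both before starting a run is shown below. The helper `check_pipeline_args` is hypothetical and not part of taqyim; whether the library performs similar checks itself is not stated here.

```python
# Model identifiers taken from the parameter list above.
SUPPORTED_MODELS = ("gpt-3.5-turbo-0301", "gpt-4-0314")


def check_pipeline_args(model_name, temperature):
    """Hypothetical helper (not part of taqyim): validate two documented constraints."""
    if model_name not in SUPPORTED_MODELS:
        raise ValueError(f"model_name must be one of {SUPPORTED_MODELS}, got {model_name!r}")
    if not 0 <= temperature <= 2:
        raise ValueError(f"temperature must be between 0 and 2, got {temperature}")
    # Return the validated values as keyword arguments.
    return {"model_name": model_name, "temperature": temperature}


extra_args = check_pipeline_args("gpt-4-0314", temperature=0.0)
```

The returned dictionary can be unpacked into the constructor call, for example `tq.Pipeline(..., **extra_args)`, alongside the other arguments.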

Supported Classes and Tasks

Evaluation on Arabic Tasks

| Task | Dataset | Size | Metric | GPT-3.5 | GPT-4 | SoTA |
|---|---|---|---|---|---|---|
| Summarization | EASC | 153 | RougeL | 23.5 | 18.25 | 13.3 |
| PoS Tagging | PADT | 680 | Accuracy | 75.91 | 86.29 | 96.83 |
| Classification | AJGT | 360 | Accuracy | 86.94 | 90.30 | 96.11 |
| Transliteration | BOLT Egyptian✢ | 6,653 | BLEU | 13.76 | 27.66 | 65.88 |
| Translation | UN v1 | 4,000 | BLEU | 35.05 | 38.83 | 53.29 |
| Paraphrasing | APB | 1,010 | BLEU | 4.295 | 6.104 | 17.52 |
| Diacritization | WikiNews✢✢ | 393 | WER/DER | 32.74/10.29 | 38.06/11.64 | 4.49/1.21 |

✢ BOLT requires LDC subscription

✢✢ WikiNews not public, contact authors to access the dataset
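
For readers unfamiliar with the diacritization metrics in the table, the sketch below shows one common way to compute them: DER as the percentage of letters whose diacritics differ from the reference, and WER as the percentage of words containing at least one such letter. This is an illustration under stated assumptions, not the scoring code used by taqyim or by the cited results. It assumes the prediction and reference share the same words and base letters, and it counts every letter, including those without diacritics. Other conventions exist, for example ignoring case endings.

```python
# Arabic diacritic marks: fathatan through sukun (U+064B to U+0652).
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}


def split_diacritics(word):
    """Return a list of (base_char, diacritics) pairs for one word."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS:
            if pairs:  # attach the mark to the preceding base letter
                base, marks = pairs[-1]
                pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def der_wer(reference, prediction):
    """Return (DER, WER) in percent for two diacritized strings.

    Assumes both strings have the same words and base letters
    and differ only in their diacritics.
    """
    char_total = char_errors = word_total = word_errors = 0
    for ref_word, pred_word in zip(reference.split(), prediction.split()):
        ref_pairs = split_diacritics(ref_word)
        pred_pairs = split_diacritics(pred_word)
        # Sort the marks so that shadda+vowel order does not matter.
        errors = sum(sorted(r[1]) != sorted(p[1]) for r, p in zip(ref_pairs, pred_pairs))
        char_total += len(ref_pairs)
        char_errors += errors
        word_total += 1
        word_errors += errors > 0
    if word_total == 0 or char_total == 0:
        raise ValueError("reference and prediction must contain at least one word")
    return 100 * char_errors / char_total, 100 * word_errors / word_total


# Two words, eight letters in total; only the final mark of the second word differs.
reference = "\u0643\u064E\u062A\u064E\u0628\u064E \u0627\u0644\u0648\u064E\u0644\u064E\u062F\u064F"
prediction = "\u0643\u064E\u062A\u064E\u0628\u064E \u0627\u0644\u0648\u064E\u0644\u064E\u062F\u064E"
der, wer = der_wer(reference, prediction)  # der == 12.5, wer == 50.0
```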

@misc{alyafeai2023taqyim,
      title={Taqyim: Evaluating Arabic NLP Tasks Using ChatGPT Models}, 
      author={Zaid Alyafeai and Maged S. Alshaibani and Badr AlKhamissi and Hamzah Luqman and Ebrahim Alareqi and Ali Fadel},
      year={2023},
      eprint={2306.16322},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
