This repository contains all artifacts (models, datasets, and code) from the paper *Organic Data-Driven Approach for Turkish Grammatical Error Correction and LLMs*.
The following table gives an overview of the datasets used in this work. The datasets in the top half are synthetic; those in the bottom half, the evaluation sets, are human-annotated. An Error Types entry of "ERRANT" means the parallel sentences were annotated automatically by the ERRANT tool, which labels each edit with an error type. Token counts are based on OpenAI's tiktoken tokenizer with the gpt2 encoding (see the counting sketch after the table).
Dataset Name | Split | Sentences | Tokens | Error Types | Domain |
---|---|---|---|---|---|
OSCAR GEC (ours) | Train | 2.3m | 213.2m | ERRANT | Web |
GPT GEC (ours) | Train | 100k | 3.6m | ERRANT | Web |
GECTurk (Kara et al., 2023) | Train | 138k | 5.8m | 25 | Newspapers |
OSCAR GEC (ours) | Test | 2.4k | 142k | ERRANT | Web |
Movie Reviews (Kara et al., 2023) | Test | 300 | 2.7k | 25 | Movie Reviews |
Turkish Tweets (Koksal et al., 2020a) | Test | 2k | 116.2k | 13 | Tweets |
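As a point of reference for the Tokens column, here is a minimal sketch of counting gpt2 tokens with tiktoken; the input file name is a placeholder:

```python
import tiktoken

# Load the gpt2 encoding used for the token counts in the table above.
enc = tiktoken.get_encoding("gpt2")

# Hypothetical input: one sentence per line, as in the GEC datasets.
with open("sentences.txt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]

total_tokens = sum(len(enc.encode(s)) for s in sentences)
print(f"{len(sentences)} sentences, {total_tokens} gpt2 tokens")
```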
In addition to the datasets above, we also open-source the Turkish Spelling Dictionary developed in this work; it is available here.
The following are our mT5 models fine-tuned on our two training datasets, OSCAR GEC and GPT GEC, for the Turkish Grammatical Error Correction task. The models are available on HuggingFace (a usage sketch follows the list):
Model 1: Turkish-OSCAR-GEC
Model 2: Turkish-GPT-GEC
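A minimal inference sketch using Hugging Face transformers, assuming the checkpoints are standard mT5 seq2seq models. The repository ID below is a placeholder, and feeding the raw sentence directly (without a task prefix) is an assumption; see the model cards for the exact details:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Hypothetical repository ID; substitute the actual HuggingFace ID
# from the model card (e.g. for Turkish-OSCAR-GEC).
model_id = "username/Turkish-OSCAR-GEC"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Feed an erroneous Turkish sentence and decode the corrected output.
sentence = "yarın okula gidicem"  # common misspelling of "gideceğim"
inputs = tokenizer(sentence, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```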
The following are the results of the Turkish GEC models on the three evaluation sets:
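In the tables below, P, R, and F0.5 stand for precision, recall, and the F0.5 score, i.e. the F-measure with β = 0.5, which weights precision twice as heavily as recall. A minimal sketch of the formula, which reproduces the table entries (e.g. F0.5 = 62.8 from P = 69.8, R = 44.9):

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta = 0.5 favors precision over recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(round(f_beta(69.8, 44.9), 1))  # 62.8, matching the first row below
```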
Eval set 1: OSCAR GEC (ours)
Model | P | R | F0.5 |
---|---|---|---|
GPT GEC (mT5) | 69.8 | 44.9 | 62.8 |
OSCAR GEC (mT5) | 68.7 | 31.2 | 55.4 |
GECTurk (mT5) | 42.5 | 5.7 | 18.2 |
GECTurk (Seq Tagger) (Kara et al., 2023) | 49.0 | 3.9 | 14.7 |
Eval set 2: Turkish Tweets (Koksal et al., 2020a)
Model | P | R | F0.5 |
---|---|---|---|
OSCAR GEC (mT5) | 85.1 | 61.3 | 79.0 |
GPT GEC (mT5) | 77.7 | 68.9 | 75.8 |
GECTurk (Seq Tagger) (Kara et al., 2023) | 64.7 | 19.8 | 44.5 |
GECTurk (mT5) | 57.2 | 20.7 | 42.3 |
Eval set 3: Movie Reviews (Kara et al., 2023)
Model | P | R | F0.5 |
---|---|---|---|
GECTurk (Seq Tagger) (Kara et al., 2023) | 86.5 | 76.2 | 84.2 |
GECTurk (mT5) | 73.1 | 71.8 | 72.8 |
GPT GEC (mT5) | 36.0 | 46.3 | 37.6 |
OSCAR GEC (mT5) | 30.0 | 22.5 | 28.1 |
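For reference, ERRANT-style annotation of a parallel sentence pair looks roughly like the sketch below. This is an assumption-laden sketch: stock ERRANT ships only an English pipeline, and the paper's Turkish annotations presumably rely on an adapted version, so the language argument and the example sentences are placeholders:

```python
import errant

# Stock ERRANT only ships an English pipeline; "en" is a placeholder here.
# A Turkish adaptation would be needed to reproduce the paper's annotations.
annotator = errant.load("en")

orig = annotator.parse("This are a example sentence .")
cor = annotator.parse("This is an example sentence .")

# Align the parallel pair and extract typed edits.
for edit in annotator.annotate(orig, cor):
    print(edit.o_str, "->", edit.c_str, "|", edit.type)
```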