
Datasets, models and code for the Turkish Grammatical Error Correction task


asimokby/Turkish-GEC


Turkish-GEC

This repository contains the artifacts (models, datasets, and code) for the paper "Organic Data-Driven Approach for Turkish Grammatical Error Correction and LLMs".

Datasets

The following is an overview of the datasets used in this work. The first three datasets are synthetic training sets; the last three are human-annotated evaluation sets. An Error Types value of ERRANT means the error types come from ERRANT, a tool that automatically annotates parallel sentences with error-type information. Token counts are computed with OpenAI's tiktoken tokenizer using the gpt2 encoding.

| Dataset Name | Split | Sentences | Tokens | Error Types | Domain |
|---|---|---|---|---|---|
| OSCAR GEC (ours) | Train | 2.3M | 213.2M | ERRANT | Web |
| GPT GEC (ours) | Train | 100k | 3.6M | ERRANT | Web |
| GECTurk (Kara et al., 2023) | Train | 138k | 5.8M | 25 | Newspapers |
| OSCAR GEC (ours) | Test | 2.4k | 142k | ERRANT | Web |
| Movie Reviews (Kara et al., 2023) | Test | 300 | 2.7k | 25 | Movie Reviews |
| Turkish Tweets (Koksal et al., 2020a) | Test | 2k | 116.2k | 13 | Tweets |
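The Tokens column above is described as a tiktoken count with the gpt2 encoding. A minimal sketch of how such a count could be reproduced is shown below. The helper names `count_tokens` and `gpt2_encoder` are illustrative and not part of this repository, and the exact preprocessing used for the reported numbers is not specified here, so totals may differ.

```python
from typing import Callable, Iterable, List


def count_tokens(sentences: Iterable[str], encode: Callable[[str], List[int]]) -> int:
    """Sum the number of tokens that `encode` produces over all sentences."""
    return sum(len(encode(sentence)) for sentence in sentences)


def gpt2_encoder() -> Callable[[str], List[int]]:
    """Return tiktoken's GPT-2 encode function (requires `pip install tiktoken`)."""
    import tiktoken  # imported lazily so count_tokens can be used without it

    return tiktoken.get_encoding("gpt2").encode
```

Calling `count_tokens(sentences, gpt2_encoder())` on the lines of a dataset file gives a total comparable to the Tokens column. The first call to `tiktoken.get_encoding("gpt2")` downloads the encoding files, so it needs network access once.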

In addition to the above datasets, we also open-source the Turkish Spelling Dictionary developed in this work. You may access it from here

Models

The following are our mT5 models, fine-tuned for Turkish Grammatical Error Correction on our two training datasets, OSCAR GEC and GPT GEC. The models are available on Hugging Face:

Model 1: Turkish-OSCAR-GEC

Model 2: Turkish-GPT-GEC
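As a rough illustration, a fine-tuned mT5 checkpoint like the ones above can typically be loaded with the Hugging Face `transformers` seq2seq classes. The sketch below rests on several assumptions: the Hub id `asimokby/Turkish-OSCAR-GEC` is a guess based on the model name and the repository owner, the model is assumed to take the raw sentence as input with no task prefix, and the helper names `batched` and `correct` are illustrative. Check the model card for the actual id and input format.

```python
from typing import Iterator, List, Sequence

# Hypothetical Hub id: the README only gives the name "Turkish-OSCAR-GEC";
# the namespace is an assumption and may need to be changed.
MODEL_ID = "asimokby/Turkish-OSCAR-GEC"


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def correct(
    sentences: Sequence[str],
    model_id: str = MODEL_ID,
    batch_size: int = 8,
    max_new_tokens: int = 128,
) -> List[str]:
    """Run a seq2seq GEC model over `sentences` and return the decoded outputs."""
    # Imported lazily; requires `pip install transformers torch sentencepiece`.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    outputs: List[str] = []
    for batch in batched(sentences, batch_size):
        # Assumption: no task prefix is prepended to the input sentences.
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        generated = model.generate(**inputs, max_new_tokens=max_new_tokens)
        outputs.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return outputs
```

Calling `correct(["..."])` with a list of Turkish sentences returns one output string per input. Generation settings such as beam search are left at their defaults here and are not taken from the paper.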

Results

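The following are the results of Turkish GEC models on three evaluation sets, reported as precision (P), recall (R), and F0.5. Within each table, models are ordered by F0.5.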

Eval set 1: OSCAR GEC (ours)

| Model | P | R | F0.5 |
|---|---|---|---|
| GPT GEC (mT5) | 69.8 | 44.9 | 62.8 |
| OSCAR GEC (mT5) | 68.7 | 31.2 | 55.4 |
| GECTurk (mT5) | 42.5 | 5.7 | 18.2 |
| GECTurk (Seq Tagger) (Kara et al., 2023) | 49.0 | 3.9 | 14.7 |

Eval set 2: Turkish Tweets (Koksal et al., 2020a)

| Model | P | R | F0.5 |
|---|---|---|---|
| OSCAR GEC (mT5) | 85.1 | 61.3 | 79.0 |
| GPT GEC (mT5) | 77.7 | 68.9 | 75.8 |
| GECTurk (Seq Tagger) (Kara et al., 2023) | 64.7 | 19.8 | 44.5 |
| GECTurk (mT5) | 57.2 | 20.7 | 42.3 |

Eval set 3: Movie Reviews (Kara et al., 2023)

| Model | P | R | F0.5 |
|---|---|---|---|
| GECTurk (Seq Tagger) (Kara et al., 2023) | 86.5 | 76.2 | 84.2 |
| GECTurk (mT5) | 73.1 | 71.8 | 72.8 |
| GPT GEC (mT5) | 36.0 | 46.3 | 37.6 |
| OSCAR GEC (mT5) | 30.0 | 22.5 | 28.1 |
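For readers unfamiliar with the metric, F0.5 is the F-beta score with beta = 0.5, which weights precision more heavily than recall. The sketch below uses the standard formula, (1 + beta^2) * P * R / (beta^2 * P + R), to recompute one table entry from its P and R columns. The function name `f_beta` is illustrative. The reported scores were presumably computed from raw edit counts, so recomputing from the rounded P and R values may differ in the last digit for some rows.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 favors precision over recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# OSCAR GEC (mT5) on the Turkish Tweets evaluation set: P = 85.1, R = 61.3
print(round(f_beta(85.1, 61.3), 1))  # prints 79.0
```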
