ir-nlp-csui/parallel-corpus-for-lexical-normalization
parallel-corpus-for-lexical-normalization

Hi,

This is a parallel corpus of Indonesian slang sentences (sentences that may contain slang words) and formal sentences (sentences that contain only formal words). The dataset consists of 4,910 parallel sentence pairs and was used by Kurnia and Yulianti (2020) for lexical/text normalization using statistical machine translation. The sentences come from Instagram posts collected in previous research (Salsabila et al., 2018) to build an Indonesian colloquial lexicon. In this dataset, the --- symbol separates parallel sentence pairs, and the ~~~ symbol separates a slang sentence from its corresponding formal sentence.
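
The separator convention described above could be parsed roughly as follows. This is a minimal sketch, not code shipped with this repository: the function name `parse_parallel_corpus` is hypothetical, the sample sentences are invented for illustration and are not taken from the dataset, and it assumes that neither `---` nor `~~~` ever occurs inside a sentence.

```python
from typing import List, Tuple

PAIR_SEP = "---"   # separates one parallel sentence pair from the next
SIDE_SEP = "~~~"   # separates the slang sentence from the formal sentence


def parse_parallel_corpus(text: str) -> List[Tuple[str, str]]:
    """Return a list of (slang, formal) tuples parsed from the raw text."""
    pairs = []
    for chunk in text.split(PAIR_SEP):
        chunk = chunk.strip()
        if not chunk:
            # Skip empty chunks, e.g. after a trailing separator.
            continue
        slang, sep, formal = chunk.partition(SIDE_SEP)
        if not sep:
            raise ValueError(f"Missing {SIDE_SEP!r} separator in: {chunk!r}")
        pairs.append((slang.strip(), formal.strip()))
    return pairs


# Invented sample in the described format (not real dataset content).
sample = (
    "gw lg makan ~~~ saya sedang makan\n"
    "---\n"
    "makasih bgt ~~~ terima kasih banyak\n"
    "---\n"
)
pairs = parse_parallel_corpus(sample)
for slang, formal in pairs:
    print(f"{slang} -> {formal}")
```

To use it on the actual data, read the dataset file into a string (for example with `open(path, encoding="utf-8").read()`, where `path` is wherever you saved the file) and pass it to the function. If the real file places separators differently than assumed here, the splitting logic would need adjusting.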

Please cite this paper if you use this dataset:
@inproceedings{kurnia2020statistical,
   title={Statistical Machine Translation Approach for Lexical Normalization on Indonesian Text},
   author={Kurnia, Ajmal and Yulianti, Evi},
   booktitle={2020 International Conference on Asian Language Processing (IALP)},
   pages={288--293},
   year={2020},
   organization={IEEE}
}

If you have any questions regarding this dataset, you may contact ajmal.kurnia@ui.ac.id.

Thank you!
