parallel-corpus-for-lexical-normalization

Hi,

This is a parallel corpus of slang sentences (sentences that may contain slang words) and formal sentences (sentences that contain only formal words) in Indonesian. The dataset consists of 4,910 parallel sentence pairs and was used by Kurnia and Yulianti (2020) for lexical/text normalization using statistical machine translation. The sentences come from Instagram posts that were collected in previous research (Salsabila et al., 2018) to build an Indonesian colloquial lexicon. In this dataset, the --- symbol is used as a separator between parallel sentence pairs, and the ~~~ symbol is used as a separator between a slang sentence and its corresponding formal sentence.
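
The separator convention described above can be read with a few lines of Python. The following is only a minimal sketch, not the loader used in the paper. It assumes that the corpus file is UTF-8 encoded, that the slang sentence comes before the ~~~ separator and the formal sentence after it, and that neither separator occurs inside a sentence. The function names `parse_corpus` and `load_corpus` are hypothetical, and the file name `parralel-corpus.txt` is the spelling used in this repository.

```python
from pathlib import Path
from typing import List, Tuple

PAIR_SEP = "---"   # separates one parallel pair from the next
SENT_SEP = "~~~"   # separates the slang sentence from the formal one


def parse_corpus(text: str) -> List[Tuple[str, str]]:
    """Return (slang, formal) tuples parsed from the raw corpus text.

    Assumption: slang comes first, formal second, within each pair.
    """
    pairs = []
    for chunk in text.split(PAIR_SEP):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip empty chunks, e.g. after a trailing separator
        slang, sep, formal = chunk.partition(SENT_SEP)
        if not sep:
            continue  # no ~~~ found: skip the malformed chunk
        pairs.append((slang.strip(), formal.strip()))
    return pairs


def load_corpus(path: str = "parralel-corpus.txt") -> List[Tuple[str, str]]:
    """Read the corpus file (assumed UTF-8) and parse it into pairs."""
    return parse_corpus(Path(path).read_text(encoding="utf-8"))
```

Calling `load_corpus()` from the repository root should return a list of tuples. If the assumptions above hold, its length should match the 4,910 pairs stated above; a different count would suggest that a separator also appears inside some sentence or that the layout differs from this sketch.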

Please cite this paper if you use this dataset:
@inproceedings{kurnia2020statistical,
   title={Statistical Machine Translation Approach for Lexical Normalization on Indonesian Text},
   author={Kurnia, Ajmal and Yulianti, Evi},
   booktitle={2020 International Conference on Asian Language Processing (IALP)},
   pages={288--293},
   year={2020},
   organization={IEEE}
}

If you have any questions regarding this dataset, you may contact ajmal.kurnia@ui.ac.id.

Thank you!
