Welcome to our repository! It hosts the data for the research paper "IndoCollex: A Testbed for Morphological Transformation of Indonesian Word Colloquialism", published in Findings of ACL-IJCNLP 2021. We also provide the guideline we followed when annotating the data.
├───data
│ ├───full.csv
│ ├───formal_to_informal
│ └───informal_to_formal
├───dict
└───guideline
- `data/formal_to_informal`: Data used to train the formal-to-informal system in the paper (contains train, dev, and test splits).
- `data/informal_to_formal`: Data used to train the informal-to-formal system in the paper (contains train, dev, and test splits).
- `data/full.csv`: The full data, which is divided into the `formal_to_informal` and `informal_to_formal` data. Some labels are also excluded.
- `guideline`: PDF guideline describing how we annotated the data.
- `dict`: A phrase-level formal-informal Indonesian dictionary (i.e. kamus alay), in TSV format.
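
To get a first look at the files listed above, a minimal Python sketch such as the following may help. The column layout of `data/full.csv` and of the TSV dictionary is not described here, so the sketch only reads the first line and a few rows. The helper name `peek_table` is hypothetical, and treating the first line as a header is an assumption.

```python
import csv
from itertools import islice
from pathlib import Path


def peek_table(path, delimiter=",", n_rows=5):
    """Return (first_line, next n_rows rows) of a delimited text file.

    Assumption: the first line is a header. If a file has no header,
    the first data row is returned in its place.
    """
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader, [])
        rows = list(islice(reader, n_rows))
    return header, rows


if __name__ == "__main__":
    # Path taken from the repository layout; run from the repository root.
    full_csv = Path("data") / "full.csv"
    if full_csv.exists():
        header, rows = peek_table(full_csv)
        print("columns:", header)
        for row in rows:
            print(row)
```

For the TSV dictionary in `dict`, pass `delimiter="\t"` together with the path of the file you want to inspect.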
We break down colloquial transformation into several categories, as follows:
Category | Description | Example |
---|---|---|
Disemvowelling | Elimination of some or all of the vowels | kemarin - kmrn, belum - blum, besok - bsk, bagaimana - bgmn |
Affixation | Modification, addition, or removal of affixes | menyanyikan - nyanyiin, mengabari - ngabarin |
Shortening | Shortening of the original word | sudah - dah, internet - inet, halusinasi - halu |
Space/dash removal | Space and dash removal, including collapsing repeated words | di rumah - dirumah, terima kasih - terimakasih, ibu-ibu - ibu2 |
Sound alter | Slight change in sound and/or spelling | pakai - pake, pahit - pait, aku - akuh |
Acronym | Syllabic and letter compounds of one or more words, akin to acronyms, abbreviations, and portmanteaus | ibu hamil - bumil, budak cinta - bucin, anak baru gede - abg |
Reverse | Letter reversal, colloquially known as “Boso Walikan” | malang - ngalam, bang - ngab |
Loan words* | Borrowed words, often from a local language or English | bapak - bokap |
Jargon* | Taglines and phrases that have become popular terms | mana saya tahu - meneketehe |
\* Categories marked with an asterisk are excluded from our model data, but you can find them in the end-to-end formal-informal dictionary.
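
To make a few of the categories above concrete, here is a minimal Python sketch of toy rules for disemvowelling, space/dash removal, and reversal. These are illustrative hand-written rules, not the models from the paper. All function names are hypothetical. Keeping the first letter when dropping vowels and treating "ng" and "ny" as single units when reversing are assumptions chosen to match the examples in the table.

```python
import re

VOWELS = "aeiou"


def disemvowel(word):
    """Drop every vowel after the first character (full disemvowelling)."""
    if not word:
        return word
    return word[0] + "".join(ch for ch in word[1:] if ch.lower() not in VOWELS)


def remove_spaces(text):
    """Join the words of a phrase by deleting spaces."""
    return text.replace(" ", "")


def collapse_reduplication(text):
    """Rewrite a hyphenated repeated word as the word followed by "2"."""
    return re.sub(r"\b(\w+)-\1\b", r"\g<1>2", text)


def reverse_word(word):
    """Reverse a word while keeping the digraphs "ng" and "ny" intact."""
    units = re.findall(r"ng|ny|.", word)
    return "".join(reversed(units))


print(disemvowel("kemarin"))              # kmrn
print(remove_spaces("terima kasih"))      # terimakasih
print(collapse_reduplication("ibu-ibu"))  # ibu2
print(reverse_word("malang"))             # ngalam
```

Partial cases such as belum - blum, as well as affixation, acronyms, and sound alteration, do not follow such simple rules.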
Some Indonesian colloquial words are formed by applying several transformations in sequence, for example:
teman-teman -> teman2 -> temen2
bagaimana -> gimana -> gmn
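
The two chains above can be sketched as a sequence of functions applied one after another. This is only an illustration under stated assumptions: the helper names are hypothetical, and the sound-alter and shortening steps are stood in for by small hard-coded lookup tables, since they do not follow a simple rule.

```python
import re


def collapse_reduplication(text):
    """Rewrite a hyphenated repeated word as the word followed by "2"."""
    return re.sub(r"\b(\w+)-\1\b", r"\g<1>2", text)


def disemvowel(word):
    """Drop every vowel after the first character."""
    return word[:1] + "".join(ch for ch in word[1:] if ch not in "aeiou")


def lookup(table):
    """Build a step that maps known words and leaves others unchanged."""
    return lambda word: table.get(word, word)


# Hypothetical lookup tables, filled only with the examples above.
SOUND_ALTER = {"teman2": "temen2"}
SHORTENING = {"bagaimana": "gimana"}


def apply_chain(word, steps):
    """Apply each step in order and return every intermediate form."""
    forms = [word]
    for step in steps:
        forms.append(step(forms[-1]))
    return forms


chain_a = apply_chain("teman-teman", [collapse_reduplication, lookup(SOUND_ALTER)])
chain_b = apply_chain("bagaimana", [lookup(SHORTENING), disemvowel])
print(" -> ".join(chain_a))  # teman-teman -> teman2 -> temen2
print(" -> ".join(chain_b))  # bagaimana -> gimana -> gmn
```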
You can find our paper here: https://aclanthology.org/2021.findings-acl.280.pdf
If you use any of our work in your academic work, please cite:
@inproceedings{wibowo-etal-2021-indocollex,
title = "{I}ndo{C}ollex: A Testbed for Morphological Transformation of {I}ndonesian Word Colloquialism",
author = {Wibowo, Haryo Akbarianto and Nityasya, Made Nindyatama and Aky{\"u}rek, Afra Feyza and Fitriany, Suci and Aji, Alham Fikri and Prasojo, Radityo Eko and Wijaya, Derry Tanti},
booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.findings-acl.280",
doi = "10.18653/v1/2021.findings-acl.280",
pages = "3170--3183",
}
- Haryo Akbarianto Wibowo @ Kata.ai
- Made Nindyatama Nityasya @ Kata.ai
- Afra Feyza Akyürek @ Boston University
- Suci Fitriany @ Kata.ai
- Alham Fikri Aji @ Kata.ai
- Radityo Eko Prasojo @ Kata.ai & Universitas Indonesia
- Derry Tanti Wijaya @ Boston University