Implement Khmer script conversion

This is the central tracking issue for implementing the first Khmer script conversion system.

cc: @monyoudom @artkulak @wkwong-ribose

## Introduction to script conversion

Script conversion includes the process of transliteration (pure script to script conversion), transcription (script to phonetic/other means to script conversion) and Romanization (transcription or transliteration from non-Latin to Latin script).

The script conversion process itself is deterministic: a result is either correct or incorrect.

For some simple script conversion systems, such as "Cyrillic to Latin", the script conversion process is the whole process, performed via an alphabet to alphabet mapping step.
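
As a minimal sketch of such a mapping step (the table below is a small illustrative subset and does not follow any particular standard; `CYRILLIC_TO_LATIN` and `convert` are hypothetical names, not Interscript APIs):

```python
from typing import Mapping

# Illustrative subset only; NOT the table of any particular standard.
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d",
    "к": "k", "м": "m", "о": "o", "с": "s", "т": "t",
}


def convert(text: str, table: Mapping[str, str]) -> str:
    """Map each character through the table; unmapped characters pass through unchanged."""
    return "".join(table.get(ch, ch) for ch in text)


print(convert("москва", CYRILLIC_TO_LATIN))  # moskva
```

The same input always yields the same output, which is what makes this kind of conversion straightforward to verify.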

However, BGN systems for Cyrillic to Latin have additional rules, such as capitalising proper nouns (place names, personal names), which require contextual information. A rule-based approach could work, but this is where deep learning may become necessary.

For languages like Khmer, it becomes more complex because factors such as hidden vowels, phonetic components, syllable boundaries and word segmentation need to be taken into account when performing script conversion (e.g. Khmer to Latin).

In the case of Arabic, there are a few stages that need to happen:
* unpointed Arabic (the typical form, lacking vowels) => fully-pointed Arabic (with all vowel/diacritic information) (a preparation stage)
* word categorisation (proper nouns, special names) (a preparation stage)
* actual script conversion stage
  * character mappings that apply to general text
  * character mappings that apply to special word categories
* flattening of converted script (a postprocessing stage)
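
One way to picture these stages is as a chain of functions, where each stage can later be backed by either a deterministic rule set or a trained model. The sketch below is only an assumption about how this could be structured; every function name is hypothetical and the stage bodies are placeholders, not real Arabic processing:

```python
from typing import Callable, List

# A stage takes text and returns text; it may be rule-based or model-backed.
Stage = Callable[[str], str]


def run_pipeline(text: str, stages: List[Stage]) -> str:
    """Apply each stage in order, feeding the output of one into the next."""
    for stage in stages:
        text = stage(text)
    return text


# Placeholder stages (hypothetical): real versions would do the actual work.
def add_vowel_points(text: str) -> str:
    return text  # preparation: would restore vowel/diacritic information


def tag_word_categories(text: str) -> str:
    return text  # preparation: would mark proper nouns and special names


def map_characters(text: str) -> str:
    return text  # conversion: would apply general and category-specific mappings


def flatten(text: str) -> str:
    return " ".join(text.split())  # postprocessing: here, only collapses whitespace


STAGES = [add_vowel_points, tag_word_categories, map_characters, flatten]
```

Keeping every stage behind the same signature means a deterministic stage can be swapped for a learned one (or the reverse) without touching the rest of the chain.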

We need to figure out how many and which of these stages will benefit from deep learning, and which of these stages need to be deterministically performed.

We need to elaborate on these stages and implement them for Khmer.

## Interscript and Khmer

Khmer is the language of Cambodia, and it uses an abugida script.

Transliteration of Abugida scripts cannot be performed via simple substitution or inference due to issues discussed in #253.

Features needed are reproduced below:
> For higher accuracy, the following features are needed:
> * Dictionary Lookup (has been implemented for Korean, see #240)
> * Frequency lookup (check syllable frequency or bi-gram / tri-gram frequency table)
> * Control structure (if..then..else)
> * Variables (implemented, see interscript map development guide)
> * A built-in way to mark syllabic boundaries (missing)
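
To make the frequency lookup item concrete, here is a rough sketch of a bi-gram table over syllables, used to choose between candidate readings. It assumes the text has already been split into syllables (which is itself one of the missing features), and `bigram_counts` and `pick_candidate` are hypothetical names, not part of Interscript or Secryst:

```python
from collections import Counter
from typing import Iterable, List, Sequence


def bigram_counts(corpus: Iterable[Sequence[str]]) -> Counter:
    """Count adjacent syllable pairs across a corpus of syllable sequences."""
    counts: Counter = Counter()
    for syllables in corpus:
        counts.update(zip(syllables, syllables[1:]))
    return counts


def pick_candidate(previous: str, candidates: List[str], counts: Counter) -> str:
    """Return the candidate most often seen after `previous` (first one wins ties)."""
    return max(candidates, key=lambda c: counts[(previous, c)])


# Placeholder syllables, not real Khmer data.
corpus = [["ka", "ta"], ["ka", "ta"], ["ka", "to"]]
counts = bigram_counts(corpus)
print(pick_candidate("ka", ["to", "ta"], counts))  # ta
```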

In Interscript, model training and inference are performed via [Secryst](https://github.com/secryst/secryst).

## Current status

There are two datasets we have tried:
* example dataset in Secryst: ["khm-latn"](https://github.com/secryst/secryst/tree/master/examples/khm-latn)
* [data-khmer-translit dataset](https://github.com/secryst/data-khmer-translit) in its own repo

"data-khmer-translit" was built according to the data from [these sources](https://github.com/secryst/data-khmer-translit#origin).

The method of training is described in the [Secryst README](https://github.com/secryst/secryst#examples-1), where the example dataset "khm-latn" is described.

So far, naively training on these datasets has produced subpar results. We don't even have a good way of quantifying the quality, because we lacked a Khmer expert until now (thank you @monyoudom).
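
One possible way to quantify quality, once expert-verified reference romanizations exist, is a character error rate: the edit distance between the model output and the reference, divided by the reference length. This is only a suggestion and not something the project has adopted; the function names below are hypothetical:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(predicted: str, reference: str) -> float:
    """Edit distance normalised by reference length; 0.0 means an exact match."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(predicted, reference) / len(reference)
```

A metric like this only measures agreement with the reference, so the reference set itself still needs to be checked by someone who knows Khmer.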

## Moving forward

There are a few steps we need to address:

1. Basic Khmer understanding. Figure out which of the stages for Khmer script conversion can be done using deep learning. Determine which of the stages need to be deterministically performed (e.g. character mapping).
2. Confirm what features are needed in Interscript and the trainer (Secryst).
3. Investigate the UN Khmer system (#667) to see what needs to be implemented.
4. Compile the necessary data and perform fitting.
5. Integrate trained inference models into Interscript flow, and if additional features are needed, request help from other members of @interscript/developers to implement them.

That's it!
