
Implement Khmer script conversion #717

@ronaldtse

This issue will serve as the central issue relating to implementing the first Khmer script conversion system.

cc: @monyoudom @artkulak @wkwong-ribose

Introduction to script conversion

Script conversion covers transliteration (pure script-to-script conversion), transcription (conversion from one script to another via a phonetic or other intermediate representation) and Romanization (transcription or transliteration from a non-Latin script to Latin script).

The script conversion process itself is deterministic: the result is either correct or incorrect.

For some simple script conversion systems, such as "Cyrillic to Latin", the script conversion process is the whole process, performed via an alphabet to alphabet mapping step.
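As a minimal sketch of such a mapping step, the snippet below converts text one character at a time through a lookup table. The table is a hypothetical handful of letters chosen for illustration, not a complete or standardised Cyrillic to Latin system, and case handling is left out.

```python
# Illustrative table only: a few letters, not a complete or standard mapping.
CYRILLIC_TO_LATIN = {
    "а": "a", "д": "d", "к": "k", "м": "m", "о": "o", "с": "s", "т": "t",
}


def convert(text: str, table: dict[str, str]) -> str:
    """Map each character through the table; unknown characters pass through unchanged."""
    return "".join(table.get(ch, ch) for ch in text)


print(convert("мост", CYRILLIC_TO_LATIN))  # most
```

Everything the conversion needs is in the table, which is why no context or training is required for systems of this kind.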

However, BGN systems for Cyrillic to Latin add further rules, such as capitalising proper nouns (place names, personal names). These rules require additional contextual information. A rule-based approach could work, but this is where deep learning becomes necessary in practice, because the context has to be inferred from the text.

For languages like Khmer, it becomes more complex because factors such as hidden vowels, phonetic components, syllable boundaries and word segmentation need to be taken into account when performing script conversion (e.g. Khmer to Latin).
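To make the syllable boundary point concrete, here is a rough sketch that groups Khmer text into orthographic clusters using only Unicode code point ranges. It assumes that a consonant or independent vowel (U+1780 to U+17B3) starts a new cluster unless it follows the coeng sign (U+17D2), which marks a subscript consonant. The function names are hypothetical, and orthographic clusters only approximate phonological syllables: hidden vowels and word segmentation are not handled at all.

```python
COENG = "\u17d2"  # KHMER SIGN COENG: the following consonant is written as a subscript


def is_base(ch: str) -> bool:
    """True for Khmer consonants and independent vowels (U+1780..U+17B3)."""
    return "\u1780" <= ch <= "\u17b3"


def split_clusters(text: str) -> list[str]:
    """Group characters into orthographic clusters (a naive approximation)."""
    clusters: list[str] = []
    for ch in text:
        after_coeng = bool(clusters) and clusters[-1].endswith(COENG)
        if not clusters or (is_base(ch) and not after_coeng):
            clusters.append(ch)   # start a new cluster
        else:
            clusters[-1] += ch    # subscript, vowel sign or diacritic: extend the cluster
    return clusters


# U+1781 U+17D2 U+1798 U+17C2 U+179A is the word "Khmer" in Khmer script.
word = "\u1781\u17d2\u1798\u17c2\u179a"
print(len(split_clusters(word)))  # 2
```

Characters outside the Khmer block (spaces, punctuation) are simply attached to the preceding cluster here, so real input would need tokenising first.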

In the case of Arabic, there are a few stages that need to happen:

  • unpointed Arabic (the typical form, lacking vowels) => fully-pointed Arabic (with all vowel/diacritic information) (a preparation stage)
  • word categorisation (proper nouns, special names) (a preparation stage)
  • actual script conversion stage
    • character mappings that apply to general text
    • character mappings that apply to special word categories
  • flattening of converted script (a postprocessing stage)
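The stages above can be sketched as a small pipeline in which each stage is a separate function, so that any one of them could later be swapped for a learned model or kept deterministic. Everything here is a placeholder: the `Token` type, the stage names and the toy Latin tables are assumptions made for illustration and contain no real Arabic rules.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Token:
    text: str
    category: str = "general"  # "general" or "special" (e.g. a proper noun)


# Toy tables, NOT real mappings: one for general text, one for special words.
GENERAL_TABLE = {"x": "ks"}
SPECIAL_TABLE = {"x": "X"}


def prepare(tokens: list[Token]) -> list[Token]:
    """Preparation stage placeholder (e.g. restoring vowels); identity here."""
    return tokens


def categorise(tokens: list[Token], special_words: set[str]) -> list[Token]:
    """Preparation stage: tag words that belong to a special category."""
    return [replace(t, category="special") if t.text in special_words else t
            for t in tokens]


def convert(tokens: list[Token]) -> list[Token]:
    """Conversion stage: choose the mapping table by word category."""
    out = []
    for t in tokens:
        table = SPECIAL_TABLE if t.category == "special" else GENERAL_TABLE
        out.append(replace(t, text="".join(table.get(c, c) for c in t.text)))
    return out


def flatten(tokens: list[Token]) -> str:
    """Postprocessing stage: join the converted tokens back into plain text."""
    return " ".join(t.text for t in tokens)


def run(text: str, special_words: set[str]) -> str:
    tokens = [Token(w) for w in text.split()]
    return flatten(convert(categorise(prepare(tokens), special_words)))


print(run("max tax", {"max"}))  # maX taks
```

Keeping the stage boundaries explicit makes it possible to evaluate each stage on its own.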

We need to figure out how many and which of these stages will benefit from deep learning, and which of these stages need to be deterministically performed.

We need to elaborate on these stages and implement them for Khmer.

Interscript and Khmer

Khmer is the language of Cambodia, and it is written in an abugida script.

Transliteration of abugida scripts cannot be performed via simple substitution or inference due to the issues discussed in #253.

The features that #253 lists as needed for higher accuracy are reproduced below:

  • Dictionary lookup (has been implemented for Korean, see #240)
  • Frequency lookup (check syllable frequency or bi-gram / tri-gram frequency table)
  • Control structure (if..then..else)
  • Variables (implemented, see interscript map development guide)
  • A built-in way to mark syllabic boundaries (missing)
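As an illustration of the frequency lookup item, the sketch below builds a bigram frequency table from already segmented text and uses it to choose between candidate segmentations. This is not Interscript's or Secryst's API; the function names and the romanised toy corpus are hypothetical.

```python
from collections import Counter


def bigram_counts(corpus: list[list[str]]) -> Counter:
    """Count adjacent syllable pairs across a corpus of segmented words."""
    counts: Counter = Counter()
    for syllables in corpus:
        counts.update(zip(syllables, syllables[1:]))
    return counts


def score(candidate: list[str], counts: Counter) -> int:
    """Sum the corpus frequency of every adjacent pair in the candidate."""
    return sum(counts[pair] for pair in zip(candidate, candidate[1:]))


def pick_best(candidates: list[list[str]], counts: Counter) -> list[str]:
    """Return the candidate segmentation with the highest bigram score."""
    return max(candidates, key=lambda c: score(c, counts))


# Toy corpus of made-up syllables, for illustration only.
corpus = [["ka", "ta"], ["ka", "ta", "na"], ["kat", "a"]]
counts = bigram_counts(corpus)
print(pick_best([["ka", "ta"], ["kat", "a"]], counts))  # ['ka', 'ta']
```

A tri-gram table would follow the same pattern with three-way windows, and a dictionary lookup could be tried first with this as the fallback.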

In Interscript, model training and inference are performed via Secryst.

Current status

There are two datasets we have tried:

"data-khmer-translit" was built according to the data from these sources.

The method of training is described in the Secryst README, where the example dataset "khm-latn" is described.

So far, naively training on these datasets has produced subpar results. We don't even have a good way of quantifying the quality because we have lacked a Khmer expert until now (thank you @monyoudom).
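One possible way to put a number on quality, once expert-verified reference romanizations exist, is a character error rate: the edit distance between model output and reference, divided by the reference length. This is a suggestion rather than something already used in the project, and the function names below are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(pairs: list[tuple[str, str]]) -> float:
    """pairs holds (reference, hypothesis); returns total edits / reference chars."""
    total_edits = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total_chars = sum(len(ref) for ref, _ in pairs)
    return total_edits / total_chars if total_chars else 0.0


print(edit_distance("kitten", "sitting"))  # 3
```

The metric is only as good as the references, so the pairs would need to be checked by someone who knows Khmer.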

Moving forward

There are a few steps we need to address:

  1. Basic Khmer understanding. Figure out which of the stages for Khmer script conversion can be done using deep learning. Determine which of the stages need to be deterministically performed (e.g. character mapping).
  2. Confirm what features are needed in Interscript and the trainer (Secryst).
  3. Investigate the UN Khmer system (#667) to see what needs to be implemented.
  4. Compile the necessary data and perform fitting.
  5. Integrate trained inference models into Interscript flow, and if additional features are needed, request help from other members of @interscript/developers to implement them.

That's it!

Labels: enhancement
