g2pW: Mandarin Grapheme-to-Phoneme Converter

Authors: Yi-Chang Chen, Yu-Chuan Chang, Yen-Cheng Chang and Yi-Ren Yeh

This is the official repository for our paper, "g2pW: A Conditional Weighted Softmax BERT for Polyphone Disambiguation in Mandarin".

Getting Started

Dependency / Install

(This work was tested with PyTorch 1.7.0, CUDA 10.1, Python 3.6 and Ubuntu 16.04.)

Install PyTorch first, then install g2pw with pip:
$ pip install g2pw

Quick Demo

>>> from g2pw import G2PWConverter
>>> conv = G2PWConverter()
>>> sentence = '上校請技術人員校正FN儀器'
>>> conv(sentence)
[['ㄕㄤ4', 'ㄒㄧㄠ4', 'ㄑㄧㄥ3', 'ㄐㄧ4', 'ㄕㄨ4', 'ㄖㄣ2', 'ㄩㄢ2', 'ㄐㄧㄠ4', 'ㄓㄥ4', None, None, 'ㄧ2', 'ㄑㄧ4']]
>>> sentences = ['銀行', '行動']
>>> conv(sentences)
[['ㄧㄣ2', 'ㄏㄤ2'], ['ㄒㄧㄥ2', 'ㄉㄨㄥ4']]
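
In the first demo call, the converter returns one entry per input character, with None for characters it does not transcribe (the Latin letters "FN"). The output can therefore be zipped back onto the sentence. The following is a minimal sketch, assuming that one-to-one alignment holds in general as it does in the example; `align_phonemes` is a hypothetical helper and not part of g2pw. The predictions are copied from the demo output so the snippet runs without loading the model.

```python
from typing import List, Optional, Tuple


def align_phonemes(
    sentence: str, phonemes: List[Optional[str]]
) -> List[Tuple[str, Optional[str]]]:
    """Pair each character with its predicted phoneme (hypothetical helper)."""
    # Assumption: the converter emits exactly one entry per input character.
    if len(sentence) != len(phonemes):
        raise ValueError("expected one phoneme entry per character")
    return list(zip(sentence, phonemes))


# Values copied from the Quick Demo output above.
sentence = '上校請技術人員校正FN儀器'
predicted = ['ㄕㄤ4', 'ㄒㄧㄠ4', 'ㄑㄧㄥ3', 'ㄐㄧ4', 'ㄕㄨ4', 'ㄖㄣ2', 'ㄩㄢ2',
             'ㄐㄧㄠ4', 'ㄓㄥ4', None, None, 'ㄧ2', 'ㄑㄧ4']

pairs = align_phonemes(sentence, predicted)

# Characters the converter left untranscribed.
untranscribed = [ch for ch, ph in pairs if ph is None]
print(pairs[:2])
print(untranscribed)
```

Note that the polyphonic character 校 appears twice in this sentence and receives a different reading each time, which is the disambiguation the model is built for.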

Use GPU to Speed Up

conv = G2PWConverter(use_cuda=True)
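
If the same code has to run on machines with and without a GPU, the `use_cuda` flag can be derived at runtime instead of hard-coded. This is a sketch under the assumption that PyTorch is the backend in use; `cuda_available` is a hypothetical helper, and the final converter call is left commented out because it needs g2pw and its model files.

```python
def cuda_available() -> bool:
    """Return True only if PyTorch is importable and reports a usable GPU."""
    try:
        import torch
    except ImportError:
        # No PyTorch installed: fall back to CPU.
        return False
    return bool(torch.cuda.is_available())


use_cuda = cuda_available()
print(f"use_cuda={use_cuda}")

# from g2pw import G2PWConverter
# conv = G2PWConverter(use_cuda=use_cuda)
```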

Load Offline Model

conv = G2PWConverter(model_dir='./G2PWModel-v1/')

Support Simplified Chinese and Pinyin

$ pip install g2pw[opencc]

>>> from g2pw import G2PWConverter
>>> conv = G2PWConverter(style='pinyin', enable_non_tradional_chinese=True)
>>> conv('然而，他红了20年以后，他竟退出了大家的视线。')
[['ran2', 'er2', None, 'ta1', 'hong2', 'le5', None, None, 'nian2', 'yi3', 'hou4', None, 'ta1', 'jing4', 'tui4', 'chu1', 'le5', 'da4', 'jia1', 'de5', 'shi4', 'xian4', None]]
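
With `style='pinyin'`, each syllable carries its tone as a trailing digit, and tone 5 marks the neutral tone (as in `le5` and `de5`). Downstream code often wants the syllable and tone separately. A minimal sketch, assuming every non-None entry ends in a single tone digit as in the output above; `split_tone` is a hypothetical helper, not part of g2pw.

```python
from typing import List, Optional, Tuple


def split_tone(syllable: Optional[str]) -> Optional[Tuple[str, int]]:
    """Split 'hong2' into ('hong', 2); pass None through unchanged."""
    if syllable is None:
        return None
    # Assumption: the last character is always the tone digit.
    if len(syllable) < 2 or not syllable[-1].isdigit():
        raise ValueError(f"no trailing tone digit in {syllable!r}")
    return syllable[:-1], int(syllable[-1])


# First few entries copied from the pinyin output above.
predicted: List[Optional[str]] = ['ran2', 'er2', None, 'ta1', 'hong2', 'le5']
parsed = [split_tone(s) for s in predicted]
print(parsed)
```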

Scripts

To run the scripts below, clone the repository first:

$ git clone https://github.com/GitYCC/g2pW.git

Train Model

For example, to train a model on the CPP dataset:

$ bash cpp_dataset/download.sh
$ python scripts/train_g2p_bert.py --config configs/config_cpp.py

Prediction

$ python scripts/test_g2p_bert.py \
--config saved_models/CPP_BERT_M_DescWS-Sec-cLin-B_POSw01/config.py \
--checkpoint saved_models/CPP_BERT_M_DescWS-Sec-cLin-B_POSw01/best_accuracy.pth \
--sent_path cpp_dataset/test.sent \
--output_path output_pred.txt

Testing

$ python scripts/predict_g2p_bert.py \
--config saved_models/CPP_BERT_M_DescWS-Sec-cLin-B_POSw01/config.py \
--checkpoint saved_models/CPP_BERT_M_DescWS-Sec-cLin-B_POSw01/best_accuracy.pth \
--sent_path cpp_dataset/test.sent \
--lb_path cpp_dataset/test.lb

Checkpoints

G2PWModel-v1.zip

Citation

@misc{chen2022g2pw,
      title={g2pW: A Conditional Weighted Softmax BERT for Polyphone Disambiguation in Mandarin}, 
      author={Yi-Chang Chen and Yu-Chuan Chang and Yen-Cheng Chang and Yi-Ren Yeh},
      year={2022},
      eprint={2203.10430},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
