This is a dictionary mapping Mandarin Chinese pinyin syllables to the simplified characters that may have the corresponding reading. The characters are sorted by frequency and split into 4 groups:
- Majority characters are the most frequent characters, together comprising 95% of the corpus.
- Minority characters are the next most frequent characters, extending cumulative coverage from 95% to 99.8% of the corpus.
- Rare characters are the least frequent characters, comprising the remaining 0.2% of the corpus.
- Uncommon characters are those present in the Unicode database, but not observed in the corpus.
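As a rough sketch of this grouping (not the project's actual code; the function name and thresholds are taken from the definitions above), the split by cumulative relative frequency could look like:

```python
def split_groups(char_freqs):
    """Split (character, frequency) pairs into the four groups.

    char_freqs: list of (char, freq) pairs; freq == 0 marks characters
    that are in the Unicode database but unobserved in the corpus.
    """
    ordered = sorted(char_freqs, key=lambda cf: cf[1], reverse=True)
    total = sum(f for _, f in ordered)
    groups = {"majority": [], "minority": [], "rare": [], "uncommon": []}
    cumulative = 0
    for char, freq in ordered:
        cumulative += freq
        if freq == 0:
            groups["uncommon"].append(char)
        elif cumulative / total <= 0.95:
            groups["majority"].append(char)
        elif cumulative / total <= 0.998:
            groups["minority"].append(char)
        else:
            groups["rare"].append(char)
    return groups
```

For example, with frequencies 950, 48, 2, and 0 (out of 1000), the four characters land in the majority, minority, rare, and uncommon groups respectively.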
Here is an example dictionary entry for "ge":
If a character has several readings, I try to estimate the frequency of each reading separately; this is why you can see the same character in different groups in the entry above. The estimate may be inaccurate, because frequency data for individual readings is very limited in the available databases.
The main artifact of the project is a DSL dictionary you can add to GoldenDict or a similar dictionary manager. Copy ChPhoneticIndex.ann, ChPhoneticIndex.bmp, and ChPhoneticIndex.dsl into your custom dictionary folder, add it to your favorite dictionary manager and enjoy.
Here you can look at the result represented as a table sorted by frequency or sorted by pinyin. Here are also some database statistics.
The entropy value characterizes the uncertainty in the graphical representation of a given pinyin syllable (uncommon characters excluded). An entropy of 0 means the syllable can be written with only one character; larger values mean more options are available. If a syllable can be represented by n characters, its entropy is maximized when all the characters have the same frequency (the uniform distribution). Here is a reference table for that case:
| n | Entropy of the uniform distribution |
|---|---|
| 1 | 0.000 |
| 2 | 0.693 |
| 3 | 1.099 |
| 4 | 1.386 |
| 5 | 1.609 |
| 6 | 1.792 |
| 7 | 1.946 |
| 8 | 2.079 |
| 9 | 2.197 |
| 10 | 2.303 |
| 11 | 2.398 |
| 12 | 2.485 |
| 13 | 2.565 |
| 14 | 2.639 |
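The table above is simply ln(n), since the entropy of a uniform distribution over n options is the natural logarithm of n. A minimal sketch of the computation (using the natural logarithm, to match the table):

```python
import math

def entropy(freqs):
    """Shannon entropy (natural logarithm) of a frequency distribution."""
    total = sum(freqs)
    return -sum(f / total * math.log(f / total) for f in freqs if f > 0)

# Three equally frequent characters give ln(3), matching the table:
print(round(entropy([10, 10, 10]), 3))  # 1.099
# A skewed distribution is more predictable, so its entropy is lower:
print(entropy([97, 2, 1]) < entropy([1, 1, 1]))  # True
```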
Input data:
- Unicode Han Database: Unihan_Readings and Unihan_Variants tables.
- Jun Da Character frequency list of Modern Chinese.
- All the tables are merged into a single one; the frequency numbers of Jun Da and HanyuPinlu are added together.
- Only the HanyuPinlu, HanyuPinyin, Mandarin, TGHZ2013, and XHC1983 fields of Unihan_Readings are kept.
- Traditional characters fully replaced by simplified forms are filtered out.
- Estimate a frequency for each (`character`, `pinyin`) pair: if a relative frequency value from HanyuPinlu is available, use it. If `character` is in Jun Da and `pinyin` is in the corresponding Mandarin field, use the Jun Da frequency divided by the number of Mandarin readings (usually there is only one). Otherwise, set the frequency to 0.
- Estimate a rank for each (`character`, `pinyin`) pair: for each Unihan_Readings field, find the index of `pinyin`, invert it, and sum the results. E.g. for 啁, HanyuPinyin is "zhāo dāo zhōu tiáo diào", Mandarin is "zhāo", and both XHC1983 and TGHZ2013 are "zhāo zhōu". Then the zhāo rank is 1/1 * 4 = 4, the dāo rank is 1/2 = 0.5, the zhōu rank is 1/3 + 1/2 * 2 ≈ 1.333, etc.
- Calculate the cumulative frequency and split the (`character`, `pinyin`) pairs into 4 groups. Within each group, sort by (frequency, rank).
- Generate the output.
If you want to reproduce the results, you need Python 3.10 or newer with the requirements installed, CMake 3.18 or newer, and a CMake-compatible build tool. Run the build in the usual way, e.g.:
```
cmake -S ChinesePhoneticIndex -B ChinesePhoneticIndex/build
cmake --build ChinesePhoneticIndex/build
```
If you want to commit the output, pass `-DRELEASE_BUILD` in the configuration step.