C++ tools for working with multi-byte text, mainly focused on Japanese Kanji and Kana.

anzumura/kanji-tools

C++ kanji Tools

GitHub has automatically generated a Table of Contents in the header since a change in April 2021. Note that relative links to directories and some file types don't work from the Doxygen-generated main page.

Introduction

This repository contains code for four main programs:

  • kanaConvert: program that converts between Hiragana, Katakana and Rōmaji
  • kanjiFormat: program used to format sample-data files (from 青空文庫 - see below)
  • kanjiQuiz: interactive program that allows a user to choose from various types of quizzes
  • kanjiStats: classifies and counts multi-byte characters in a file or directory tree

The initial goal for this project was to create a program that could parse multi-byte (UTF-8) input and classify Japanese Kanji (漢字) characters into official categories in order to determine how many Kanji fall into each category in real-world examples. The quiz program was added later once the initial work was done for loading and classifying Kanji. The format program was created to help with a specific use-case that came up while gathering sample text from Aozora - it's a small program that relies on some of the generic code created for the stats program.

Project Structure

The project is built using cmake (installed via Homebrew), so there is a CMakeLists.txt file in the top directory that builds five libs (C++ static libraries for now), the four main programs (mentioned in the Introduction) plus all the test code. The tests are written using the GoogleTest framework. The project has the following directories:

  • apps: CMakeLists.txt and a .cpp file for each main program
  • build: generated for build targets and cmake dependencies
  • data: data files described in Kanji Data section
  • docs: docs and PlantUML diagrams
  • scripts: .sh bash scripts for working with Unicode data files
  • libs: has a directory per lib, each containing:
    • include: .h files for the lib
    • src: CMakeLists.txt and .cpp files for the lib
  • tests: has testMain.cpp, an include directory and a directory per lib:
    • each sub-directory has CMakeLists.txt and .cpp files

The five libraries are:

  • utils: utility classes used by all 4 main programs
  • kana: code used by kanaConvert program (depends on utils lib)
  • kanji: code for loading Kanji and Ucd data (depends on kana lib)
  • stats: code used by kanjiStats program (depends on kanji lib)
  • quiz: code used by kanjiQuiz program (depends on kanji lib)

VS Code Setup

The code was written using the VS Code IDE on an M1 Mac and compiles with either clang++ (version 14.0.3) installed via Xcode command-line tools (xcode-select --install) or g++-13 (version 13.1.0) installed via Homebrew (brew install gcc). Some other useful brew formulas for this project are bash, clang-format, cmake, doxygen and gcovr. It should also build on other Unix/Linux systems, but assumptions related to wchar_t and multi-byte handling currently prevent it from compiling on Windows 10.

Here are some links that might help with setup:

Here's a list of VS Code extensions being used:

Notes:

  • Better Comments: can help distinguish /// Doxygen comments by using a different color for tag "/" in "better-comments.tags" (in User Settings)
  • Code Spell Checker: there are lots of word entries for this extension in .vscode/settings.json (mainly caused by all the Japanese words in test code)
  • CodeLLDB: current setup has some limitations (see comments in .vscode/launch.json for more details)
  • PlantUML is used to generate diagrams from the .txt files in docs/diagrams/src. In order to generate them locally graphviz must be installed. On Mac this can be done via brew install --cask temurin; brew install graphviz

Compiler Diagnostic Flags

The code builds without warnings using a large set of diagnostic flags such as -Wall, -Wextra (equivalent to -W), -Wconversion, etc. -Werror is also included to ensure the code remains warning-free. Only one type of warning has been disabled: a check requiring parentheses in some expressions, which seemed excessive. clang-tidy (which is nicely integrated with VS Code) is also used for diagnostics (see .clang-tidy for details on what's being checked).
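As a small illustration of the style a few of these flags encourage, here is a sketch that is not taken from the project. The names Script, scriptName, toByte and isSet are hypothetical, and the snippet has not been checked against the complete flag set listed in this section:

```cpp
#include <cstdint>
#include <string_view>

// Hypothetical enum and helpers used only for this illustration.
enum class Script : std::uint8_t { Hiragana, Katakana, Kanji };

// -Wswitch-enum: every enumerator gets its own case. No 'default' label is
// used since Clang's -Wcovered-switch-default flags one on a covered switch.
[[nodiscard]] constexpr std::string_view scriptName(Script s) noexcept {
  switch (s) {
  case Script::Hiragana: return "Hiragana";
  case Script::Katakana: return "Katakana";
  case Script::Kanji: return "Kanji";
  }
  return "Unknown"; // only reached for a value outside the enumerators
}

// -Wold-style-cast and -Wconversion: the conversion is spelled out with
// static_cast instead of '(unsigned char)c' or an implicit conversion.
[[nodiscard]] constexpr unsigned char toByte(char c) noexcept {
  return static_cast<unsigned char>(c);
}

// -Wzero-as-null-pointer-constant: compare against nullptr, not 0 or NULL.
[[nodiscard]] constexpr bool isSet(const char* p) noexcept {
  return p != nullptr;
}

static_assert(scriptName(Script::Katakana) == "Katakana");
static_assert(!isSet(nullptr));
```

Because every helper is constexpr, the static_assert lines check the behavior at compile time.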

The following table shows flags used per compiler (Common shows flags used for both). Diagnostics enabled by default or enabled via another flag such as -Wall are not included (at least that's the intention):

| Compiler | Standard | Diagnostic Flags | Disabled |
| --- | --- | --- | --- |
| Common | | -Wall -Wconversion -Wdeprecated -Werror -Wextra -Wextra-semi -Wignored-qualifiers -Wnonnull -Wold-style-cast -Wpedantic -Wsuggest-override -Wswitch-enum -Wzero-as-null-pointer-constant | |
| Clang | c++2a | -Wcovered-switch-default -Wduplicate-enum -Wheader-hygiene -Wloop-analysis -Wshadow-all -Wsuggest-destructor-override -Wunreachable-code-aggressive | -Wno-logical-op-parentheses |
| GCC | c++20 | -Wnon-virtual-dtor -Woverloaded-virtual -Wshadow -Wuseless-cast | -Wno-parentheses |

Notes:

C++ Features

An effort was made to use modern C++ features including C++ 11 std::move, std::forward, std::make_shared, nullptr, noexcept, constexpr, etc. Also, uniform initialization and type inference are used whenever possible for consistency. Below are lists of some specific features used from the latest three C++ standard versions:

C++ 20:

C++ 17:

  • if constexpr (expression), [[nodiscard]], initializers in if and switch
  • auto for non-type template parameters, new rules for auto type deduction
  • class template argument deduction (CTAD), so std::make_pair, etc. are not needed
  • inline variables (don't violate one definition rule), optional static_assert message
  • std::filesystem, std::string_view, std::optional and std::size
  • _v helpers instead of value, i.e., std::is_unsigned_v
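The following sketch shows several of the C++ 17 features listed above in one place. It is an illustration only, not code from the project, and all names (DefaultName, findStrokes, widen, Doubled, Example, cpp17Example) are hypothetical:

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <string_view>
#include <type_traits>
#include <utility>

// inline variable: can be defined in a header without violating the ODR
inline constexpr std::string_view DefaultName{"unknown"};

// [[nodiscard]], std::optional and an initializer in 'if'
[[nodiscard]] inline std::optional<int> findStrokes(
    const std::map<std::string, int>& strokes, const std::string& kanji) {
  if (const auto i{strokes.find(kanji)}; i != strokes.end()) return i->second;
  return std::nullopt;
}

// 'if constexpr' with _v helpers: the discarded branch is not instantiated
template<typename T> [[nodiscard]] constexpr auto widen(T x) noexcept {
  static_assert(std::is_integral_v<T>); // the message is optional
  if constexpr (std::is_unsigned_v<T>)
    return static_cast<unsigned long long>(x);
  else
    return static_cast<long long>(x);
}
static_assert(std::is_same_v<decltype(widen(1U)), unsigned long long>);

// 'auto' non-type template parameter ('auto x{expr}' deduces the type of expr)
template<auto N> inline constexpr auto Doubled{N * 2};

// CTAD: std::pair<int, std::string_view> is deduced without std::make_pair
inline const std::pair Example{16, DefaultName};

// usage example that doubles as a quick self-check
inline void cpp17Example() {
  const std::map<std::string, int> strokes{{"龍", 16}};
  assert(findStrokes(strokes, "龍") == 16);
  assert(!findStrokes(strokes, "奉"));
  assert(Doubled<21> == 42);
  assert(Example.second == "unknown");
}
```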

C++ 14:

Kana Convert

The kanaConvert program was created to parse the UniHan XML files (from Unicode Consortium) which have 'On' (音) and 'Kun' (訓) readings, but only in Rōmaji. The program can read stdin and supports various flags for controlling conversion (like Hepburn or Kunrei) and it has an interactive mode. Here are some examples:

$ kanaConvert atatakai
あたたかい
$ kanaConvert kippu
きっぷ
$ echo kippu | kanaConvert -k  # can be used in pipes
キップ
$ echo ジョン・スミス | kanaConvert -r
jon/sumisu
$ echo かんよう かんじ | kanaConvert -r
kan'you kanji
$ kanaConvert -r ラーメン  # uses macrons when converting from 'prolong mark'
rāmen
$ kanaConvert -h rāmen
らーめん
$ kanaConvert -r こゝろ  # supports repeat marks
kokoro
$ kanaConvert -r スヾメ
suzume
$ kanaConvert -k qarutetto  # supports multiple romaji variants:
クァルテット
$ kanaConvert -k kwarutetto
クァルテット
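To give a rough idea of how a Rōmaji to Hiragana conversion like the first two examples could work, here is a minimal sketch. It is not the algorithm used by kanaConvert: the table is a tiny hypothetical subset, and the only rules handled are longest-match lookup and a doubled consonant becoming a small っ:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <string_view>

// Tiny illustrative table (hypothetical): the real program supports the
// combinations shown in its Kana chart.
inline const std::map<std::string_view, std::string_view> RomajiToHiragana{
  {"a", "あ"}, {"i", "い"}, {"u", "う"}, {"ka", "か"}, {"ki", "き"},
  {"ta", "た"}, {"pu", "ぷ"}, {"n", "ん"}};

[[nodiscard]] inline std::string toHiragana(std::string_view in) {
  constexpr std::string_view Vowels{"aiueo"};
  std::string out;
  while (!in.empty()) {
    // doubled consonant (other than 'n') becomes a small tsu: "ppu" -> っぷ
    if (in.size() > 1 && in[0] == in[1] && in[0] != 'n' &&
        Vowels.find(in[0]) == std::string_view::npos) {
      out += "っ";
      in.remove_prefix(1);
      continue;
    }
    // greedy match: try the longest key first (3, 2, then 1 letters)
    bool matched{false};
    for (auto len{std::min<std::size_t>(3, in.size())}; len > 0; --len)
      if (const auto i{RomajiToHiragana.find(in.substr(0, len))};
          i != RomajiToHiragana.end()) {
        out += i->second;
        in.remove_prefix(len);
        matched = true;
        break;
      }
    if (!matched) { // unknown input is copied through unchanged
      out += in.front();
      in.remove_prefix(1);
    }
  }
  return out;
}

// usage example mirroring the first two commands above
inline void toHiraganaExample() {
  assert(toHiragana("atatakai") == "あたたかい");
  assert(toHiragana("kippu") == "きっぷ");
}
```

A lookup table keyed on Rōmaji strings keeps the sketch short. Supporting variants such as 'qa' and 'kwa' would then just mean more keys mapping to the same Kana.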

Kana Conversion Chart

Passing '-p' to kanaConvert causes it to print out a Kana Chart that shows the Rōmaji letter combinations that are supported along with some notes and totals. The output is aligned properly in a terminal using a fixed font (or an IDE depending on the font - see Table.h for more details). However, the output doesn't align properly in a Markdown code block (wide to narrow character ratio isn't exactly 2:1) so there's also a '-m' option to print using markdown formatting.

  • Note: the terminal output (-p) puts a border line between sections (sections for the Kana chart table are groups of related Kana symbols, i.e., 'a', 'ka', 'sa', etc.), but for markdown (-m) rows starting a section are in bold instead:

[Figure: Kana Conversion Chart]

Kana Class Diagram

The following diagram shows the Kana class hierarchy as well as some of the public methods.

[Figure: Kana Class Diagram]

See Kana.h for details, but in summary, the derived classes are:

  • DakutenKana: represents a Kana that has a dakuten version. It holds an AccentedKana accessible via the overridden dakuten() method to return the accented form, i.e., [ka, か, カ] is an instance of DakutenKana and calling dakuten() on it returns [ga, が, ガ]
  • HanDakutenKana: derives from DakutenKana and holds an AccentedKana accessible via the overridden hanDakuten() method - this class is used for ha-gyō (は-行) Kana which have both dakuten and hanDakuten versions.
  • AccentedKana: has a pointer back to its plain holder
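The relationships described in this list can be sketched as follows. This is a simplified illustration and not the code in Kana.h: the Forms helper, the plain() accessor and the constructor signatures are assumptions made to keep the example short:

```cpp
#include <cassert>
#include <string>
#include <utility>

struct Forms { std::string romaji, hiragana, katakana; }; // hypothetical helper

class Kana {
public:
  explicit Kana(Forms f) : _forms{std::move(f)} {}
  virtual ~Kana() = default;
  Kana(const Kana&) = delete; // accented Kana keep a pointer to their holder
  Kana& operator=(const Kana&) = delete;

  // a plain Kana with no accented versions returns nullptr from these
  [[nodiscard]] virtual const Kana* dakuten() const { return nullptr; }
  [[nodiscard]] virtual const Kana* hanDakuten() const { return nullptr; }
  // non-null only for an accented Kana
  [[nodiscard]] virtual const Kana* plain() const { return nullptr; }

  [[nodiscard]] const std::string& romaji() const { return _forms.romaji; }
  [[nodiscard]] const std::string& hiragana() const { return _forms.hiragana; }
  [[nodiscard]] const std::string& katakana() const { return _forms.katakana; }
private:
  const Forms _forms;
};

class AccentedKana final : public Kana {
public:
  AccentedKana(const Kana& plain, Forms f)
      : Kana{std::move(f)}, _plain{&plain} {}
  [[nodiscard]] const Kana* plain() const override { return _plain; }
private:
  const Kana* const _plain; // pointer back to the plain holder
};

class DakutenKana : public Kana {
public:
  DakutenKana(Forms plain, Forms dakuten)
      : Kana{std::move(plain)}, _dakuten{*this, std::move(dakuten)} {}
  [[nodiscard]] const Kana* dakuten() const override { return &_dakuten; }
private:
  const AccentedKana _dakuten;
};

class HanDakutenKana final : public DakutenKana {
public:
  HanDakutenKana(Forms plain, Forms dakuten, Forms hanDakuten)
      : DakutenKana{std::move(plain), std::move(dakuten)},
        _hanDakuten{*this, std::move(hanDakuten)} {}
  [[nodiscard]] const Kana* hanDakuten() const override { return &_hanDakuten; }
private:
  const AccentedKana _hanDakuten;
};

// usage example: [ka, か, カ] returns [ga, が, ガ] from dakuten()
inline void kanaExample() {
  const DakutenKana ka{{"ka", "か", "カ"}, {"ga", "が", "ガ"}};
  assert(ka.dakuten()->hiragana() == "が");
  assert(ka.dakuten()->plain() == &ka);
  assert(ka.hanDakuten() == nullptr);
}
```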

Kanji Data

To support the kanjiStats and kanjiQuiz programs, the KanjiData class loads Kanji and breaks them down into the following categories:

  • Jouyou: 2136 official Jōyō (常用) Kanji
  • Jinmei: 633 official Jinmeiyō (人名用) Kanji
  • LinkedJinmei: 230 more Jinmei Kanji that are old/variant forms of Jōyō (212) or Jinmei (18)
  • LinkedOld: 213 old/variant Jōyō Kanji that aren't in 'Linked Jinmei'
  • Frequency: Kanji that are in the top 2501 frequency list, but not one of the first 4 types
  • Extra: Kanji loaded from 'extra.txt' - shouldn't be in any of the above types
  • Kentei: Kanji loaded from 'kentei/*' - Kanji Kentei (漢字検定) that aren't any of the above types
  • Ucd: Kanji that are in 'ucd.txt', but not already one of the above types
  • None: Kanji that haven't been loaded from any files

Kanji Class Diagram

The following diagram shows the Kanji class hierarchy (8 classes are concrete). Most of the public methods are included, but the types are simplified for the diagram, i.e., std::optional<std::string> is shown as Optional<String>, std::vector<std::string> is shown as List<String>, etc.

[Figure: Kanji Class Diagram]

JLPT Kanji

Note that JLPT level lists have not been official since 2010. Also, each level file only contains Kanji that are new for that level (as opposed to some N2 and N1 lists on the web that repeat some Kanji from earlier levels). The levels have the following number of Kanji:

  • N5: 103 -- all Jōyō
  • N4: 181 -- all Jōyō
  • N3: 361 -- all Jōyō
  • N2: 415 -- all Jōyō
  • N1: 1162 -- 911 Jōyō, 251 Jinmeiyō

All Kanji in levels N5 to N2 are in the Top 2501 frequency list, but N1 contains 25 Jōyō and 83 Jinmeiyō Kanji that are not in the Top 2501 frequency list.

Jōyō Kanji

Kyōiku (教育) Kanji grades are included in the Jōyō list. Here is a breakdown of the count per grade as well as how many per JLPT level per grade (None means not included in any of the JLPT levels):

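| Grade | Total | N5 | N4 | N3 | N2 | N1 | None |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 80 | 57 | 15 | 8 | | | |
| 2 | 160 | 43 | 74 | 43 | | | |
| 3 | 200 | 3 | 67 | 130 | | | |
| 4 | 200 | | 20 | 180 | | | |
| 5 | 185 | | 2 | | 149 | 34 | |
| 6 | 181 | | 3 | | 105 | 73 | |
| S | 1130 | | | | 161 | 804 | 165 |
| Total | 2136 | 103 | 181 | 361 | 415 | 911 | 165 |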

Total for all grades is the same as the total Jōyō (2136) and all are in the Top 2501 frequency list except for 99 S (Secondary School) Kanji.
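The totals in the table above can be cross-checked with a few lines of compile-time arithmetic. This is only a sanity check of the published numbers; the array and function names are made up for the example:

```cpp
#include <array>
#include <cstddef>

// Per-grade totals from the table above (grades 1 to 6, then S)
inline constexpr std::array GradeTotals{80, 160, 200, 200, 185, 181, 1130};
// Jōyō Kanji per JLPT level (N5 to N1) followed by those in no level
inline constexpr std::array JouyouPerLevel{103, 181, 361, 415, 911, 165};

template<std::size_t N>
[[nodiscard]] constexpr int sum(const std::array<int, N>& a) noexcept {
  int total{0};
  for (const auto i : a) total += i;
  return total;
}

// both breakdowns add up to the 2136 Jōyō Kanji
static_assert(sum(GradeTotals) == 2136);
static_assert(sum(JouyouPerLevel) == 2136);
```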

The program also loads the 214 official Kanji radicals (部首).

Data Directory

The data directory contains the following files:

  • jouyou.txt: loaded from here - note, the radicals in this list reflect the original radicals from Kāngxī Zìdiǎn / 康煕字典(こうきじてん) so a few characters have the radicals of their old form, i.e., 円 has radical 口 (from the old form 圓).
  • jinmei.txt: loaded from here and most of the readings from here
  • linked-jinmei.txt: loaded from here
  • frequency.txt: top 2501 frequency Kanji loaded from KanjiCards
  • extra.txt: holds details for 'extra Kanji of interest' not already in the above four files
  • ucd.txt: data extracted from Unicode 'UCD' (see parseUcdAllFlat.sh for details and links)
  • frequency-readings.txt: holds readings of some Top Frequency Kanji that aren't in Jouyou or Jinmei lists
  • radicals.txt: loaded from here
  • jlpt/n[1-5].txt: loaded from various sites such as FreeTag and JLPT Study.
  • kentei/k*.txt: loaded from here
  • jukugo/*.txt: loaded from here
  • meaning-groups.txt: meant to hold groups of Kanji with related meanings (see Group.h for more details) - some ideas came from here
  • pattern-groups.txt: meant to hold groups of Kanji with related patterns (see Group.h for more details)

No external databases are used so far, but while writing some of the code (like in UnicodeBlock.h for example), the following links were very useful: Unicode Official Site - Charts and Compat.

The following 'strokes' related files used to be in the data directory, but strokes are now loaded from ucd.txt and used for all Kanji types except Jouyou and Extra (their files have a Strokes column). Ucd data has some unexpected stroke counts here and there (see parseUcdAllFlat.sh for a more detailed explanation), but so did the below files:

  • strokes.txt: loaded from here - covers Jinmeiyō Kanji and some old forms.
  • wiki-strokes.txt: loaded from here - mainly Jōyō, but also includes a few 'Frequency' type Kanji.

Kanji Quiz

The kanjiQuiz program supports running various types of quizzes (in review or test mode) as well as looking up details of a Kanji from the command-line. If no options are provided then the user is prompted for mode, quiz type, etc. or command-line options can be used to jump directly to the desired type of quiz or Kanji lookup. The following is the output from the -h (help) option:

kanjiQuiz [-hs] [-f[1-5] | -g[1-6s] | -k[1-9a-c] | -l[1-5] -m[1-4] | -p[1-4]]
          [-r[num] | -t[num]] [kanji]
    -h   show this help message for command-line options
    -s   show English meanings by default (can be toggled on/off later)

  The following options allow choosing the quiz/review type optionally followed
  by question list type (grade, level, etc.) instead of being prompted:
    -f   'frequency' (optional frequency group '0-9')
    -g   'grade' (optional grade '1-6', 's' = Secondary School)
    -k   'kyu' (optional Kentei Kyu '1-9', 'a' = 10, 'b' = 準1級, 'c' = 準2級)
    -l   'level' (optional JLPT level number '1-5')
    -m   'meaning' (optional Kanji type '1-4')
    -p   'pattern' (optional Kanji type '1-4')

  The following options can be followed by a 'num' to specify where to start in
  the question list (use negative to start from the end or 0 for random order).
    -r   review mode
    -t   test mode

  kanji  show details for a Kanji instead of starting a review or test

Examples:
  kanjiQuiz -f        # start 'frequency' quiz (prompts for 'bucket' number)
  kanjiQuiz -r40 -l1  # start 'JLPT N1' review beginning at the 40th entry

Note: 'kanji' can be UTF-8, frequency (between 1 and 2501), 'm' followed by
Morohashi ID (index in Dai Kan-Wa Jiten), 'n' followed by Classic Nelson ID
or 'u' followed by Unicode. For example, theses all produce the same output:
  kanjiQuiz 奉
  kanjiQuiz 1624
  kanjiQuiz m5894
  kanjiQuiz n212
  kanjiQuiz u5949
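The different forms of the 'kanji' argument described in the help text could be told apart roughly as follows. This is a hedged sketch rather than the program's actual parsing code: LookupType, Lookup, parseNumber and parseLookup are hypothetical names, IDs are assumed to be purely numeric, and no range checking (such as 1 to 2501 for frequency) is done:

```cpp
#include <cassert>
#include <charconv>
#include <cstdint>
#include <string_view>
#include <system_error>

enum class LookupType : std::uint8_t {
  Utf8, Frequency, Morohashi, Nelson, Unicode
};

struct Lookup {
  LookupType type;
  unsigned long value; // parsed number (0 for Utf8)
};

// parse all of 's' as an unsigned number in 'base' (false on failure)
inline bool parseNumber(std::string_view s, int base, unsigned long& out) {
  const auto end{s.data() + s.size()};
  const auto [ptr, ec]{std::from_chars(s.data(), end, out, base)};
  return !s.empty() && ec == std::errc{} && ptr == end;
}

[[nodiscard]] inline Lookup parseLookup(std::string_view arg) {
  unsigned long n{};
  // a plain number is treated as a frequency
  if (parseNumber(arg, 10, n)) return {LookupType::Frequency, n};
  if (arg.size() > 1) {
    const auto rest{arg.substr(1)};
    switch (arg.front()) {
    case 'm':
      if (parseNumber(rest, 10, n)) return {LookupType::Morohashi, n};
      break;
    case 'n':
      if (parseNumber(rest, 10, n)) return {LookupType::Nelson, n};
      break;
    case 'u': // Unicode values are hexadecimal
      if (parseNumber(rest, 16, n)) return {LookupType::Unicode, n};
      break;
    default: break;
    }
  }
  return {LookupType::Utf8, 0}; // anything else is taken as UTF-8 text
}

// usage example based on the arguments shown in the help text
inline void parseLookupExample() {
  assert(parseLookup("1624").type == LookupType::Frequency);
  assert(parseLookup("m5894").type == LookupType::Morohashi);
  assert(parseLookup("u5949").value == 0x5949);
  assert(parseLookup("奉").type == LookupType::Utf8);
}
```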

When using the quiz program to look up a Kanji, the output includes a brief legend followed by some details such as Radical, Strokes, Pinyin, Frequency, Old or New variants, Meaning, Reading, etc. The Similar list comes from the pattern-groups.txt file and (the very ad-hoc) Category comes from the meaning-groups.txt file. Morohashi and Nelson IDs are shown if they exist as well as any Jukugo examples loaded from data/jukugo files (there are only about 18K Jukugo entries so these lists are pretty limited).

~/cdev/kanji-tools $ ./build/apps/kanjiQuiz 龍
>>> Legend:
Fields: N[1-5]=JLPT Level, K[1-10]=Kentei Kyu, G[1-6]=Grade (S=Secondary School)
Suffix: .=常用 '=JLPT "=Freq ^=人名用 ~=LinkJ %=LinkO +=Extra @=検定 #=1級 *=Ucd

Showing details for 龍 [9F8D], Block CJK, Version 1.1, LinkedJinmei
Rad 龍(212), Strokes 16, lóng, Frq 1734, New 竜*
    Meaning: dragon
    Reading: リュウ、たつ
    Similar: 襲. 籠. 寵^ 瀧~ 朧+ 聾@ 壟# 蘢# 隴# 瓏#
  Morohashi: 48818
 Nelson IDs: 3351 5440
   Category: [動物:爬虫類]
     Jukugo: 龍頭蛇尾(りゅうとうだび) 烏龍茶(うーろんちゃ) 画龍点睛(がりょうてんせい)

Here are some runtime memory and (statically linked) file sizes for kanjiQuiz. Stats are more relevant for the quiz program compared to the others since it loads more Kanji related data including groups and jukugo. Sanitize stats are only available for Clang (this is the default debug setup when building the project) - they cause a lot more runtime memory to be used.

Kanji Quiz Runtime Memory

| Compiler | Debug Sanitize | Debug | Release |
| --- | --- | --- | --- |
| Clang | 124.4 MB | 24.4 MB | 24.7 MB |
| GCC | | 34.3 MB | 33.8 MB |

Kanji Quiz Binary File Size

| Compiler | Debug Sanitize | Debug | Release |
| --- | --- | --- | --- |
| Clang | 14 MB | 10 MB | 883 KB |
| GCC | | 4.6 MB | 1.2 KB |

Kanji Stats

The kanjiStats program takes a list of one or more files (or directories) and outputs a summary of counts of various types of multi-byte characters. More detailed information can also be shown depending on command-line options. In order to get more accurate stats about percentages of Kanji, Hiragana and Katakana, the program attempts to strip away all Furigana before counting.
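The core idea of counting character types can be sketched by decoding UTF-8 into code points and checking which Unicode block each one falls in. The sketch below is an assumption-laden simplification, not the kanjiStats implementation: it assumes valid UTF-8 input, only looks at the Hiragana (U+3040 to U+309F), Katakana (U+30A0 to U+30FF) and CJK Unified Ideographs (U+4E00 to U+9FFF) blocks, and uses hypothetical names:

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

struct Counts {
  std::size_t hiragana{}, katakana{}, kanji{}, other{};
};

// Decode the code point starting at s[i] and advance i past it. Input is
// assumed to be valid UTF-8 (no error handling in this sketch).
inline char32_t nextCodePoint(std::string_view s, std::size_t& i) {
  assert(i < s.size());
  const auto b{static_cast<unsigned char>(s[i++])};
  if (b < 0x80) return b; // single byte (ASCII)
  // the lead byte determines the number of continuation bytes
  const int extra{b >= 0xF0 ? 3 : b >= 0xE0 ? 2 : 1};
  // lead byte payload mask is 0x1F, 0x0F or 0x07 for 1, 2 or 3 extra bytes
  auto c{static_cast<char32_t>(b & (0x3F >> extra))};
  for (int j{0}; j < extra && i < s.size(); ++j)
    c = (c << 6) | (static_cast<unsigned char>(s[i++]) & 0x3FU);
  return c;
}

[[nodiscard]] inline Counts countScripts(std::string_view s) {
  Counts n;
  for (std::size_t i{0}; i < s.size();) {
    const auto c{nextCodePoint(s, i)};
    if (c >= 0x3040 && c <= 0x309F)
      ++n.hiragana;
    else if (c >= 0x30A0 && c <= 0x30FF)
      ++n.katakana;
    else if (c >= 0x4E00 && c <= 0x9FFF)
      ++n.kanji;
    else
      ++n.other;
  }
  return n;
}

// usage example: two Kanji, one Hiragana, two Katakana and one ASCII character
inline void countScriptsExample() {
  const auto n{countScripts("漢字とカナ!")};
  assert(n.kanji == 2 && n.hiragana == 1);
  assert(n.katakana == 2 && n.other == 1);
}
```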

Here is the output from processing a set of files containing lyrics for 中島みゆき (Miyuki Nakajima) songs:

~/cdev/kanji-tools $ ./build/apps/kanjiStats ~/songs
>>> Stats for: 'songs' (634 files from 62 directories) - showing top 5 Kanji per type
>>> Furigana Removed: 436, Combining Marks Replaced: 253, Variation Selectors: 0
>>>         Hiragana: 146379, unique:   77
>>>         Katakana:   9315, unique:   79
>>>     Common Kanji:  52406, unique: 1642, 100.00%
>>>        [Jouyou] :  50804, unique: 1398,  96.94%  (人 1440, 私 836, 日 785, 見 750, 何 626)
>>>        [Jinmei] :    986, unique:  114,   1.88%  (逢 95, 叶 68, 淋 56, 此 44, 遥 42)
>>>  [LinkedJinmei] :     36, unique:    7,   0.07%  (駈 13, 龍 10, 遙 5, 凛 3, 國 2)
>>>     [Frequency] :    203, unique:   15,   0.39%  (嘘 112, 叩 15, 呑 15, 頬 12, 叱 11)
>>>         [Extra] :    377, unique:  108,   0.72%  (怯 29, 騙 21, 囁 19, 繋 19, 禿 16)
>>>   MB-Punctuation:    946, unique:   12
>>>        MB-Symbol:     13, unique:    2
>>>        MB-Letter:   1429, unique:   54
>>> Total Kana+Kanji: 208100 (Hiragana: 70.3%, Katakana: 4.5%, Kanji: 25.2%)

Aozora

There is also a tests/stats/sample-data directory that contains files used for testing. The wiki-articles directory contains text from several wiki pages and the books directory contains text from books found on 青空文庫 (Aozora Bunko) (with furigana preserved in wide brackets).

The books pulled from Aozora were in Shift JIS format so the following steps were used on macOS to convert them to UTF-8:

  • Load the HTML version of the book in Safari
  • Select All, then Copy-Paste to Notes - this keeps the furigana, but puts it on a separate line
  • Open file1 in Terminal using vi and paste in the text from Notes, then save and exit.
    • Copying straight from the browser to vi puts the furigana immediately after the Kanji (with no space, brackets, newline, etc.) which makes it pretty much impossible to 'regex' it out when producing stats (and difficult to read as well).
    • Extremely rare Kanji that are just embedded images in the HTML (instead of real Shift JIS values) do show up in Notes, but of course they don't end up getting pasted into the plain text file in vi. These need to be entered by hand (by choosing the closest Unicode equivalent).
    • MS Word also captures the furigana from the HTML, but it ends up being above unrelated text. When pasting to vi the furigana is put in standard brackets, but in incorrect locations which makes it useless (but at least it can be easily removed which is better than the straight to vi option). However, a more serious problem is that MS Word (macOS version 2019) also seemed to randomly drop parts of the text (maybe an encoding conversion issue?) which was a showstopper.
  • Run the kanjiFormat program (from build/apps) on file1 and redirect the output to file2
  • file2 should now have properly formatted furigana in wide brackets following the Kanji Sequence on the same line.
  • Run 'fold file2>file1' to split the really long lines at 80 columns (the default width for fold).
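Once furigana is consistently placed in wide brackets, removing it before counting becomes a simple scan. The following sketch assumes that 'wide brackets' means the full-width parentheses （ (U+FF08) and ） (U+FF09), and it removes every bracketed span without checking that the span follows Kanji or contains only Kana, so it is much cruder than what a real stats program would need. The function name is hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

[[nodiscard]] inline std::string stripWideBrackets(std::string_view in) {
  constexpr std::string_view Open{"（"}, Close{"）"}; // U+FF08 and U+FF09
  constexpr auto npos{std::string_view::npos};
  std::string out;
  for (std::size_t pos{0}; pos < in.size();) {
    const auto open{in.find(Open, pos)};
    // copy everything up to the next opening bracket (or to the end)
    out += in.substr(pos, open == npos ? npos : open - pos);
    if (open == npos) break;
    const auto close{in.find(Close, open + Open.size())};
    if (close == npos) { // unmatched bracket: keep the rest unchanged
      out += in.substr(open);
      break;
    }
    pos = close + Close.size(); // skip the bracketed span
  }
  return out;
}

// usage example
inline void stripWideBracketsExample() {
  assert(stripWideBrackets("漢字（かんじ）を読（よ）む") == "漢字を読む");
}
```

Searching for the bracket byte sequences directly is safe in UTF-8 because a multi-byte sequence can never match in the middle of another character.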

Helpful Commands

Below are some bash commands that were used while creating this project:

# re-order columns
awk -F'[\t]' -v OFS="\t" '{print $1,$2,$4,$5,$3,$6,$7,$8,$9}' file
# re-number file assuming first column should be a number column starting at 1
awk -F'[\t]' -v OFS="\t" 'NR==1{print}NR>1{for(i=1;i<=NF;i++) printf "%s",(i>1 ? OFS $i : NR-1);print ""}' file
# convert wide numbers to normal single byte numbers (and delete a character)
cat file|tr '１２３４５６７８９０' '1234567890'|tr -d ''
# add newline to end of a file if missing (skip build, out and .git dirs)
find . -not \( -name build -prune \) -not \( -name .git -prune \) -not \( -name out -prune \) -type f | while IFS= read -r f; do tail -n1 "$f" | read -r _ || echo >> "$f"; done
