[TOC] GitHub automatically generates a Table of Contents in the header (since this change in April 2021). Note: relative links to directories and some file types don't work from the Doxygen-generated main page.
This repository contains code for four main programs:
- kanaConvert: program that converts between Hiragana, Katakana and Rōmaji
- kanjiFormat: program used to format sample-data files (from 青空文庫 - see below)
- kanjiQuiz: interactive program that allows a user to choose from various types of quizzes
- kanjiStats: classifies and counts multi-byte characters in a file or directory tree
The initial goal for this project was to create a program that could parse multi-byte (UTF-8) input and classify Japanese Kanji (漢字) characters into official categories in order to determine how many Kanji fall into each category in real-world examples. The quiz program was added later once the initial work was done for loading and classifying Kanji. The format program was created to help with a specific use-case that came up while gathering sample text from Aozora - it's a small program that relies on some of the generic code created for the stats program.
The project is built using cmake (installed via Homebrew), so there is a CMakeLists.txt file in the top directory that builds five libs (C++ static libraries for now), the four main programs (mentioned in the Introduction) plus all the test code. The tests are written using the GoogleTest framework. The project has the following directories:
- apps: CMakeLists.txt and a .cpp file for each main program
- build: generated for build targets and cmake dependencies
- data: data files described in Kanji Data section
- docs: docs and PlantUML diagrams
- scripts: .sh bash scripts for working with Unicode data files
- libs: has a directory per lib, each containing:
  - include: .h files for the lib
  - src: CMakeLists.txt and .cpp files for the lib
- tests: has testMain.cpp, an include directory and a directory per lib:
  - each sub-directory has CMakeLists.txt and .cpp files
The five libraries are:
- utils: utility classes used by all 4 main programs
- kana: code used by kanaConvert program (depends on utils lib)
- kanji: code for loading Kanji and Ucd data (depends on kana lib)
- stats: code used by kanjiStats program (depends on kanji lib)
- quiz: code used by kanjiQuiz program (depends on kanji lib)
The code was written using the VS Code IDE on an M1 Mac and compiles with either clang++ (version 14.0.3) installed via Xcode command-line tools (`xcode-select --install`) or g++-13 (version 13.1.0) installed via Homebrew (`brew install gcc`). Some other useful brew formulas for this project are: `bash`, `clang-format`, `cmake`, `doxygen` and `gcovr`. It should also build on other Unix/Linux systems, but there are assumptions related to `wchar_t` and multi-byte handling that won't currently compile on Windows 10.
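For example, `wchar_t` is 32 bits on Unix-like systems so a single `wchar_t` can hold any Unicode code point, whereas it's only 16 bits on Windows. Here's a minimal sketch (not code from this repo) of that kind of assumption:

```cpp
#include <iostream>

int main() {
  // holds on Linux/macOS where wchar_t is 32 bits, but fails on Windows
  // where wchar_t is 16 bits and code points above U+FFFF need surrogates
  static_assert(sizeof(wchar_t) == 4, "expected 32-bit wchar_t");
  const wchar_t kanji{L'龍'}; // a single wchar_t holds U+9F8D
  std::cout << static_cast<unsigned long>(kanji) << '\n'; // prints 40845
}
```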
Here are some links that might help with setup:
Here's a list of VS Code extensions being used:
- Code/Build: C/C++, CMake Tools, CMake Language Support, Makefile Tools, .NET
- Format: Code Spell Checker, shell-format
- Test: C++ TestMate, Test Adapter Converter, Test Explorer UI
- Debug: CodeLLDB, Hex Editor
- Docs: Doxygen Document Generator, PlantUML
- Aesthetic: Better Comments, Clock
Notes:
- Better Comments: can help distinguish `///` Doxygen comments by using a different color for tag "/" in "better-comments.tags" (in User Settings)
- Code Spell Checker: there are lots of word entries for this extension in .vscode/settings.json (mainly caused by all the Japanese words in test code)
- CodeLLDB: current setup has some limitations (see comments in .vscode/launch.json for more details)
- PlantUML is used to generate diagrams from the .txt files in docs/diagrams/src. In order to generate them locally, graphviz must be installed. On Mac this can be done via `brew install --cask temurin; brew install graphviz`
The code builds without warnings using a large set of diagnostic flags such as `-Wall`, `-Wextra` (equivalent to `-W`), `-Wconversion`, etc.. `-Werror` is also included to ensure the code remains warning-free. Finally, only one type of warning has been disabled (warnings demanding parentheses for some expressions, which seemed excessive). clang-tidy (which is nicely integrated with VS Code) is also being used for diagnostics (see .clang-tidy for details on what's being checked).
The following table shows flags used per compiler (Common shows flags used for both). Diagnostics enabled by default or enabled via another flag such as -Wall are not included (at least that's the intention):
Compiler | Standard | Diagnostic Flags | Disabled |
---|---|---|---|
Common | | -Wall -Wconversion -Wdeprecated -Werror -Wextra -Wextra-semi -Wignored-qualifiers -Wnonnull -Wold-style-cast -Wpedantic -Wsuggest-override -Wswitch-enum -Wzero-as-null-pointer-constant | |
Clang | c++2a | -Wcovered-switch-default -Wduplicate-enum -Wheader-hygiene -Wloop-analysis -Wshadow-all -Wsuggest-destructor-override -Wunreachable-code-aggressive | -Wno-logical-op-parentheses |
GCC | c++20 | -Wnon-virtual-dtor -Woverloaded-virtual -Wshadow -Wuseless-cast | -Wno-parentheses |
Notes:
- -Wpedantic means 'Issue all the warnings demanded by strict ISO C++'
- -Wswitch-enum warns about unhandled enum values even when a 'default' label exists (so it's very strict), whereas -Wswitch (enabled by default) only warns about a missing enum value when there is no 'default' - see the sketch after this list.
- Here are links for more info:
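To illustrate the -Wswitch-enum note, here's a sketch using a hypothetical `Grade` enum (not a type from this project):

```cpp
enum class Grade { G1, G2, Secondary };

int toInt(Grade g) {
  switch (g) {
  case Grade::G1: return 1;
  case Grade::G2: return 2;
  // 'default' silences -Wswitch, but -Wswitch-enum still warns that
  // Grade::Secondary has no explicit 'case' label
  default: return 0;
  }
}

int main() { return toInt(Grade::Secondary); }
```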
An effort was made to use modern C++ features including C++ 11 `std::move`, `std::forward`, `std::make_shared`, `nullptr`, `noexcept`, `constexpr`, etc.. Also, uniform initialization and type inference are used whenever possible for consistency. Below are lists of some specific features used from the latest three C++ standard versions:

C++ 20:

- `concept`, `consteval` and `constinit`, initializers in range-based `for`
- `[=, this]` as lambda capture, templated lambdas
- three-way comparison, class types in non-typed templates
- `using` on scoped enums, `contains`, `starts_with` and `ends_with`

C++ 17:

- `if constexpr (expression)`, `[[nodiscard]]`, initializers in `if` and `switch`
- `auto` for non-typed template, new rules for `auto` type deduction
- class template argument deduction (CTAD), so don't need `std::make_pair`, etc..
- inline variables (don't violate one definition rule), optional `static_assert` message
- `std::filesystem`, `std::string_view`, `std::optional` and `std::size`
- `_v` helpers instead of `value`, i.e., `std::is_unsigned_v`

C++ 14:

- function return deduction, relaxed constexpr restrictions, variable templates
- binary literals, digit separators, generic lambdas
- `_t` helpers instead of `type`, i.e., `std::underlying_type_t`
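Here's a small self-contained sample (illustrative only, not code from this repo) that touches several of the features listed above:

```cpp
#include <compare>
#include <string>
#include <type_traits>
#include <vector>

// a C++ 20 'concept' using a '_v' helper instead of '::value'
template <typename T>
concept Unsigned = std::is_unsigned_v<T>;

template <Unsigned T> constexpr T half(T x) { return x / 2; }

struct Version { // three-way comparison generates all relational operators
  int maj, rev;
  auto operator<=>(const Version&) const = default;
};

int main() {
  static_assert(half(10u) == 5u); // the message is optional since C++ 17
  std::string s{"kanjiQuiz"};
  const bool names{s.starts_with("kanji") && s.ends_with("Quiz")}; // C++ 20
  std::vector v{1, 2, 3}; // CTAD (C++ 17) deduces std::vector<int>
  return names && Version{2, 1} > Version{2, 0} && v.size() == 3 ? 0 : 1;
}
```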
The kanaConvert program was created to parse the UniHan XML files (from Unicode Consortium) which have 'On' (音) and 'Kun' (訓) readings, but only in Rōmaji. The program can read stdin and supports various flags for controlling conversion (like Hepburn or Kunrei) and it has an interactive mode. Here are some examples:
```
$ kanaConvert atatakai
あたたかい
$ kanaConvert kippu
きっぷ
$ echo kippu | kanaConvert -k # can be used in pipes
キップ
$ echo ジョン・スミス | kanaConvert -r
jon/sumisu
$ echo かんよう かんじ | kanaConvert -r
kan'you kanji
$ kanaConvert -r ラーメン # uses macrons when converting from 'prolong mark'
rāmen
$ kanaConvert -h rāmen
らーめん
$ kanaConvert -r こゝろ # supports repeat marks
kokoro
$ kanaConvert -r スヾメ
suzume
$ kanaConvert -k qarutetto # supports multiple romaji variants:
クァルテット
$ kanaConvert -k kwarutetto
クァルテット
```
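The 'kippu' example above shows the sokuon rule where a doubled consonant becomes a small っ. Here's a minimal table-driven sketch of that rule (the real conversion logic lives in the kana lib and handles far more cases):

```cpp
#include <iostream>
#include <map>
#include <string>

int main() {
  // toy syllable table - the real kana lib covers all Rōmaji combinations
  const std::map<std::string, std::string> kana{
      {"ki", "き"}, {"pu", "ぷ"}, {"ka", "か"}, {"n", "ん"}};
  std::string in{"kippu"}, out;
  for (std::string::size_type i{}; i < in.size();) {
    if (i + 1 < in.size() && in[i] == in[i + 1] && in[i] != 'n') {
      out += "っ"; // doubled consonant becomes sokuon
      ++i;
    } else if (auto j{kana.find(in.substr(i, 2))}; j != kana.end()) {
      out += j->second;
      i += 2;
    } else if (auto j{kana.find(in.substr(i, 1))}; j != kana.end()) {
      out += j->second;
      ++i;
    } else
      ++i; // skip anything this toy table doesn't know
  }
  std::cout << out << '\n'; // prints きっぷ
}
```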
Passing '-p' to kanaConvert causes it to print out a Kana Chart that shows the Rōmaji letter combinations that are supported along with some notes and totals. The output is aligned properly in a terminal using a fixed font (or an IDE depending on the font - see Table.h for more details). However, the output doesn't align properly in a Markdown code block (wide to narrow character ratio isn't exactly 2:1) so there's also a '-m' option to print using markdown formatting.
- Note: the terminal output (-p) puts a border line between sections (sections for the Kana chart table are groups of related Kana symbols, i.e., 'a', 'ka', 'sa', etc.), but for markdown (-m) rows starting a section are in bold instead:
The following diagram shows the Kana class hierarchy as well as some of the public methods.
See Kana.h for details, but in summary, the derived classes are:
- DakutenKana: represents a Kana that has a dakuten version. It holds an AccentedKana accessible via the overridden `dakuten()` method to return the accented form, i.e., [ka, か, カ] is an instance of DakutenKana and calling `dakuten()` on it returns [ga, が, ガ]
- HanDakutenKana: derives from DakutenKana and holds an AccentedKana accessible via the overridden `hanDakuten()` method - this class is used for ha-gyō (は-行) Kana which have both dakuten and hanDakuten versions.
- AccentedKana: has a pointer back to its plain holder
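Here's a simplified C++ sketch of this hierarchy (the real classes are in Kana.h; constructor shapes and member names below are illustrative only):

```cpp
#include <string>
#include <utility>

class Kana {
public:
  Kana(std::string r, std::string h, std::string k)
      : _romaji{std::move(r)}, _hiragana{std::move(h)},
        _katakana{std::move(k)} {}
  virtual ~Kana() = default;
  // plain Kana has no accented forms - derived classes override these
  virtual const Kana* dakuten() const { return nullptr; }
  virtual const Kana* hanDakuten() const { return nullptr; }
  const std::string& romaji() const { return _romaji; }
private:
  std::string _romaji, _hiragana, _katakana;
};

// an accented Kana keeps a pointer back to its plain holder
class AccentedKana : public Kana {
public:
  AccentedKana(std::string r, std::string h, std::string k, const Kana* plain)
      : Kana{std::move(r), std::move(h), std::move(k)}, _plain{plain} {}
private:
  const Kana* _plain;
};

// [ka, か, カ] is a DakutenKana whose dakuten() returns [ga, が, ガ]
class DakutenKana : public Kana {
public:
  DakutenKana(std::string r, std::string h, std::string k, std::string ar,
              std::string ah, std::string ak)
      : Kana{std::move(r), std::move(h), std::move(k)},
        _dakuten{std::move(ar), std::move(ah), std::move(ak), this} {}
  const Kana* dakuten() const override { return &_dakuten; }
private:
  AccentedKana _dakuten;
};

// ha-gyō Kana also have a hanDakuten form, i.e., は → ば and ぱ
class HanDakutenKana : public DakutenKana {
public:
  HanDakutenKana(std::string r, std::string h, std::string k, std::string dr,
                 std::string dh, std::string dk, std::string hr,
                 std::string hh, std::string hk)
      : DakutenKana{std::move(r), std::move(h), std::move(k), std::move(dr),
                    std::move(dh), std::move(dk)},
        _hanDakuten{std::move(hr), std::move(hh), std::move(hk), this} {}
  const Kana* hanDakuten() const override { return &_hanDakuten; }
private:
  AccentedKana _hanDakuten;
};
```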
To support the kanjiStats and kanjiQuiz programs, the KanjiData class loads and breaks down Kanji into the following categories:
- Jouyou: 2136 official Jōyō (常用) Kanji
- Jinmei: 633 official Jinmeiyō (人名用) Kanji
- LinkedJinmei: 230 more Jinmei Kanji that are old/variant forms of Jōyō (212) or Jinmei (18)
- LinkedOld: 213 old/variant Jōyō Kanji that aren't in 'Linked Jinmei'
- Frequency: Kanji that are in the top 2501 frequency list, but not one of the first 4 types
- Extra: Kanji loaded from 'extra.txt' - shouldn't be in any of the above types
- Kentei: Kanji loaded from 'kentei/*' - Kanji Kentei (漢字検定) that aren't any of the above types
- Ucd: Kanji that are in 'ucd.txt', but not already one of the above types
- None: Kanji that haven't been loaded from any files
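These categories map naturally to a scoped enum. Here's a sketch (the real type lives in the kanji lib, so the name and layout below are assumptions for illustration):

```cpp
enum class KanjiTypes {
  Jouyou,       // 2136 official Jōyō Kanji
  Jinmei,       // 633 official Jinmeiyō Kanji
  LinkedJinmei, // old/variant forms of Jōyō (212) or Jinmei (18)
  LinkedOld,    // old/variant Jōyō Kanji not in LinkedJinmei
  Frequency,    // in the top 2501 frequency list, not in the above types
  Extra,        // loaded from 'extra.txt'
  Kentei,       // Kanji Kentei Kanji not in any of the above types
  Ucd,          // in 'ucd.txt', but not in any of the above types
  None          // not loaded from any files
};
```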
The following diagram shows the Kanji class hierarchy (8 classes are concrete). Most of the public methods are included, but the types are simplified for the diagram, i.e., `std::optional<std::string>` is shown as `Optional<String>`, `std::vector<std::string>` is shown as `List<String>`, etc..
Note that JLPT level lists have not been official since 2010. Also, each level file only contains uniquely new Kanji for the level (as opposed to some N2 and N1 lists on the web that repeat some Kanji from earlier levels). The levels have the following number of Kanji:
- N5: 103 -- all Jōyō
- N4: 181 -- all Jōyō
- N3: 361 -- all Jōyō
- N2: 415 -- all Jōyō
- N1: 1162 -- 911 Jōyō, 251 Jinmeiyō
All Kanji in levels N5 to N2 are in the Top 2501 frequency list, but N1 contains 25 Jōyō and 83 Jinmeiyō Kanji that are not in the Top 2501 frequency list.
Kyōiku (教育) Kanji grades are included in the Jōyō list. Here is a breakdown of the count per grade as well as how many per JLPT level per grade (None means not included in any of the JLPT levels):
Grade | Total | N5 | N4 | N3 | N2 | N1 | None |
---|---|---|---|---|---|---|---|
1 | 80 | 57 | 15 | 8 | | | |
2 | 160 | 43 | 74 | 43 | | | |
3 | 200 | 3 | 67 | 130 | | | |
4 | 200 | | 20 | 180 | | | |
5 | 185 | | 2 | | 149 | 34 | |
6 | 181 | | 3 | | 105 | 73 | |
S | 1130 | | | | 161 | 804 | 165 |
Total | 2136 | 103 | 181 | 361 | 415 | 911 | 165 |
The total for all grades is the same as the total Jōyō (2136) and all are in the Top 2501 frequency list except for 99 S (Secondary School) Kanji.
The program also loads the 214 official Kanji radicals (部首).
The data directory contains the following files:
- jouyou.txt: loaded from here - note, the radicals in this list reflect the original radicals from Kāngxī Zìdiǎn / 康煕字典(こうきじてん) so a few characters have the radicals of their old form, i.e., 円 has radical 口 (from the old form 圓).
- jinmei.txt: loaded from here and most of the readings from here
- linked-jinmei.txt: loaded from here
- frequency.txt: top 2501 frequency Kanji loaded from KanjiCards
- extra.txt: holds details for 'extra Kanji of interest' not already in the above four files
- ucd.txt: data extracted from Unicode 'UCD' (see parseUcdAllFlat.sh for details and links)
- frequency-readings.txt: holds readings of some Top Frequency Kanji that aren't in Jouyou or Jinmei lists
- radicals.txt: loaded from here
- jlpt/n[1-5].txt: loaded from various sites such as FreeTag and JLPT Study.
- kentei/k*.txt: loaded from here
- jukugo/*.txt: loaded from here
- meaning-groups.txt: meant to hold groups of Kanji with related meanings (see Group.h for more details) - some ideas came from here
- pattern-groups.txt: meant to hold groups of Kanji with related patterns (see Group.h for more details)
No external databases are used so far, but while writing some of the code (like in UnicodeBlock.h for example), the following links were very useful: Unicode Official Site - Charts and Compat.
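For reference, checks like the ones in UnicodeBlock.h can be written as simple range tests against standard Unicode block boundaries. Here's an illustrative sketch (not the actual UnicodeBlock.h code):

```cpp
#include <iostream>

// standard Unicode block ranges for the main scripts this project handles
constexpr bool isHiragana(char32_t c) { return c >= 0x3040 && c <= 0x309F; }
constexpr bool isKatakana(char32_t c) { return c >= 0x30A0 && c <= 0x30FF; }
constexpr bool isCommonKanji(char32_t c) {
  return c >= 0x4E00 && c <= 0x9FFF; // CJK Unified Ideographs block
}

int main() {
  static_assert(isCommonKanji(U'龍')); // U+9F8D
  static_assert(isHiragana(U'か') && isKatakana(U'カ'));
  std::cout << "block checks pass\n";
}
```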
The following 'strokes' related files used to be in the data directory, but strokes are now loaded from ucd.txt and used for all Kanji types except Jouyou and Extra (their files have a Strokes column). Ucd data has some unexpected stroke counts here and there (see parseUcdAllFlat.sh for a more detailed explanation), but so did the files below:
- strokes.txt: loaded from here - covers Jinmeiyō Kanji and some old forms.
- wiki-strokes.txt: loaded from here - mainly Jōyō, but also includes a few 'Frequency' type Kanji.
The kanjiQuiz program supports running various types of quizzes (in review or test mode) as well as looking up details of a Kanji from the command-line. If no options are provided then the user is prompted for mode, quiz type, etc., or command-line options can be used to jump directly to the desired type of quiz or Kanji lookup. The following is the output from the `-h` (help) option:
```
kanjiQuiz [-hs] [-f[1-5] | -g[1-6s] | -k[1-9a-c] | -l[1-5] -m[1-4] | -p[1-4]]
          [-r[num] | -t[num]] [kanji]
-h show this help message for command-line options
-s show English meanings by default (can be toggled on/off later)
The following options allow choosing the quiz/review type optionally followed
by question list type (grade, level, etc.) instead of being prompted:
-f 'frequency' (optional frequency group '0-9')
-g 'grade' (optional grade '1-6', 's' = Secondary School)
-k 'kyu' (optional Kentei Kyu '1-9', 'a' = 10, 'b' = 準1級, 'c' = 準2級)
-l 'level' (optional JLPT level number '1-5')
-m 'meaning' (optional Kanji type '1-4')
-p 'pattern' (optional Kanji type '1-4')
The following options can be followed by a 'num' to specify where to start in
the question list (use negative to start from the end or 0 for random order).
-r review mode
-t test mode
kanji show details for a Kanji instead of starting a review or test
Examples:
kanjiQuiz -f # start 'frequency' quiz (prompts for 'bucket' number)
kanjiQuiz -r40 -l1 # start 'JLPT N1' review beginning at the 40th entry
Note: 'kanji' can be UTF-8, frequency (between 1 and 2501), 'm' followed by
Morohashi ID (index in Dai Kan-Wa Jiten), 'n' followed by Classic Nelson ID
or 'u' followed by Unicode. For example, these all produce the same output:
kanjiQuiz 奉
kanjiQuiz 1624
kanjiQuiz m5894
kanjiQuiz n212
kanjiQuiz u5949
```
When using the quiz program to look up a Kanji, the output includes a brief legend followed by some details such as Radical, Strokes, Pinyin, Frequency, Old or New variants, Meaning, Reading, etc.. The Similar list comes from the pattern-groups.txt file and (the very ad-hoc) Category comes from the meaning-groups.txt file. Morohashi and Nelson IDs are shown if they exist as well as any Jukugo examples loaded from data/jukugo files (there are only about 18K Jukugo entries so these lists are pretty limited).
```
~/cdev/kanji-tools $ ./build/apps/kanjiQuiz 龍
>>> Legend:
Fields: N[1-5]=JLPT Level, K[1-10]=Kentei Kyu, G[1-6]=Grade (S=Secondary School)
Suffix: .=常用 '=JLPT "=Freq ^=人名用 ~=LinkJ %=LinkO +=Extra @=検定 #=1級 *=Ucd
Showing details for 龍 [9F8D], Block CJK, Version 1.1, LinkedJinmei
Rad 龍(212), Strokes 16, lóng, Frq 1734, New 竜*
Meaning: dragon
Reading: リュウ、たつ
Similar: 襲. 籠. 寵^ 瀧~ 朧+ 聾@ 壟# 蘢# 隴# 瓏#
Morohashi: 48818
Nelson IDs: 3351 5440
Category: [動物:爬虫類]
Jukugo: 龍頭蛇尾(りゅうとうだび) 烏龍茶(うーろんちゃ) 画龍点睛(がりょうてんせい)
```
Here are some runtime memory and (statically linked) file sizes for kanjiQuiz. Stats are more relevant for the quiz program compared to the others since it loads more Kanji related data including groups and jukugo. Sanitize stats are only available for Clang (this is the default debug setup when building the project) - they cause a lot more runtime memory to be used.
Runtime memory usage:

Compiler | Debug Sanitize | Debug | Release |
---|---|---|---|
Clang | 124.4 MB | 24.4 MB | 24.7 MB |
GCC | | 34.3 MB | 33.8 MB |

Statically linked file sizes:

Compiler | Debug Sanitize | Debug | Release |
---|---|---|---|
Clang | 14 MB | 10 MB | 883 KB |
GCC | | 4.6 MB | 1.2 KB |
The kanjiStats program takes a list of one or more files (or directories) and outputs a summary of counts of various types of multi-byte characters. More detailed information can also be shown depending on command-line options. In order to get more accurate stats about percentages of Kanji, Hiragana and Katakana, the program attempts to strip away all Furigana before counting.
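Since furigana in the sample files appears in wide brackets right after the Kanji sequence it annotates (see the Aozora section below), the stripping step amounts to removing those bracketed readings before counting. Here's a minimal sketch of the idea, assuming 《…》 style brackets (the real logic handles more cases):

```cpp
#include <iostream>
#include <regex>
#include <string>

int main() {
  // non-greedy '.*?' works on the raw UTF-8 bytes between the brackets
  const std::regex furigana{"《.*?》"};
  const std::string line{"漢字《かんじ》の勉強"};
  std::cout << std::regex_replace(line, furigana, "") << '\n';
  // prints: 漢字の勉強
}
```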
Here is the output from processing a set of files containing lyrics for 中島みゆき (Miyuki Nakajima) songs:
```
~/cdev/kanji-tools $ ./build/apps/kanjiStats ~/songs
>>> Stats for: 'songs' (634 files from 62 directories) - showing top 5 Kanji per type
>>> Furigana Removed: 436, Combining Marks Replaced: 253, Variation Selectors: 0
>>> Hiragana: 146379, unique: 77
>>> Katakana: 9315, unique: 79
>>> Common Kanji: 52406, unique: 1642, 100.00%
>>> [Jouyou] : 50804, unique: 1398, 96.94% (人 1440, 私 836, 日 785, 見 750, 何 626)
>>> [Jinmei] : 986, unique: 114, 1.88% (逢 95, 叶 68, 淋 56, 此 44, 遥 42)
>>> [LinkedJinmei] : 36, unique: 7, 0.07% (駈 13, 龍 10, 遙 5, 凛 3, 國 2)
>>> [Frequency] : 203, unique: 15, 0.39% (嘘 112, 叩 15, 呑 15, 頬 12, 叱 11)
>>> [Extra] : 377, unique: 108, 0.72% (怯 29, 騙 21, 囁 19, 繋 19, 禿 16)
>>> MB-Punctuation: 946, unique: 12
>>> MB-Symbol: 13, unique: 2
>>> MB-Letter: 1429, unique: 54
>>> Total Kana+Kanji: 208100 (Hiragana: 70.3%, Katakana: 4.5%, Kanji: 25.2%)
```
There is also a tests/stats/sample-data directory that contains files used for testing. The wiki-articles directory contains text from several wiki pages and the books directory contains text from books found on 青空文庫 (Aozora Bunko) (with furigana preserved in wide brackets).
The books pulled from Aozora were in Shift JIS format so the following steps were used on macOS to convert them to UTF-8:
- Load the HTML version of the book in Safari
- Select All, then Copy-Paste to Notes - this keeps the furigana, but puts it on a separate line
- Open file1 in Terminal using vi and paste in the text from Notes, then save and exit.
  - Copying straight from the browser to vi puts the furigana immediately after the Kanji (with no space, brackets, newline, etc.) which makes it pretty much impossible to 'regex' it out when producing stats (and difficult to read as well).
  - Extremely rare Kanji that are just embedded images in the HTML (instead of real Shift JIS values) do show up in Notes, but of course they don't end up getting pasted into the plain text file in vi. These need to be entered by hand (by choosing the closest Unicode equivalent).
  - MS Word also captures the furigana from the HTML, but it ends up being above unrelated text. When pasting to vi the furigana is put in standard brackets, but in incorrect locations which makes it useless (but at least it can be easily removed, which is better than the straight-to-vi option). However, a more serious problem is that MS Word (macOS version 2019) also seemed to randomly drop parts of the text (maybe an encoding conversion issue?) which was a showstopper.
- Run the kanjiFormat program (from build/apps) on file1 and redirect the output to file2
- file2 should now have properly formatted furigana in wide brackets following the Kanji sequence on the same line.
- Run `fold file2 > file1` to split up the really long lines to 80 columns.
Below are some bash commands that were used while creating this project:
```bash
# re-order columns
awk -F'[\t]' -v OFS="\t" '{print $1,$2,$4,$5,$3,$6,$7,$8,$9}' file

# re-number file assuming first column should be a number column starting at 1
awk -F'[\t]' -v OFS="\t" 'NR==1{print}NR>1{for(i=1;i<=NF;i++) printf "%s",(i>1 ? OFS $i : NR-1);print ""}' file

# convert wide numbers to normal single byte numbers (and delete a character)
cat file|tr '１２３４５６７８９０' '1234567890'|tr -d '画'

# add newline to end of a file if missing (skip build, out and .git dirs)
find . -not \( -name build -prune \) -not \( -name .git -prune \) -not \( -name out -prune \) -type f | while read f; do tail -n1 $f | read -r _ || echo >> $f; done
```