Clarifying readme about Chinese space tokenization
jacobdevlin-google committed Nov 24, 2018
1 parent 332a687 commit a9ba4b8
1 changed file: multilingual.md (2 additions, 2 deletions)
@@ -176,8 +176,8 @@
 weighted the same way as the data, so low-resource languages are upweighted by
 some factor. We intentionally do *not* use any marker to denote the input
 language (so that zero-shot training can work).
 
-Because Chinese does not have whitespace characters, we add spaces around every
-character in the
+Because Chinese (and Japanese Kanji and Korean Hanja) does not have whitespace
+characters, we add spaces around every character in the
 [CJK Unicode range](https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_\(Unicode_block\))
 before applying WordPiece. This means that Chinese is effectively
 character-tokenized. Note that the CJK Unicode block only includes
