Render non-BMP CJKV characters locally #4550

1ec5 · 2024-08-13T08:56:46Z

Generalized glyph management and shaping code to operate on UTF-16 characters instead of code units, no longer splitting surrogate pairs. Enabled local glyph rendering for additional ideographic code blocks in the Supplementary Ideographic Plane (SIP) and Tertiary Ideographic Plane (TIP) for improved Chinese, Japanese, Korean, and Vietnamese (CJKV) character coverage.

Enabled local glyph rendering in some additional CJKV code blocks in the SIP and TIP, as well as in Tangut, which is thrash-prone like CJKV. Disabled letter spacing for more cursive scripts.

This is a step towards #2307, which tracks support for characters outside the Basic Multilingual Plane more generally, not just for CJKV. This PR depends on #4560.

Demonstration

Before	After
那🌫️市镇	那𧯄市镇
丐🌫️市镇	丐𦨭市镇
拜🌫️社	拜𩡋社

Launch Checklist

Confirm your changes do not include backports from Mapbox projects (unless with compliant license) - if you are not sure about this, please ask!
Briefly describe the changes in this PR.
Link to related issues.
Include before/after visuals or gifs if this PR includes visual changes.
Write tests for all new functionality.
Document any changes to public APIs.
Post benchmark scores.
Add an entry to CHANGELOG.md under the ## main section.

codecov-commenter · 2024-08-13T09:06:30Z

Codecov Report

Attention: Patch coverage is 56.81818% with 76 lines in your changes missing coverage. Please review.

Project coverage is 88.00%. Comparing base (42d6847) to head (acdb77d).
Report is 11 commits behind head on main.

Files	Patch %	Lines
src/util/script_detection.ts	4.44%	2 Missing and 41 partials ⚠️
src/util/is_char_in_unicode_block.ts	70.00%	0 Missing and 15 partials ⚠️
src/symbol/shaping.ts	75.86%	12 Missing and 2 partials ⚠️
src/data/bucket/symbol_bucket.ts	50.00%	2 Missing ⚠️
src/util/verticalize_punctuation.ts	86.66%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4550      +/-   ##
==========================================
+ Coverage   87.74%   88.00%   +0.26%     
==========================================
  Files         247      247              
  Lines       33494    33600     +106     
  Branches     2414     2430      +16     
==========================================
+ Hits        29390    29571     +181     
+ Misses       3118     3026      -92     
- Partials      986     1003      +17

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

src/symbol/shaping.ts

src/render/glyph_manager.ts

src/util/script_detection.ts

HarelM · 2024-08-13T09:11:47Z

test/integration/render/tests/text-local-ideographs/cjk/style.json

@@ -21,16 +21,18 @@
                "properties": {
                    "name_en": "abcde",
                    "name_ja": "あいうえお",
-                    "name_zh": "阿衣乌唉哦",
-                    "name_ko": "아이우"
+                    "name_zh": "阿衣乌唉哦𪚥",


Is this the only render test that cover these changes? Do we need other tests?

src/util/verticalize_punctuation.ts

HarelM · 2024-08-13T09:19:55Z

src/util/verticalize_punctuation.ts

        }
+
+        prevChar = {value: char.value, premature: false};


I'm not sure I understand the permutate flag here.
This seems like a change of logic from the original method, so I would consider adding some unit tests here to make sure we get the right results.

prevChar is analogous to the old prevCharCode, but it parallels what the iterators return. This is using premature to flag a not-yet-set state, similar to done in the iterators’ return values. I agree that some more vertical layout tests are needed to exercise these changes.

I think unit tests can cover the logic in this method, and a render test or two can help here as well to keep parity with native and make sure this is working as expected.

HarelM · 2024-08-13T09:26:44Z

You can download the html report from the CI and commit the image that you download from the CI so that the CI will pass, assuming the image is correct.

test/integration/render/tests/text-local-ideographs/cjk/expected-mac.png

HarelM · 2024-08-13T09:38:41Z

CC: @louwers there are expected changes in parity here maybe, just FYI.

src/render/glyph_manager.ts

HarelM · 2024-08-13T09:42:52Z

src/symbol/shaping.ts

@@ -75,7 +75,7 @@ export type TextJustify = 'left' | 'center' | 'right';
 const PUAbegin = 0xE000;
 const PUAend = 0xF8FF;

-class SectionOptions {
+export class SectionOptions {


Do we really need to export those classes? Can we test the content of this file without exporting those?

The new tests need to refer to these classes. Would you suggest a helper function that encapsulates the creation of a TaggedString, but only for testing purposes?

maplibre-gl-js/src/symbol/shaping.test.ts

Lines 17 to 19 in 249b360

const tagged = new TaggedString();

tagged.text = ' \t \v ';

tagged.sections = [new SectionOptions()];

No, as far as I understand (which is not a lot), these classes were used inside this file by other exported classes or methods.
If this is the case, I think it might be better to test the exported classes/methods in the unit test in a flow that reaches these non exported classes.
If this is not possible, I would then prefer to have these classes moved to their own file and have unit test to these new files separately. In general, I like it better when I have a single file for a single class, but that might just be me...

I think we’ll have to split the classes into separate files. The other exported functions in this file do so much that it would be difficult to isolate the most important behavior that these tests are intended to exercise.

HarelM · 2024-08-13T09:45:48Z

BTW, is this a breaking change in terms of behavior of the map/text/fonts?

wipfli · 2024-08-13T09:46:57Z

BTW, is this a breaking change in terms of behavior of the map/text/fonts?

No

1ec5 · 2024-08-13T16:53:35Z

src/util/is_char_in_unicode_block.ts

+    'Supplementary Private Use Area-A': (char) => char >= 0xF0000 && char <= 0xFFFFF,
+    'Supplementary Private Use Area-B': (char) => char >= 0x100000 && char <= 0x10FFFF


@wipfli, if I understand this article correctly, your solution for Devanagari text shaping uses Private Use Area codepoints but has to steer clear of image placeholders that are also assigned to this block. With the changes in this PR, you could probably use the much more spacious Supplementary Private Use Areas instead. Most CJKV fonts informally use these blocks for characters that haven’t been officially encoded yet, but I wouldn’t expect vector tiles or GeoJSON to take advantage of these characters due to interoperability issues. You’d still be constrained by SymbolBucket.MAX_GLYPHS, but at least you could pick a different plane without worrying about overlapping with unrelated usage before you reach that limit.

That's correct, your pull request would be very useful for positioned glyph fonts!

HarelM · 2024-08-15T07:18:58Z

I've fixed it in the global branch, it's not that the size had tripled but only that you added 2k, which is fine, just increase the expected size so that the test will pass...

src/render/glyph_manager.ts

wipfli · 2024-08-15T08:17:37Z

Just to double-check, the MVT spec does support UTF-16?

1ec5 · 2024-08-15T08:44:25Z

Just to double-check, the MVT spec does support UTF-16?

Depends what you mean by “support”, but what we’re doing here is compatible with the MVT specification. As of MVT v2.1, a feature property value is a string, which the protocol buffer format defines as UTF-8–encoded text. Both UTF-8 and UTF-16 are variable-width encodings. The Chinese character 𪚥 (U+2A6A5) is encoded in UTF-8 as a sequence of four code units: F0 AA 9A A5. JavaScript decodes this sequence as a surrogate pair of two UTF-16 code units: D869 DEA5. Either way, it’s still a single character. Previously, we were incorrectly splitting the character in half, effectively assuming the obsolete UCS-2 encoding. Now we group the code units together by Unicode character, as fonts and users would expect. None of this is incompatible with UTF-8.

Although this is an improvement, it isn’t quite perfect: some characters and emoji require variation selectors and other control characters, so that what the user perceives as a grapheme consists of several UTF-16 characters. This PR doesn’t address that issue, because GL JS still pervasively assumes that one codepoint equals one glyph. If someday we eliminate that assumption, we can write custom code or use a Unicode set regular expression to match those sequences. (Unicode sets are very new, introduced to Firefox about a year ago.)

1ec5 · 2024-08-15T09:59:10Z

src/render/glyph_manager.ts

-                entry.ranges[range] = true;
-                return {stack, id, glyph: null};
-            } else {
-                throw new Error('glyphs > 65535 not supported');


I removed this error at @wipfli’s request in #4550 (comment), but unless we solve #4563, this will be a breaking change and a difficult one for developers to cope with.

Tricky situation indeed.

#4564 provides a fallback to local glyph rendering, so it won’t be a breaking change after all. The behavior is essentially what I floated in #4550 (comment), except that server-side glyphs take precedence over client-side glyphs for non-CJKV codepoints.

HarelM · 2024-08-16T19:48:44Z

src/util/script_detection.ts

    if (isChar['CJK Compatibility'](char)) return true;
+    if (isChar['CJK Compatibility Forms'](char)) return true;


This still looks like we are repeating the process of uncommenting a line in unicodeBlockLookup and then adding it as another "if" row here.
Is it possible to change this in one place instead of two?

Do you mean that we should inline the lookup functions wherever we use them instead of keeping them in a separate lookup table? We might as well eliminate that file in favor of inline regular expressions like this:

if (/* CJK Compatibility Forms */ /[\uFE30-\uFE4F]/.test(String.fromCodePoint(char))) return true;

I think the separate lookup table was created originally out of concern that these character classes would be less readable or more error-prone. I don’t have a strong preference, but then again, I have a higher tolerance for regular expressions than most developers.

It's not about regular expression (I don't like them much, they are not readable when things get complicated), but more about using the same hard coded string in two places.
I don't mind keeping the original lookup table, but then why do we pass the same hardcoded string here again? Wht not use something like Object.keys(unicodeBlockLookup)...?

None of the functions that use unicodeBlockLookup should ever iterate over all the keys. Each function only selectively accesses the lookup functions relevant to the task at hand. As the table in #4560 (comment) illustrates, it’s a big Venn diagram; every function cares about a slightly different set of Unicode blocks. I tried to unify the functions somewhat, but inherently we need to list specific blocks in each function in order to have the correct behavior.

These strings are the official Unicode block names, not something we invented here, and the ISO standard basically guarantees that they won’t change. If the problem is that these strings add to the bundle size, we could CamelCase them (as in maplibre-native) and use the property syntax so that maybe the property names will get minified at compile time:

if (isChar.CJKCompatibilityForms(char)) return true;

No, I don't care about bundle size in this respect.
What I would like to achieve is the need to change in multiple places.
i.e. add all the needed information to the lookup table so that uncommenting a line and setting the right properties would be the only change need to support another script.

Just to be clear, the only reason we’d ever modify this file is that the Unicode standard itself has changed (to add another script). If this PR works as designed, then we’d have comprehensive support for the latest Unicode standard, and we shouldn’t have to manually add new scripts in general, though that is sort of what’s going on in charInSupportedScript, which I haven’t touched.

I think the above suggestion is better in terms that in encapsulates all the relevant properties to a single place and later on you won't need to worry about them, hopefully, or you could create another method to have a different "query" on this object.
You can also define the type of every element so that it adheres to the relevant model.
I find it more maintainable, but it might just be me, IDK...

I haven’t abandoned this PR. 🙂 In 1ec5#1, I’m prototyping, among other things, code generation for anything we’re doing based on the Unicode Character Database that isn’t directly supported by ECMAScript regular expressions. It has the potential to reduce maintenance overhead for script detection, though I suspect we’ll still need to roll a little bit of our own code for some remaining typographical functionality that isn’t expressed in the Unicode standard. Once the branch is in better shape (testing, performance), I’ll pick it apart and post formal PRs to this repository.

The bespoke build script I wrote in 1ec5/maplibre-gl-js@be6d7fe is simple and effective for this purpose, but I wonder if you’d prefer to pull in @unicode/unicode-15.1.0 via regenerate as a devDependency instead. It’s a big package to pull in at install time, but unlike my script, it won’t require fetching anything from the network. Either way, it’ll have negligible impact on the build, just a big regular expression that we fortunately wouldn’t have to maintain by hand.

Dev dependency is better I think.
Make sure to add the generated file to git ignore and add a .g.ts file extension to the generated file to indicate that it's generated.

1ec5 · 2024-08-16T23:47:12Z

If this PR lands before maplibre/maplibre-style-spec#778 is fixed, there will be regressions in some styles even for some features that only involve BMP characters.

HarelM · 2024-08-17T11:02:45Z

Style spec has a version of its own, so fixing it there first will be the right approach, after that we can merge this PR only with the upgrade of style-spec to the required version. Bottom line, there's no problem syncing it properly, we just need to make sure it is solved and applied correctly.

Generalized glyph management, shaping code to operate on UTF-16 characters instead of code units, no longer splitting surrogate pairs. Enabled local glyph rendering for additional ideographic code blocks in the Supplementary Ideographic Plane and Tertiary Ideographic Plane for improved Chinese, Japanese, Korean, and Vietnamese character coverage.

Enabled local glyph rendering in some additional CJKV code blocks, as well as in Tangut, which is thrash-prone like CJKV. Disabled letter spacing for more cursive scripts.

Removed the error that artificially prevented the glyph manager from requesting glyph PBFs for ranges beyond U+FFFF.

HarelM · 2024-09-06T15:05:09Z

@1ec5 version 20.3.1 of the style spec was merged to this repo.
Does it require any changes in this repo or this is more of a bug-fix or small improvement?
Is it related to this PR?
I would like to release a new version today-tomorrow but I don't know if I should revert the spec package update or not.
Please let me know.

1ec5 · 2024-09-06T15:14:07Z

The risk I mentioned before was only if we landed this PR before maplibre/maplibre-style-spec#779, but the other way around should be safe. I don’t think anything will break if you release GL JS with just the upgrade to the style-spec dependency, so there’s no need to revert. This PR needs some additional work on render tests, plus the dependency change in #4550 (comment) that I’ll be staging shortly in a separate PR.

HarelM · 2024-09-06T20:13:34Z

Thanks for the info! I've published a new version - 4.7.

1ec5 force-pushed the astral-cjk branch from b859c61 to bfbb18e Compare August 13, 2024 09:00