
Render complex text, variant forms, emoji, etc. #1

Draft
wants to merge 18 commits into
base: astral-cjk

Owner

@1ec5 1ec5 commented Aug 19, 2024

This branch adds experimental support for rendering text in Indic and other complex scripts, variant character forms, and combining diacritics. As a side effect, some emoji sequences now appear as single glyphs, though only as silhouettes.

The names of Bengaluru in various Indian languages curved around two concentric circles and of Naypyidaw in Burmese in another concentric circle. Indic text is properly shaped without any dangling or misplaced diacritical marks. In the corner, a few rows of miscellaneous CJK text, including a rare kanji. Below it, a label that contains an Egyptian hieroglyph of an elephant, a flag emoji sequence, wheelchair emoji, and woman in wheelchair emoji. The emoji are only silhouettes rather than in color. Below that, the name of Kolkata in Hindi with a gloss in English and the name of Sri Lanka in Sinhala. The syllable “sri” appears as a single ligature. To the left, the name of the Democratic Republic of Congo in Maldivian and some text in Central Kurdish and Arabic. Maldivian diacritics appear over the correct letters. Collision boxes are enabled.

Text segmentation

Currently, text-field strings are segmented by UTF-16 code units. maplibre#4550 refactors various text processing classes to segment by full Unicode characters (code points) instead, expanding text rendering support to the rest of the Unicode character repertoire. However, in many common situations, a single Unicode character in isolation cannot adequately represent a grapheme cluster that the user perceives as a single glyph. This branch segments strings by grapheme cluster for rendering purposes. Additionally, it refactors the glyph atlas to index glyph data by grapheme cluster strings, whereas currently it is indexed by codepoints. (Actually, the codepoints are converted to numbers and then stringified for maximum inefficiency, apparently for consistency with the glyph PBF format.)

Segmenting strings by grapheme cluster requires the Intl.Segmenter API, which is very new. Firefox was the last major holdout, adding support for this API only a few months ago. For older browsers, it will be necessary to fall back to segmenting by Unicode character. Grapheme cluster segmentation is not a panacea: maplibre/maplibre-native#778 (reply in thread) discusses some limitations around cursive scripts. Unless a workaround can be found, the mapbox-gl-rtl-text plugin will probably continue to be required for Arabic typesetting.
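As a rough illustration of the feature detection and fallback described above, here is a minimal sketch. It is not the branch's actual code; the function names segmentByCodePoint and segmentGraphemes are hypothetical, and the cast to any is only there so the sketch compiles without the newer Intl typings.

```typescript
// Hypothetical helper names for illustration; not taken from the branch.

/** Fallback: split by Unicode code point, so combining marks become separate entries. */
function segmentByCodePoint(text: string): string[] {
    return Array.from(text);
}

/** Split by grapheme cluster when Intl.Segmenter exists, otherwise by code point. */
function segmentGraphemes(text: string, locale?: string): string[] {
    // Cast to any so this compiles even when the Intl.Segmenter typings are unavailable.
    const SegmenterCtor = typeof Intl !== 'undefined' ? (Intl as any).Segmenter : undefined;
    if (typeof SegmenterCtor !== 'function') {
        return segmentByCodePoint(text);
    }
    const segmenter = new SegmenterCtor(locale, {granularity: 'grapheme'});
    return Array.from(segmenter.segment(text), (part: {segment: string}) => part.segment);
}
```

For the Hindi name बेंगलुरु, the grapheme path should keep each vowel sign and the anusvara attached to its consonant, whereas the code point fallback separates all eight characters.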

The segmenter understands emoji sequences, including sequences that include zero-width joiners. However, it is only possible to render the emoji’s silhouette for the time being, because the glyph atlas only stores the alpha channel of the glyph image. Storing each of the color channels would enable the shader to draw color emoji, as demonstrated in 1ec5/tiny-sdf#1, but it would need to be limited to detected emoji sequences to avoid largely frivolous overhead.
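One way to limit color rendering to emoji might be a per-cluster check along these lines. This is a heuristic sketch under my own assumptions, not what the branch or 1ec5/tiny-sdf#1 does; isLikelyColorEmoji is a hypothetical name, and the pattern misses keycap sequences such as a digit followed by U+FE0F U+20E3.

```typescript
// Built with the RegExp constructor so the Unicode property escapes are parsed at run time.
// Matches a code point with default emoji presentation, or a pictograph
// explicitly followed by the emoji variation selector U+FE0F.
const EMOJI_PATTERN = new RegExp('\\p{Emoji_Presentation}|\\p{Extended_Pictographic}\\uFE0F', 'u');

/** Hypothetical helper: guesses whether a grapheme cluster would normally be drawn in color. */
function isLikelyColorEmoji(cluster: string): boolean {
    return EMOJI_PATTERN.test(cluster);
}
```

A check like this would let the atlas keep a single alpha channel for ordinary text, so that the extra color channels are only paid for by clusters that pass the test.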

The custom word breaking heuristics for determining line breaking opportunities have been replaced by a word segmenter. This introduces word wrapping for the first time to writing systems such as Thai and Khmer that don’t put spaces between words. It also keeps Hanzi/hanja/kanji compound words together based on the browser’s built-in compound dictionary. This obviates the server-side workaround that Mapbox introduced in mapbox/mapbox-gl-js#8255 (before the fork). If a tileset has inserted zero-width spaces between compounds as hints, as the Mapbox Streets source does, the word segmenter will continue to honor those hints as a matter of course, but this functionality now comes “for free” without any developer intervention.
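To show what a word segmenter provides, the following sketch lists the UTF-16 offsets at which word-like segments begin. It assumes Intl.Segmenter is available and returns an empty list otherwise; wordStartOffsets is a hypothetical name and the branch's line breaking code is certainly more involved.

```typescript
type WordPart = {segment: string; index: number; isWordLike?: boolean};

/** Hypothetical helper: offsets where a word-like segment starts, per the built-in word segmenter. */
function wordStartOffsets(text: string, locale?: string): number[] {
    // Cast to any so this compiles even when the Intl.Segmenter typings are unavailable.
    const SegmenterCtor = typeof Intl !== 'undefined' ? (Intl as any).Segmenter : undefined;
    if (typeof SegmenterCtor !== 'function') {
        return [];
    }
    const segmenter = new SegmenterCtor(locale, {granularity: 'word'});
    const parts: WordPart[] = Array.from(segmenter.segment(text));
    // Skip whitespace and punctuation segments; keep only the starts of actual words.
    return parts.filter((part) => part.isWordLike).map((part) => part.index);
}
```

For space-delimited text the offsets simply follow the spaces. For Thai, Khmer, or CJK strings the result depends on the dictionary shipped with the browser, so it can vary between engines and versions.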

Local text rendering

As TinySDF is my hammer, everything looks like a nail. This branch completely disables the glyph PBF pipeline for remotely rendered glyphs in favor of rendering every grapheme cluster locally through TinySDF, making maplibre#4564 unnecessary. The changes obviate much of the original reason that Mapbox created a Fontstack API and defined the glyph PBF format.

The expanded use of TinySDF creates a need for more granular control over font selection beyond the single font specifier for local “ideographic” text. The developer can already set the font specifier to the name of a Web font defined in the surrounding webpage’s stylesheet. However, this option is currently global; the font choice should come instead from text-font, which is no longer used for remote glyph rendering. An event handler will need to be added to call Map.prototype.redraw once any Web fonts are done loading.
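The event handler mentioned above could look roughly like this sketch. The two interfaces are narrowed stand-ins that I am assuming for illustration: in a browser, document.fonts (a FontFaceSet) fires loadingdone, and the map object exposes redraw. Whether a redraw alone is enough to refresh glyphs that were already rasterized with a fallback font is something I have not verified.

```typescript
/** Minimal stand-in for the parts of FontFaceSet used here. */
interface FontSetLike {
    addEventListener(type: 'loadingdone', listener: () => void): void;
}

/** Minimal stand-in for the parts of the map used here. */
interface RedrawableMap {
    redraw(): unknown;
}

/** Hypothetical helper: redraw the map each time a batch of Web fonts finishes loading. */
function redrawOnFontLoad(fonts: FontSetLike, map: RedrawableMap): void {
    fonts.addEventListener('loadingdone', () => {
        map.redraw();
    });
}

// In a browser this might be wired up as: redrawOnFontLoad(document.fonts, map);
```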

Vertical alignment has often been cited as a downside of TinySDF, but it’s actually the glyph PBFs that are to blame. Glyph PBFs don’t encode enough glyph metrics to reliably align glyphs to a common baseline, so there were hard-coded fudge factors in a few different places in the codebase to vertically shift locally rendered glyphs to match remotely rendered glyphs. These fudge factors assume the metrics of Arial Unicode MS, which is the default text-font fallback but is also an outlier for its line height, even among pan-Unicode fonts. With the removal of the glyph PBF mechanism, it becomes possible to remove these fudge factors.

In theory, it should be possible to use TinySDF only for grapheme clusters that can’t be represented by the glyph PBF format. However, that would yield extremely inconsistent visual results, because most scripts that contain nontrivial grapheme clusters also have plenty of unclustered graphemes sprinkled throughout in ordinary text. If backwards compatibility with older styles is a concern, I believe it would be better to render grapheme clusters locally in general and at most render Latin, Cyrillic, and Greek from glyph PBFs as an exception.

The glyph PBF mechanism has been primarily of use to Western languages. Most published fontstacks consist of one or two specially chosen Western fonts combined with a crude, pan-Unicode font as an afterthought to serve as a fallback for the rest of the world’s languages. While the glyph PBF format has the advantage of not requiring embedding and redistribution rights from the font designer, the most popular fonts for non-Western languages are generally open-source fonts that can be served up as Web fonts and rendered through TinySDF without any legal obstacles.

Prior art

maplibre#2458 similarly relies on TinySDF for all text. However, it requires the tileset or GeoJSON data to include manually placed control characters to mark grapheme cluster boundaries. I think any required server-side hinting would significantly limit the deployment of complex text rendering to end users, as Mapbox discovered early on: mapbox/DEPRECATED-mapbox-gl#4 (comment). Even though Intl.Segmenter still falls short of the gold standard in HarfBuzz/Raqm/FriBiDi, it comes with such low overhead that GL JS might as well take advantage of it rather than allow Indic text to continue to get mangled.

Odds and ends

This branch also includes some miscellaneous fixes for things I spotted along the way. Some types were misspelled. Some unit tests relied on outdated fixtures that cast incompatible data to the expected data type; TypeScript only started flagging it once I modified the types just a little more.

These changes would fix maplibre#50 and maplibre#2384. I’m posting this draft in my own fork for now while I consider how to stage these changes in more manageable chunks and discuss the backwards compatibility issues with the MapLibre maintainers.

Additionally, the following proofs of concept are based on this one:

@1ec5 1ec5 self-assigned this Aug 19, 2024
@1ec5
Owner Author

1ec5 commented Aug 19, 2024

To see the demonstration screenshotted above:

  1. Run npm run build-dist.
  2. Drop the contents of this gist into a file named index.html in the dist/ folder.
  3. Run npm start.
  4. Open http://0.0.0.0:9966/dist/index.html#3.03/16.25/17.32

@claysmalley

claysmalley commented Aug 21, 2024

I had noticed some segmenting issues in Thai as a result of 304acf5, which led me on a quest to develop a better test suite for multilingual character segmentation.

The big thing with Indic scripts is conjunct consonants (i.e. ligatures) and combining vowel characters. Each script does this a little differently, but there are a few commonalities here and there:

  • Scripts of the northern Indian subcontinent tend to form conjuncts by reducing the first consonant to a "half form" and blending it into the following consonant. There are several exceptions.
  • Scripts of the southern subcontinent and Southeast Asia tend to form conjuncts by stacking a second consonant below, if at all. There are also several exceptions.
  • Thai and Lao don't have conjunct consonants. Consonant clusters are implied by context; there is no "virama" diacritic in modern use.
  • In conjuncts, /r/ often has a dramatically different appearance from its isolated form.
  • One notable exception within the subcontinent is the /kṣ/ conjunct (as in Lakshmi), which is usually a special form that looks unlike its components. This is even the case in Tamil, which otherwise has very few conjuncts compared to its neighbors.
  • In Burmese and Khmer (and some other scripts), consonant stacking is obligatory—these scripts have no "virama" diacritic that a renderer can simply fall back to. The Unicode codepoints referred to as VIRAMA in these scripts are actually Invisible Stackers (58ecf5c).
  • Thai and Lao are encoded in visual order (i.e. typewriter style) instead of logical order. Like other Indic scripts, Thai and Lao have certain vowel marks that are written preceding consonants. However, these particular vowels are encoded as standalone characters that literally precede their consonant, rather than being combining characters that advance the position of their attached consonant.
  • Tibetan is also encoded in visual order. Conjuncts are stacked, but each consonant is encoded with a separate combining character for the stacked version, instead of reusing the same codepoints with an Invisible Stacker in between.
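The difference between visual and logical encoding order in the list above is easy to inspect by dumping code points. This is a small debugging sketch; toCodePoints is a hypothetical helper, not part of GL JS.

```typescript
/** Hypothetical helper: lists the code points of a string in U+XXXX notation, in stored order. */
function toCodePoints(text: string): string[] {
    return Array.from(text, (ch) => {
        const hex = (ch.codePointAt(0) as number).toString(16).toUpperCase();
        // Left-pad to at least four hex digits without relying on padStart.
        return 'U+' + ('0000' + hex).slice(-Math.max(4, hex.length));
    });
}

// Thai (visual order): the vowel written before the consonant is also stored before it.
toCodePoints('\u0E41\u0E21\u0E48'); // ["U+0E41", "U+0E21", "U+0E48"]

// Devanagari (logical order): the vowel sign drawn to the left is stored after the consonant.
toCodePoints('\u0915\u093F'); // ["U+0915", "U+093F"]
```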

Here are a few test cases that I think are good indicators for poorly aligned combining characters and ligatures. They are especially effective when text-letter-spacing > 0.

1

"name_en": "Bengaluru",
"name_hi": "बेंगलुरु",
"name_gu": "બેંગલુરુ",
"name_pa": "ਬੈਂਗਲੁਰੂ",
"name_bn": "বেঙ্গালুরু",
"name_or": "ବେଙ୍ଗାଲୁରୁ",
"name_te": "బెంగళూరు",
"name_kn": "ಬೆಂಗಳೂರು",
"name_ml": "ബെംഗളൂരു",
"name_ta": "பெங்களூரு",
"name_si": "බැංගලෝර්",

2

"name_en": "Lakshmeshwara",
"name_hi": "लक्ष्मेश्वर",
"name_gu": "લક્ષ્મેશ્વર",
"name_pa": "ਲਕ੍ਸ਼੍ਮੇਸ਼੍ਵਰਾ",
"name_bn": "লক্ষ্মীশ্বর",
"name_or": "ଲକ୍ଷମେଶ୍ୱର",
"name_te": "లక్ష్మేశ్వర",
"name_kn": "ಲಕ್ಷ್ಮೇಶ್ವರ",
"name_ml": "ലക്ഷ്മേശ്വര",
"name_ta": "லக்ஷ்மேஸ்வரா",
"name_si": "ලක්ෂ්මේෂ්වර",

3

"name_en": "Mandalay",
"name_my": "မန္တလေးမြို့",

4

"name_en": "Mekong River",
"name_my": "မဲခေါင်မြစ်",
"name_th": "แม่น้ำโขง",
"name_lo": "ແມ່ນ້ຳຂອງ",
"name_km": "ទន្លេមេគង្គ",

5

"name_en": "Samdrup Jongkhar District",
"name_dz": "བསམ་གྲུབ་ལྗོངས་མཁར་རྫོང་ཁག་",

6

"name_en": "Blue Heron Nest Park",
"name_hur": "sməqʷəʔelə həw̓aləm̓ew̓txʷ",

@1ec5
Owner Author

1ec5 commented Aug 21, 2024

A live demo is available in osm-americana/openstreetmap-americana#1149. To make sure these changes are heading in the right direction, I solicited some feedback from native language speakers on the OSM Asia Telegram chat and OSM India Telegram chat.

@claysmalley

This comment was marked as resolved.

@claysmalley

This comment was marked as resolved.

@claysmalley

> I solicited some feedback from native language speakers on the OSM Asia Telegram chat and OSM India Telegram chat.

The Bangladesh, India, Nepal and Thailand subforums might also be good places to ask.

@1ec5

This comment was marked as resolved.

@claysmalley

> I’m of half a mind to replace mapbox-gl-rtl-text’s processBidirectionalText with something homegrown

Who needs mapbox-gl-rtl-text when you've got maplibre-gl-all-the-text? 😉

@1ec5
Owner Author

1ec5 commented Aug 22, 2024

I put in a workaround for now, because reimplementing bidirectional text support would be a large enough task for its own proof of concept. The workaround is to replace the zero-width joiner with an arbitrarily chosen strip marker and restore it after bidi processing. I don’t think this workaround will interfere with the mapbox-gl-rtl-text plugin’s Arabic shaping. OSM only has four name:ar=* tags that contain ZWJs, three of them seemingly by mistake and the fourth seemingly in Soranî, which isn’t supported by the plugin anyways.
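The workaround described above might be sketched as follows. The placeholder code point is my own arbitrary choice (a private-use character), not necessarily the marker the branch uses, and the bidi step is passed in as a callback so the sketch makes no claims about the plugin's real signature.

```typescript
const ZERO_WIDTH_JOINER = '\u200D';
// Hypothetical marker: a private-use code point assumed not to occur in label text.
const JOINER_PLACEHOLDER = '\uE000';

/**
 * Hypothetical helper: hides zero-width joiners from a bidi routine that would
 * strip them, then restores them in every line that the routine returns.
 */
function withProtectedJoiners(text: string, processBidi: (input: string) => string[]): string[] {
    const shielded = text.split(ZERO_WIDTH_JOINER).join(JOINER_PLACEHOLDER);
    return processBidi(shielded).map((line) => line.split(JOINER_PLACEHOLDER).join(ZERO_WIDTH_JOINER));
}
```

One caveat with this sketch is that a private-use character has its own bidi class, so a real implementation would need to confirm that the marker does not change how neighboring right-to-left runs are reordered.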

@1ec5 1ec5 force-pushed the complex-text-50 branch 2 times, most recently from 91e523c to be2d95e Compare August 22, 2024 03:30
@claysmalley

claysmalley commented Aug 22, 2024

(Edit: see following comment)

If there isn't a way to preserve the shirorekha across gaps between letters, then the text-letter-spacing property will make Bengali, Devanagari and Gurmukhi look clunky:

image

@1ec5
Owner Author

1ec5 commented Aug 23, 2024

> If there isn't a way to preserve the shirorekha across gaps between letters, then the text-letter-spacing property will make Bengali, Devanagari and Gurmukhi look clunky:

The real-world convention appears to be that the top bar should be broken up by syllable when introducing letter spacing. However, line-placed labels have gaps even without any letter spacing. #2 demonstrates a partial solution by shifting the baseline. Using the same information, we might even be able to artificially extend the top bar to either side of the glyph to close the gap even when the text is offset. However, the tradeoff is a very noticeable shift between hanging and alphabetic text, which may be undesirable.

The CSS specification describes a few example scenarios in which we would need to special-case the text segmentation differently for the purposes of line breaking, letter spacing, and text rendering. In some cases, we may even need to break apart and rearrange grapheme clusters to avoid a choppy appearance. I’m unsure whether this should be a high priority; it seems like native browser text layout doesn’t necessarily behave correctly either: w3c/iip#87.

Segment strings by grapheme cluster instead of by character when shaping and rendering text. Store glyphs, glyph requests, and glyph positions by grapheme cluster instead of by codepoint string. Added a simple polyfill for older versions of Firefox.
Increased the buffer around locally rendered glyphs.
Removed hard-coded fudge factors based on the baseline in Arial Unicode MS.
@1ec5
Owner Author

1ec5 commented Aug 27, 2024

Thanks, that’s good to know about. Unfortunately, unicode-segmenter’s unpacked size is larger than mapbox-gl-rtl-text, which GL JS fetches from a CDN lazily due to its size, so I’m not sure the maintainers would be open to including it as a dependency for everyone. In maplibre#4541 (comment), they were open to requiring newer browsers going forward. Specifically, I introduced /…/u literals that work for 97.56% of browser users. (Unsupported browsers would fail to load GL JS at all with a syntax error.) However, Intl.Segmenter is much newer: as written, this branch works for only 95.49% of browser users.

For compatibility’s sake, I added a much simpler polyfill that just splits the string by words based on RegExp word boundaries and by “grapheme clusters” based on Unicode character positions. Here’s what it’ll look like in a browser without support for Intl.Segmenter. Obviously it’s far from ideal, but hopefully it’s more usable than the status quo:

Broken combining diacritics galore.
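For reference, the word half of such a polyfill can be as small as the sketch below, which splits on RegExp word boundaries. This is my own approximation of the idea, with a hypothetical function name and return shape, and it shows why the fallback is limited: \b only understands ASCII word characters.

```typescript
interface FallbackSegment {
    segment: string;
    index: number;
}

/** Hypothetical fallback: split on RegExp word boundaries and record each piece's offset. */
function fallbackWordSegments(text: string): FallbackSegment[] {
    let index = 0;
    return text.split(/\b/).map((segment) => {
        const entry = {segment, index};
        index += segment.length;
        return entry;
    });
}
```

Because \b is defined in terms of ASCII letters, digits, and underscore, a Thai or Burmese string comes back as a single segment, so this fallback offers no line breaking opportunities inside it.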

@ramSeraph

Makes sense. Word breaks were not supported by that polyfill anyway (any attempt at that would have increased the size of the polyfill further). And thanks for pushing this through.

@claysmalley

When the language is set to Burmese, several tiles fail to load because something is undefined. I can't replicate this with any other language.

image

@1ec5
Owner Author

1ec5 commented Aug 28, 2024

This error seems to occur whenever the new line breaking code encounters Zawgyi-encoded text. For example, Pakistan is tagged name:my=ပါကစ္စတန်, which contains န် (U+1014 U+103A). Transforming it to valid Unicode, “ပါကစ်စတနျ”, would resolve the issue, at least for that particular label.

The broader issue is that I’ve tweaked the segmentation code to join grapheme clusters on virama characters, whereas the line breaking code’s word segmenter sometimes wants to break up the ligature. It only happens to be more common in Zawgyi-encoded text. This confuses a bit of code that tries to determine the grapheme’s advance based on the section’s formatting options.

I’ve tweaked it to gracefully fall back to the last known section. This means a grapheme after a word wrap might be formatted according to what precedes it. As with Arabic text, this PR essentially removes the ability to style one part of a grapheme cluster differently than another.

@claysmalley

Is name:my=ပါကစ္စတန် really Zawgyi-encoded? That seems like the correct Unicode representation of the name of Pakistan, at least according to Burmese Wikipedia.

@bdon's Burmese Encoding QA tool is a good place to find examples of Zawgyi text on OSM.

@1ec5
Owner Author

1ec5 commented Aug 28, 2024

That’s a wonderful tool! I was mistakenly assuming that the Zawgyi-my ICU transform would stabilize if fed Unicode-encoded text and jumping to the conclusion that it was misencoded. Anyhow, Intl.Segmenter consistently interprets an invisible stacker (such as a virama) as both a word boundary and a grapheme cluster boundary, but this PR combines the adjacent grapheme clusters. It’s unclear to me if we should therefore avoid a line break, but at least န် is fixed. I’m still seeing some errors involving ရှ် in ဘင်္ဂလားဒေ့ရှ် that I’ll need to investigate further.
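The cluster-combining step discussed here can be sketched as a post-processing pass over the segmenter's output. The stacker list below is a tiny hand-picked subset that I am assuming for illustration; the branch generates its rules from the Unicode Indic syllabic category data, and mergeClusters is a hypothetical name.

```typescript
// Built with the RegExp constructor so the Unicode property escape is parsed at run time.
const LEADING_MARK = new RegExp('^\\p{M}', 'u');

// Illustrative subset only: Devanagari virama, Myanmar virama, Khmer coeng.
const STACKERS = ['\u094D', '\u1039', '\u17D2'];

function endsWithStacker(cluster: string): boolean {
    return STACKERS.indexOf(cluster.charAt(cluster.length - 1)) !== -1;
}

/** Hypothetical helper: joins clusters that should render as one ligated unit. */
function mergeClusters(clusters: string[]): string[] {
    const merged: string[] = [];
    for (const cluster of clusters) {
        const last = merged.length - 1;
        // Join when this cluster starts with a combining mark or the previous one ends in a stacker.
        if (last >= 0 && (LEADING_MARK.test(cluster) || endsWithStacker(merged[last]))) {
            merged[last] += cluster;
        } else {
            merged.push(cluster);
        }
    }
    return merged;
}
```

Given the clusters of မန္တလေး as the built-in segmenter reports them in this thread, with a boundary after the invisible stacker, the pass rejoins န္ and တ so that the stacked consonant can be drawn as one glyph.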

@claysmalley

claysmalley commented Aug 28, 2024

> I was mistakenly assuming that the Zawgyi-my ICU transform would stabilize if fed Unicode-encoded text and jumping to the conclusion that it was misencoded.

Zawgyi and Unicode use the same codepoints in conflicting ways. The converter will happily let you convert any text from Zawgyi to Unicode as many times as you like, even if it's already Unicode. Automatically detecting the encoding of Burmese text is not a trivial task.

My first guess was that ရှ် is likely Zawgyi, since there are only a few consonants that the vowel killer mark can appear above, and that's not one of them. I was wrong; this is Unicode. I guess the rules are relaxed for foreign loanwords.

If a grapheme cluster begins with a combining diacritical mark or ends with an invisible stacker, combine it with an adjacent grapheme cluster to avoid drawing diacritics over dotted circles or placeholder diacritics where adjacent characters should be ligated instead.
Added a script that fetches the latest Unicode character database’s property file for Indic syllable categories and generates a function for combining graphemes based on it.
Replace zero-width joiners with temporary strip markers to prevent ICU from stripping them.
Preemptively swap combining marks with the characters they modify to visual order, so that the RTL plugin will swap them back to logical order.
Replaced custom word break heuristics when determining line breaks with a word segmenter. Added a simple polyfill for older versions of Firefox.
Fixed an issue where vertical CJK text was shifted upwards by one em.
Iterate over graphemes instead of words, looking for word boundaries to use as line breaking opportunities. This eliminates the possibility of word-wrapping in the middle of a grapheme cluster, which is valid in some writing systems such as Thai, but mitigates the risk of an invalid section index in Burmese, because the word segmenter considers some modifiers to be “words”.
@1ec5
Owner Author

1ec5 commented Sep 7, 2024

Upon closer inspection, the errors were actually caused by a mismatch between the word segmenter, the built-in grapheme cluster segmenter, and the modified grapheme cluster segmenter as to င်္ and in the same word. Since the section indices are tightly coupled to grapheme clusters, I’ve rewritten the line breaking code to iterate over grapheme clusters, looking for word boundaries, instead of the other way around. In theory, this may eliminate some valid line breaking opportunities in Thai and Khmer that split grapheme clusters, but optimal line breaking isn’t as critical as avoiding exceptions in text rendering.
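The inverted loop described above, iterating over grapheme clusters and asking whether each one starts at a word boundary, might look like this sketch. It uses the built-in grapheme segmenter rather than the branch's modified one, assumes Intl.Segmenter is available, and breakableClusterIndices is a hypothetical name.

```typescript
/**
 * Hypothetical helper: returns the indices of grapheme clusters that begin at a
 * word boundary. Word boundaries that fall inside a cluster are never reported,
 * because only cluster start offsets are ever looked up.
 */
function breakableClusterIndices(text: string, locale?: string): number[] {
    // Cast to any so this compiles even when the Intl.Segmenter typings are unavailable.
    const SegmenterCtor = typeof Intl !== 'undefined' ? (Intl as any).Segmenter : undefined;
    if (typeof SegmenterCtor !== 'function') {
        return [];
    }
    const startOffsets = (granularity: string): number[] =>
        Array.from(
            new SegmenterCtor(locale, {granularity}).segment(text),
            (part: {index: number}) => part.index
        );
    const wordStarts = startOffsets('word');
    const indices: number[] = [];
    startOffsets('grapheme').forEach((offset, clusterIndex) => {
        // The first cluster is never a break opportunity.
        if (clusterIndex > 0 && wordStarts.indexOf(offset) !== -1) {
            indices.push(clusterIndex);
        }
    });
    return indices;
}
```

Since every returned index refers to a whole cluster, any per-cluster bookkeeping such as section indices stays aligned, at the cost of dropping breaks that the word segmenter would place inside a cluster.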
