Update lang docs #51

Merged
merged 2 commits into from
Sep 4, 2024

Conversation

Contributor

@wipfli commented Aug 28, 2024

Happy for feedback @nvkelso!


Protomaps follows OpenStreetMap's convention where a feature's primary name value is the most common name in the local language(s).

In practice, this is most often a single name value like:

- `London` the locality is represented as a simple key, value pair: `name` = `London`

However, many places have more than one common local language, and Protomaps passes through OpenStreetMap's convention of concatenating multiple names with a `/` delimiter into a single name value, like:
However, many places have more than one common local language, and Protomaps passes through OpenStreetMap's convention of concatenating multiple names with a `/` or `-` delimiter into a single name value, like:
Member

Is this true/relevant anymore now that we require a language to be passed to the style?

Contributor Author

It is a bit less relevant than before. You will only see these when the local language and the target language use different scripts. Example: a map localized to Greek where you are looking at Bozen - Bolzano...


Protomaps structures localized names using the same `name:{language_code}` formatting as OpenStreetMap.
If a name from OpenStreetMap contains text in more than one script, then Protomaps breaks up the name into segments. There can be up to 3 segments: `name`, `name2`, and `name3`. Each segment should have a unique script.
Member

maybe clarify this is the name tag "de-facto primary local name" from OSM

Collaborator

How are the name2 and name3 synthetic properties used in the style?

Is there a downside to overwriting the value of name with (implied) name1?

Collaborator

Does the name splitting only happen when an allow listed delim (/ or -) is observed in the upstream OSM name tag?

Contributor Author

> maybe clarify this is the name tag "de-facto primary local name" from OSM

OK

> How are the name2 and name3 synthetic properties used in the style?

It depends on the target language and the scripts in `name2` and `name3`. For example, if the target language uses a different script from `name2` and `name3`, then those appear as the second and third lines. As another example, if the target language uses the same script as `name3` and a name in the target language is not available, then `name3` appears as the first line.
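
To make the two examples above concrete, here is a minimal sketch of how label lines could be assembled from the tile properties. It is not the actual Protomaps style logic: the function name `label_lines` and its `target_script` parameter are hypothetical, and the ordering rule is an assumption derived only from this comment. The property names (`name`, `name2`, `name3`, `pmap:script`, `pmap:script2`, `pmap:script3`, `name:{language_code}`) are the ones discussed in this PR.

```python
def label_lines(props, lang, target_script):
    """Return the text lines for a label (hypothetical helper, not the real style code)."""
    # Collect up to three name segments; a missing pmap:script* means Latin.
    segments = []
    for suffix in ("", "2", "3"):
        text = props.get("name" + suffix)
        if text:
            segments.append((text, props.get("pmap:script" + suffix, "Latin")))

    translated = props.get("name:" + lang)
    if translated:
        # A name in the target language exists: it becomes the first line.
        first = [translated]
    else:
        # Assumption: fall back to the first segment written in the target script.
        first = [text for text, script in segments if script == target_script][:1]

    # Segments in a different script follow as additional lines.
    others = [text for text, script in segments if script != target_script]
    return first + others
```

With `{"name": "Hong Kong", "name2": "香港", "pmap:script2": "Han"}`, a target language that uses the `Han` script and has no translated name puts `香港` first and `Hong Kong` second under these assumptions.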

> Is there a downside to overwriting the value of name with (implied) name1?

Not that I know of.

> Does the name splitting only happen when an allow listed delim (/ or -) is observed in the upstream OSM name tag?

No, it always happens when an OSM name contains more than one script. There are exceptions, for example when 5 Arabic Unicode code points are followed by the Latin letters "AB" and then by another 5 Arabic Unicode code points. That could be, for example, a Latin street letter in an Arabic street label. In that case the text is not segmented. Here is a link to some tests that cover special cases: https://github.com/protomaps/basemaps/blob/60e7d485c7fc6a4b28be525ebc03f6bdd4f20837/tiles/src/test/java/com/protomaps/basemap/names/ScriptSegmenterTest.java#L97-L127
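
For readers who want a feel for the segmentation described above, here is a rough sketch in Python. It is not the Java implementation linked above. The helper names `script_of` and `segment_name` are hypothetical, the script guess is a crude approximation based on Unicode character names rather than the real Unicode Script property, and the "embedded run" rule is an assumption modelled only on the Arabic/"AB" example.

```python
import unicodedata


def script_of(ch):
    """Rough script guess from the Unicode character name; None for neutral characters."""
    if not ch.isalpha():
        return None  # spaces, digits, punctuation, combining marks
    first_word = unicodedata.name(ch, "").split(" ")[0]
    if not first_word:
        return None
    # Character names of Han ideographs start with "CJK".
    return {"CJK": "Han"}.get(first_word, first_word.capitalize())


def segment_name(name, max_segments=3):
    """Split a name into per-script segments, or return it whole if that is not possible."""
    runs = []  # list of [script, text]
    for ch in name:
        script = script_of(ch)
        if runs and (script is None or script == runs[-1][0]):
            runs[-1][1] += ch  # same script, or a neutral character joining the run
        elif runs and runs[-1][0] is None:
            runs[-1][0] = script  # leading neutral characters adopt the first real script
            runs[-1][1] += ch
        else:
            runs.append([script, ch])
    scripts = [script for script, _ in runs]
    # Assumption: a script that reappears after another one (for example
    # Arabic, "AB", Arabic) marks an embedded run, so the name stays whole.
    if len(set(scripts)) < len(scripts) or len(runs) > max_segments:
        return [name]
    # Drop surrounding spaces and the "/" or "-" delimiters, if any.
    return [text.strip(" /-") for _, text in runs]
```

Under these assumptions `segment_name("Hong Kong 香港")` gives `["Hong Kong", "香港"]` without needing a delimiter, while a name with Latin letters embedded between two Arabic runs comes back unsegmented.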

basemaps/localization.md

## Positioned glyph font `pmap:pgf:name:*` values

Protomaps adds additional names for a small set of language scripts, currently just the `Devanagari` script used for Hindi (`name:hi` and `pmap:pgf:name:hi`) and related languages.

Rendering text in web browsers works for almost all languages and scripts and feels like magic. However, specialized map renderers like MapLibre have to reimplement text rendering and text layout which is complicated when text needs to be curved along linear map features instead of placed only horizontally or vertically. MapLibre normally assumes a one-to-one mapping between glyphs and Unicode codepoints that also exist in MapLibre font files (aka "font stacks") to accomplish the layout for a large but limited number of scripts. Plugins have been developed to extend MapLibre for **right-to-left** scripts like Arabic and Hebrew, and MapLibre has built-in support for **CJK scripts** like Chinese, Japanese, and Korean.

To facilitate Protomap's support of additional, non-supported scripts in MapLibre (like the Devanagari script used by the Hindi language), Protomaps exports names with "positioned glphys" so MapLibre can use codepoints as indices of positioned glyphs in an additional custom "font stack". While the raw `pmap:pgf:name:*` values look like giberish when inspecting the raw values, they render correctly in MapLibre to the end user.
To facilitate Protomap's support of additional, non-supported scripts in MapLibre (like the Devanagari script used by the Hindi language), Protomaps exports names with "positioned glphys" so MapLibre can use codepoints as indices of positioned glyphs in an additional custom "font stack". While the raw `pmap:pgf:name:*` values look like gibberish when inspecting the raw values, they render correctly in MapLibre to the end user.
Member

glphys -> glyphs

Contributor Author

OK

Collaborator

@nvkelso left a comment

I worry we're starting to mix up Protomaps the tile schema with Protomaps the map style a bit in these docs.

@@ -12,31 +12,81 @@ Protomaps has several localization options for names used in text labels.

<MaplibreMap/>

## Default `name` value
## Local Names
Collaborator

@bdon what's the Title Case versus Sentence case convention elsewhere in docs that we should be following?

Member

No rules right now! I feel Sentence case is more natural, but let's just fix them when we see them.


## Localized `name:*` values
Collaborator

Consider restoring this section heading?

Contributor Author

Moved further down and replaced with "Translated Names"


If `pmap:script*` is not present on a name, then it means that the name uses the `Latin` script.

Sometimes names might contain text in multiple scripts. In that case `pmap:script` is set to `Mixed`.
Collaborator

@nvkelso Aug 30, 2024

Sometimes a name might contain text in multiple scripts. In that case pmap:script is set to Mixed.

Collaborator

Does this happen only if the delim is not present? As a future todo, would we want to split the string until it's no longer Mixed? As it stands it's a little confusing that above we say we do split the various names, but then it's confusing it could possibly still be Mixed.

Contributor Author

Updated to:

> Sometimes segmentation into single scripts fails, for example due to inconsistent usage of alphabets. In that case `pmap:script` is set to `Mixed`.

(pmap:script2 absent)
```

The OSM name for "Zürich" only uses the Latin script and therefore we use in Protomaps only `name` and leave `pmap:script` away which implies that the script of the `name` is `Latin`.
Collaborator

The OSM name for "Zürich" only uses the Latin script so we export `name` but omit `pmap:script` (implying the script of the name is `Latin`).

Contributor Author

OK

(pmap:script3 absent)
```

The OSM name for Hong Kong is "Hong Kong 香港". We break this up into `name` and `name2` in Protomaps. Since the script of `name` is `Latin`, the `pmap:script` tag is omitted. The script of `name2` is `Han` which is encoded in `pmap:script2`.
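
As an illustration of the two examples (Zürich and Hong Kong), the following sketch builds the name properties from a list of already segmented names. The function `encode_name_properties` is hypothetical and only mirrors the convention described here: the first segment goes into `name`, later ones into `name2` and `name3`, and a `pmap:script*` tag is written only when the script is not `Latin`.

```python
def encode_name_properties(segments):
    """Build tile properties from (text, script) pairs; hypothetical helper."""
    props = {}
    # At most three segments are kept: name, name2, name3.
    for index, (text, script) in enumerate(segments[:3], start=1):
        suffix = "" if index == 1 else str(index)
        props["name" + suffix] = text
        # Latin is implied when no script tag is present, so it is omitted.
        if script != "Latin":
            props["pmap:script" + suffix] = script
    return props
```

For `[("Hong Kong", "Latin"), ("香港", "Han")]` this produces `name`, `name2`, and `pmap:script2` only, matching the description above.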
Collaborator

@nvkelso Aug 30, 2024

Should there have been a delim here in "Hong Kong 香港" (either a / or a -)?

Always spell out OpenStreetMap.

Contributor Author

> Should there have been a delim here in "Hong Kong 香港" (either a / or a -)?

No, it does not have a delimiter. "香港 Hong Kong" is in https://www.openstreetmap.org/node/7414774650

> Always spell out OpenStreetMap.

OK

- `name:zh-Hans` = `瑞士`
- `name:zh-Hant` = `瑞士`
- _... many other localized values..._

_NOTE: The Chinese (`zh`) examples above demonstrate how a single language can have multiple writing systems, in this case both simplified Chinese (`zh-Hans`) used in mainland China and traditional Chinese (`zh-Hant`) used in Taiwan. The value stored in `zh` could be either of those._
Collaborator

Restore the zh note, please.

Collaborator

If we always normalize zh in OSM to the two explicit variants (as is indicated in #51 (comment)), then this can be dropped.

Contributor Author

At the moment we only export name:zh-Hant and name:zh-Hans from OpenStreetMap to the tiles. If one or both of these are missing on the OSM feature, but name:zh is available, then we backfill name:zh into name:zh-Hans or name:zh-Hant.

If I remember correctly, you had a technique to say if a name:zh string was written in name:zh-Hans or name:zh-Hant in tilezen. Is that correct? If yes, how did you do it?
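
The backfill rule described in this comment can be sketched as follows. This is an illustration only, not the code used to build the tiles, and the function name `chinese_names` is hypothetical; the tag keys are the OpenStreetMap ones mentioned above.

```python
def chinese_names(osm_tags):
    """Return the Chinese name properties exported to the tiles (illustrative sketch)."""
    fallback = osm_tags.get("name:zh")
    exported = {}
    # Only the two explicit variants are exported; plain name:zh is never exported itself.
    for key in ("name:zh-Hans", "name:zh-Hant"):
        value = osm_tags.get(key) or fallback
        if value:
            exported[key] = value
    return exported
```

Note that the sketch copies `name:zh` unchanged into whichever variant is missing; it does not try to detect whether the string is simplified or traditional.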


To help solve this, Protomaps characterizes the script used in the default `name` value by adding a `pmap:script` tag.

Values in `pmap:script` follow the [ISO 15924](https://unicode.org/iso15924/iso15924-codes.html) standard codes for the representation of names of scripts and are summarized in the table below.
Collaborator

Please restore this explanation of the ISO language codes as a note below the table.

Contributor Author

Added a note that the script names are from Unicode Standard Annex #24: Script Names

| Language | Native name | `name:*` property | [ISO 639-2 code](https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes) | [ISO_639-1 code](https://en.wikipedia.org/wiki/ISO_639-1) | [ISO_15924 script(s)](https://unicode.org/iso15924/iso15924-codes.html) |
|--------|-----------------|-----------|-----|----|----|
| Arabic | اَلْعَرَبِيَّةُ | `name:ar` | ara | ar | `Arabic` |
| Bengali | বাংলা | `name:bn` | ben | bn | `Bengali` |
Collaborator

What's our longer term plan to support Bengali and Farsi (I think the only ones dropped from this list)?

Contributor Author

Bengali is to be added in the future as a positioned glyph font.

Farsi is still listed; we call it "Persian".

| ----- | ----- | ----- | ----- |
| Arabic | اَلْعَرَبِيَّةُ | `name:ar` | `Arabic` |
| Bulgarian | български | `name:bg` | `Cyrillic` |
| Chinese (Simplified) | 中文 汉语 | `name:zh-Hans` | `Han` |
Collaborator

Are we exporting zh or differentiating name:zh-Hans and name:zh-Hant?

Member

We are not exporting zh in the tileset, because that is not a useful developer-facing choice of locale. Ideally we normalize zh in the raw data into both zh-Hans and zh-Hant if they are the same.

| Urdu | اردو | `name:ur` | `Arabic` |
| Vietnamese | Tiếng Việt | `name:vi` | `Latin` |

_*) `Mixed-Japanese` is a custom `pmap:script` value used for labels that contain Hiragana or Katakana mixed with a second or third script. In Japanese, these two scripts often appear in combination with others._
Collaborator

* NOTE: Mixed-Japanese is a custom pmap:script value used for labels that contain Hiragana or Katakana mixed with a second or third script. In Japanese, these two scripts often appear in combination with others.

Contributor Author

OK


See more:

- [Traditional MapLibre Text Rendering](https://oliverwipfli.ch/about-text-rendering-in-maplibre-2023-10-17/)
- [Devanagari Positioned Glyph Fonts](https://oliverwipfli.ch/devanagari-in-the-protomaps-basemap-with-a-positioned-glyph-font-for-maplibre-2024-06-30/)

## Styling localized name
Collaborator

While it's ok to take out this placeholder section... we're tilting towards using Protomaps as a platform solution instead of a modular system of tile schema, styles, and data archives. Ideally we offer some tips on how to work with the raw tile data, too?

Member

You're right, I think this section under the basemaps directory concerns the combination of style + tileset.

I think Localization concerns the style, because that is the API surface that developers interact with - generating a GL style for a given language.

I do think there ought to be one section in the Basemaps directory for the schema with no style opinions - probably fleshed out https://docs.protomaps.com/basemaps/layers (issue #1)? That can mention localized name tags, and link out to this Localization page that has more prose?

Contributor Author

@wipfli commented Sep 3, 2024

Thanks for the reviews @nvkelso and @bdon. Please let me know if I missed something.

@bdon bdon merged commit 6c6875e into protomaps:main Sep 4, 2024
1 check passed
@wipfli wipfli deleted the lang branch September 4, 2024 12:56
3 participants