Use of "character" isn't technically correct in description of version string sections #74

Open
daidoji opened this issue Mar 7, 2024 · 0 comments

daidoji commented Mar 7, 2024

In the version string sections we have the following (in various places, with emphasis added):

In Version String 2 sections

It provides a regular expression target for determining a serialized field map’s serialization format and size (character count) of its enclosing field map.

This length is the total number of characters in the serialization of the field map. The maximum length of a given field map serialization is thereby constrained to be 64^4 = 2^24 = 16,777,216 characters in length.

In Version String 1 sections

This length is the total number of characters in the serialization of the field map. The maximum length of a given field map serialization is thereby constrained to be 16^6 = 2^24 = 16,777,216 characters in length. For example, when the length of serialization is 384 decimal characters/bytes, the length part of the Version String has the value 000180.
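The arithmetic in the quoted passages can be checked with a short sketch. The helper name `hex_length_field` is hypothetical (it is not taken from keripy or the spec); it only assumes that the Version String 1 length part is a fixed-width, zero-padded, lowercase hex number of six digits, as the `000180` example suggests.

```python
def hex_length_field(size: int, width: int = 6) -> str:
    """Format size as a zero-padded lowercase hex string of fixed width.

    Hypothetical helper for illustration only.
    """
    if not 0 <= size < 16 ** width:
        raise ValueError(f"size {size} does not fit in {width} hex digits")
    return f"{size:0{width}x}"


# Six hex digits and four Base64 digits both give 2^24 distinct values.
assert 16 ** 6 == 64 ** 4 == 2 ** 24 == 16_777_216

print(hex_length_field(384))  # 000180
```

Note that a six-digit hex field represents the values 0 through 16,777,215, so 16,777,216 is the count of representable values rather than the largest one.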

In Unicode, an encoding maps code points to bytes, and one or more code points combine to make a grapheme. In UTF-8 (implied for JSON and CBOR, although counting characters in MGPK, which is a binary encoding, doesn't make much sense at all), code points below 128 are 1 byte each, but code points at 128 and above are encoded as sequences of 2, 3, or 4 bytes. So although ASCII text will be 1 byte for 1 character, text in other languages might not be. Consider the Tamil string below, which Python's `len` reports as 6 code points even though its UTF-8 encoding is 18 bytes:

>>> len('வணக்கம')
6
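To make the gap between code points and bytes concrete, here is a minimal sketch using only the Python standard library. The string is the same Tamil text as above, spelled out as escapes so the code point sequence is unambiguous; the variable names are illustrative.

```python
# The Tamil string from the example above, written as explicit code points.
text = "\u0bb5\u0ba3\u0b95\u0bcd\u0b95\u0bae"

code_points = len(text)                  # str length counts code points
utf8_bytes = len(text.encode("utf-8"))   # bytes length counts encoded bytes

print(code_points)  # 6
print(utf8_bytes)   # 18, since each of these code points takes 3 bytes in UTF-8
```

The standard library has no grapheme segmentation, so a grapheme count would need a third-party package and is left out here.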

This issue will be resolved in the spec with what I think is the keripy approach (and a correct one, as far as I can think it through): these size calculations should be the result of counting the bytes of the fully encoded serialization of each field map, rather than its code points or graphemes. This is in accordance with the larger TLV scheme.
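A rough sketch of the byte-counting approach for a JSON field map follows. The field map, its `name` field, and the serialization settings (compact separators, `ensure_ascii=False`) are assumptions made for illustration and are not quoted from keripy or the spec.

```python
import json

# Hypothetical field map; the field name and value are illustrative only.
field_map = {"name": "\u0bb5\u0ba3\u0b95\u0bcd\u0b95\u0bae"}

# Assumed serialization: compact JSON with non-ASCII text kept as-is.
serialized = json.dumps(field_map, separators=(",", ":"), ensure_ascii=False)
raw = serialized.encode("utf-8")

size_chars = len(serialized)  # code points in the serialized text
size_bytes = len(raw)         # bytes in the encoded serialization

print(size_chars)  # 17
print(size_bytes)  # 29, the value a byte-based length field would carry
```

With these settings the two counts differ by 12, so a reader that treats a character-based length as a byte offset would stop short of the end of the serialization.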

https://docs.python.org/3/howto/unicode.html

daidoji self-assigned this Mar 26, 2024