Output string literals with CESU-8 sequences #32

natevw · 2014-09-25T00:13:57Z

This changes colony's internal string encoding from UTF-8 to CESU-8.

Will need a coordinated release with tessel/t1-runtime#542.

natevw · 2014-10-07T00:47:06Z

Although I say "coordinated" above, AFAICT this could be integrated right now with no particular change to the correctness/incorrectness of 'master' branch. Basically this needs to go out either before or alongside the runtime changes.

tcr · 2014-10-07T18:34:20Z

Approved. r+

Output string literals with CESU-8 sequences

@natevw

…odepoints

This fixes most of colony's String compatibility issues stemming from the mismatch between JavaScript's pre–Unicode 2.0, and Lua's pre-historic, handling of Unicode string representations.

## Background

JavaScript's string object, like many of its era, was intended to represent "Unicode" strings. However, when Unicode 2.0 was introduced it changed the definition of a codepoint (± think "character") so that it no longer fit within the 16-bit unsigned integer type which `String` was designed around. (I'd wager that most JS code running out there still does not bother to process characters in the supplementary planes correctly; perhaps justifiably so, in a pre-ES6 world.\*\*)

In light of this history, it is *now* fair to say that JavaScript's strings are merely immutable buffers storing a block of 16-bit values; they are essentially "raw" UTF-16/UCS-2 encoded data.

Lua's strings are basically immutable buffers storing a block of 8-bit values, and (before this patch) colony was using these to store UTF-8 string data.

This string representation discrepancy (between JavaScript's UCS-2 and Colony's UTF-8) meant that only ASCII strings were fully compatible between implementations; even characters in the BMP would cause a mismatch in string lengths. For example, `"€5".length` equalled 2 in V8 (`[0x20ac, 0x0035]`), but 4 in colony (`[0xe2, 0x82, 0xac, 0x35]`).

There were a few options for remedying this discrepancy. Storing UTF-16 to Lua's string blocks? Keeping UTF-8 internally and splitting ["astral"](https://mathiasbynens.be/notes/javascript-unicode) characters in all the places needed? See the [original PR thread](#137) for some discussion. In the end something of a compromise/hybrid approach was chosen.

## How strings are handled in Colony now

This pull request changes Colony's internal string representation to [CESU-8](http://en.wikipedia.org/wiki/CESU-8).
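To make the encoding concrete, here is a minimal sketch of CESU-8 encoding in JavaScript. The helper name `encodeCesu8` is hypothetical and is not taken from colony-compiler's source; it only illustrates the idea of encoding each 16-bit code unit on its own, so a surrogate pair becomes two 3-byte sequences rather than one 4-byte UTF-8 sequence.

```javascript
// Hypothetical helper (an assumption, not colony-compiler's actual code):
// encode a JS string as CESU-8 by treating every UTF-16 code unit as if it
// were a standalone code point.
function encodeCesu8(str) {
  const bytes = [];
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i); // one 16-bit code unit, 0x0000..0xFFFF
    if (unit < 0x80) {
      bytes.push(unit); // ASCII: one byte
    } else if (unit < 0x800) {
      bytes.push(0xc0 | (unit >> 6), 0x80 | (unit & 0x3f)); // two bytes
    } else {
      // Three bytes. Surrogate halves (0xD800..0xDFFF) land here too and are
      // encoded individually, which is what distinguishes CESU-8 from UTF-8.
      bytes.push(
        0xe0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3f),
        0x80 | (unit & 0x3f)
      );
    }
  }
  return bytes;
}

const euro = encodeCesu8("€5");            // [0xe2, 0x82, 0xac, 0x35], same as UTF-8
const eyes = encodeCesu8("\ud83d\udc40");  // "👀": [0xed, 0xa0, 0xbd, 0xed, 0xb1, 0x80]
```

For BMP-only text the bytes are identical to UTF-8. Only astral characters differ: 6 bytes in CESU-8 against 4 in UTF-8.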
CESU-8 has the advantage of being as compact as UTF-8 for BMP characters, while also keeping surrogate pairs split as UTF-16 does. So we can trivially match JS's ability to extract `"👀"[1]` while also (in theory\*) handling I/O in the default UTF-8 encoding of node.js as a straight memcpy in common cases.

This made it reasonably easy to implement the basic string methods/properties: those taking in a UCS-2 index convert it to a CESU-8 index ("JsToLua") before calling the Lua methods. Those needing to return a UCS-2 index can convert the opposite direction ("LuaToJs") from a Lua method result.

Outside of string itself, code needed to be audited and in many cases corrected to make sure it distinguished between an "internal colony string" (which should be ± treated as opaque, unless accessed via JS methods) and externally needed representations (usually UTF-8).

## Miscellany

This patch **depends on** tessel/colony-compiler#32 for pre-converting string literals into the correct internal representation.

This patch also adds toLowerCase/toUpperCase methods, which may not work quite right in case of strings that get longer on case change. (Personally I, @natevw, wonder if we could just implement these ASCII-only; this would basically let us get rid of the utf8proc dependency and its concomitant code tables.)

This patch **does not** finish adding/auditing all the encoding handling required of the `Buffer` object. `Buffer.prototype.toString` should be mostly correct, but does not yet handle 'utf16le'. And `new Buffer(str, enc)` is still in pretty bad shape. IMO that work is important but belongs in an additional patch once this lands.

This patch **does not** fix RegExp index values, which were completely missing throughout most of this work and so were deemed out of scope. LA LA LA CAN'T SEE THE PULL REQUEST THAT TRIES TO ADD THOSE LA LA IGNORING LA LA LA LA

\* Right now no optimizations are done. One simple one was added (for ASCII↔︎CESU-8 conversions) and led to a slight performance *regression* running `npm test` and was backed out. Right now all string "lookups" are O(m), where m is the target index. This will especially kill code that loops character-by-character through large strings; also note that each access of `.length` is unmitigatedly O(n).

\*\* ES6 adds a number of facilities to help with *full* (post-2.0) Unicode support, e.g. `String.fromCodePoint` and an extended literal character escaping syntax. This patch focuses on basic correctness and does not add support for these new methods/syntaxes.
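The index conversions and their cost can be sketched as follows. This is an illustration under stated assumptions, not the runtime's code: the function names only echo the "JsToLua"/"LuaToJs" labels from the commit message, the bodies are guesses at the general technique, and byte offsets are 0-based here even though Lua's own string functions are 1-based.

```javascript
// Sketch only: names mirror the "JsToLua"/"LuaToJs" labels in the commit
// message, but these bodies are assumptions, not colony's runtime code.

// In CESU-8 every UCS-2 code unit is 1 to 3 bytes with exactly one lead byte;
// continuation bytes always look like 0b10xxxxxx.
function isLeadByte(b) {
  return (b & 0xc0) !== 0x80;
}

// UCS-2 index -> byte offset. Scans from the start, so cost grows with the index.
function jsToLua(bytes, jsIndex) {
  let units = 0;
  for (let offset = 0; offset < bytes.length; offset++) {
    if (isLeadByte(bytes[offset])) {
      if (units === jsIndex) return offset;
      units++;
    }
  }
  return bytes.length; // index at or past the end of the string
}

// Byte offset -> UCS-2 index: count the lead bytes before the offset.
function luaToJs(bytes, offset) {
  let units = 0;
  for (let i = 0; i < offset && i < bytes.length; i++) {
    if (isLeadByte(bytes[i])) units++;
  }
  return units;
}

// .length must walk the entire buffer, hence O(n) on every access.
function jsLength(bytes) {
  return luaToJs(bytes, bytes.length);
}

// CESU-8 bytes for "€5👀": 3 + 1 + 3 + 3 bytes, 4 UCS-2 code units.
const sample = [0xe2, 0x82, 0xac, 0x35, 0xed, 0xa0, 0xbd, 0xed, 0xb1, 0x80];
const lowSurrogateOffset = jsToLua(sample, 3); // 7, the start of the low surrogate
const sampleLength = jsLength(sample);         // 4
```

A character-by-character loop over such a buffer repeats the scan for every index, which is the quadratic behaviour the first footnote warns about.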

natevw added 4 commits September 24, 2014 16:13

fix encoding typo

44cbccf

simplify non-safe string contents regexp

02057ce

one way to get colony-compiler to emit CESU-8 string data (will not work if colony used as library in unsuitable environment, though…)

d809823

provide our own CESU-8 encoding so we aren't dependent on an environment variable

0baa664

natevw mentioned this pull request Sep 25, 2014

[NRY] Implements Unicode (UCS-2) strings into VM tessel/t1-runtime#137

Closed

natevw mentioned this pull request Oct 7, 2014

Strings now exposed externally as array of UCS-2 codepoints tessel/t1-runtime#542

Merged

natevw changed the title ~~[NRY] Output string literals with CESU-8 sequences~~ Output string literals with CESU-8 sequences Oct 7, 2014

tcr added a commit that referenced this pull request Oct 7, 2014

Merge pull request #32 from natevw/nvw-cesu8

295e79a

Output string literals with CESU-8 sequences

tcr merged commit 295e79a into tessel:master Oct 7, 2014
