Strings now exposed externally as array of UCS-2 codepoints #542

natevw · 2014-10-07T00:11:49Z

This fixes most of colony's String compatibility issues stemming from the mismatch between JavaScript's pre–Unicode 2.0, and Lua's pre-historic, handling of Unicode string representations.

Background

JavaScript's string object, like many of its era, was intended to represent "Unicode" strings. However, when Unicode 2.0 was introduced it changed the definition of a codepoint (± think "character") so that it no longer fit within the 16-bit unsigned integer type which String was designed around. (I'd wager that most JS code running out there still does not bother to process characters in the supplementary planes correctly; perhaps justifiably so, in a pre-ES6 world.**)

In light of this history, it is now fair to say that JavaScript's strings are merely immutable buffers storing a block of 16-bit values; they are essentially "raw" UTF-16/UCS-2 encoded data.

Lua's strings are basically immutable buffers storing a block of 8-bit values, and (before this patch) colony was using these to store UTF-8 string data.

This string representation discrepancy (between JavaScript's UCS-2 and Colony's UTF-8) meant that only ASCII strings were fully compatible between implementations — even characters in the BMP would cause a mismatch in string lengths. For example, "€5".length equalled 2 in V8 ([0x20ac, 0x0035]), but 4 in colony ([0xe2, 0x82, 0xac, 0x35]).
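
The length mismatch can be reproduced in plain Node.js. This is only an illustrative sketch: it uses Node's `Buffer` to stand in for a UTF-8 byte string and does not touch colony itself.

```javascript
// Compare the two views of the same text: UTF-16 code units vs UTF-8 bytes.
const text = "\u20ac5"; // "€5"

// What JavaScript's String exposes: one 16-bit unit per BMP character.
const codeUnits = Array.from({ length: text.length }, (_, i) => text.charCodeAt(i));

// What a UTF-8 byte string holds: "€" takes three bytes, "5" takes one.
const utf8Bytes = Array.from(Buffer.from(text, "utf8"));

console.log(codeUnits.length, codeUnits.map((u) => u.toString(16))); // 2 [ '20ac', '35' ]
console.log(utf8Bytes.length, utf8Bytes.map((b) => b.toString(16))); // 4 [ 'e2', '82', 'ac', '35' ]
```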

There were a few options for remedying this discrepancy. Storing UTF-16 to Lua's string blocks? Keeping UTF-8 internally and splitting "astral" characters in all the places needed? See the original PR thread for some discussion. In the end something of a compromise/hybrid approach was chosen.

How strings are handled in Colony now

This pull request changes Colony's internal string representation to CESU-8. This has the advantage of being as compact as UTF-8 for BMP characters, but also keeping surrogate pairs split as UTF-16 does. So we can trivially match JS's ability to extract "👀"[1] while also (in theory*) handling I/O in the default UTF-8 encoding of node.js as a straight memcpy in common cases.
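
To make the CESU-8 trade-off concrete, here is a minimal sketch in Node.js. The helper name `toCesu8` is hypothetical and is not the runtime's actual code; it simply encodes each 16-bit code unit on its own, which is what keeps surrogate halves separate.

```javascript
// Hypothetical helper (not colony's implementation): encode a JS string as
// CESU-8 by giving every 16-bit code unit its own 1-3 byte sequence.
function toCesu8(str) {
  const bytes = [];
  for (let i = 0; i < str.length; i++) {
    const u = str.charCodeAt(i);
    if (u < 0x80) {
      bytes.push(u);
    } else if (u < 0x800) {
      bytes.push(0xc0 | (u >> 6), 0x80 | (u & 0x3f));
    } else {
      // Covers the rest of the BMP, including lone surrogate halves.
      bytes.push(0xe0 | (u >> 12), 0x80 | ((u >> 6) & 0x3f), 0x80 | (u & 0x3f));
    }
  }
  return bytes;
}

const hex = (bytes) => bytes.map((b) => b.toString(16)).join(" ");
const eyes = "\ud83d\udc40"; // "👀", U+1F440

console.log(hex(toCesu8("\u20ac5")));                    // e2 82 ac 35 (same as UTF-8)
console.log(hex(toCesu8(eyes)));                         // ed a0 bd ed b1 80
console.log(hex(Array.from(Buffer.from(eyes, "utf8")))); // f0 9f 91 80
```

For BMP-only text the bytes are identical to UTF-8, which is where the "straight memcpy in common cases" hope comes from; only astral characters differ (six bytes instead of four).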

This made it reasonably easy to implement the basic string methods/properties — those taking in a UCS-2 index convert it to a CESU-8 index ("JsToLua") before calling the Lua methods. Those needing to return a UCS-2 index can convert the opposite direction ("LuaToJs") from a Lua method result.
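
The index conversions can be sketched roughly as follows. The function names are hypothetical stand-ins for "JsToLua"/"LuaToJs", offsets are 0-based for simplicity (Lua's are 1-based), and the input is assumed to be well-formed CESU-8.

```javascript
// Width in bytes of the CESU-8 sequence that starts with lead byte `b`.
// CESU-8 never uses 4-byte sequences, so 3 is the maximum.
function seqLen(b) {
  if (b < 0x80) return 1;
  if (b < 0xe0) return 2;
  return 3;
}

// Hypothetical "JsToLua": UCS-2 index -> byte offset. Walks from the start,
// so the cost grows with the target index.
function jsToByteOffset(bytes, jsIndex) {
  let offset = 0;
  for (let i = 0; i < jsIndex && offset < bytes.length; i++) {
    offset += seqLen(bytes[offset]);
  }
  return offset;
}

// Hypothetical "LuaToJs": byte offset -> UCS-2 index.
function byteOffsetToJs(bytes, byteOffset) {
  let index = 0;
  let offset = 0;
  while (offset < byteOffset && offset < bytes.length) {
    offset += seqLen(bytes[offset]);
    index++;
  }
  return index;
}

// CESU-8 bytes for "€5👀": 3 + 1 + 3 + 3 bytes, 4 UCS-2 code units.
const sample = [0xe2, 0x82, 0xac, 0x35, 0xed, 0xa0, 0xbd, 0xed, 0xb1, 0x80];
console.log(jsToByteOffset(sample, 3));              // 7
console.log(byteOffsetToJs(sample, 7));              // 3
console.log(byteOffsetToJs(sample, sample.length));  // 4
```

Because every UCS-2 code unit maps to exactly one CESU-8 sequence, counting lead bytes is enough; with plain UTF-8 a 4-byte sequence would have to count as two units.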

Outside of string itself, code needed to be audited and in many cases corrected to make sure it distinguished between an "internal colony string" (which should be ± treated as opaque, unless accessed via JS methods) and externally needed representations (usually UTF-8).

Miscellany

This patch depends on tessel/colony-compiler#32 for pre-converting string literals into the correct internal representation.

This patch also adds toLowerCase/toUpperCase methods, which may not work quite right in case of strings that get longer on case change. (Personally I @natevw wonder if we could just implement these ASCII-only; this would basically let us get rid of the utf8proc dependency and its concomitant code tables.)

This patch does not finish adding/auditing all the encoding handling required of the Buffer object. Buffer.prototype.toString should be mostly correct, but does not yet handle 'utf16le'. And new Buffer(str, enc) is still in pretty bad shape. IMO that work is important but belongs in an additional patch once this lands.

This patch does not fix RegExp index values, which were completely missing throughout most of this work and so were deemed out of scope. LA LA LA CAN'T SEE THE PULL REQUEST THAT TRIES TO ADD THOSE LA LA IGNORING LA LA LA LA

* Right now no optimizations are done. One simple one was added (for ASCII↔︎CESU-8 conversions) but it led to a slight performance regression running npm test, so it was backed out. Right now all string "lookups" are O(m) based on target index. This will especially kill code that loops character-by-character through large strings; also note that each access of .length is unmitigatedly O(n).

** ES6 adds a number of facilities to help with full (post-2.0) Unicode support e.g. String.fromCodePoint and an extended literal character escaping syntax. This patch focuses on basic correctness and does not add support for these new methods/syntaxes.

natevw · 2014-10-07T00:44:10Z

Alrighty @tcr this is all yours.

tcr · 2014-10-07T01:46:57Z

@natevw Reviewing now. I've added a branch of runtime which allows you to attach a byte of metadata to strings: #543 Tested that all like strings ("apple", "app" .. "le") have the same data. The same API can be used on the JIT branch after some ifdefs.

tcr · 2014-10-07T09:24:22Z

deps/utf8proc/utf8proc.c

If the expectation is that we'll back out of utf8_proc, can this function be copied and moved to tm_utf8? I'm interested in keeping the deps/ unmodified, or preferably gone.

If we go back to ASCII-only .toUpperCase/.toLowerCase (which is not spec, but TBH I'm not sure what real-world use it serves outside of case-insensitive ASCII protocol stuff; since they don't do normalization they aren't properly useful for much "natural language" stuff anyway, but perhaps that's debatable…)

Anyway, if all we're using utf8_proc for is basic ~~UTF-8~~CESU-8 (and occasionally UTF-8) iteration/generation we could easily make our own versions in tm_str and drop it.

tcr · 2014-10-07T09:43:41Z

I'll do a more thorough code review tomorrow; I'm very excited by what's here.

As an aside, if you think utf8_proc doesn't add much and will require more modifications, we might as well rip out what we need. In particular, if the upper and lower tables can be recreated in a much smaller data structure, we'll save nearly 300kb in the runtime (no small feat!)

natevw · 2014-10-07T20:21:12Z

More on utf8_proc: I didn't realize until yesterday that this very branch is where utf8_proc and the Unicode-savvy upper/lower casing was introduced. The custom modification definitely felt dirty, but conveniently the dep wasn't a proper submodule and YOLO right?

So given that ASCII-only upper/lower would not be a regression, why don't we:

implement the two CESU-8/UTF-8 functions we actually need for the core string encoding work (bytes→uchar, uchar→bytes)
remove utf8_proc dep, leave ourselves with ASCII-only case changing for now
look into incorporating the code tables later (and in the context of full ES6-level Unicode support or even Intl…locale stuff is simply going to be a lot of data but happy to look into how tightly we can pack a good subset)

Should I go ahead and do the first two steps on this PR?

tcr · 2014-10-08T02:43:39Z

Tangent: taking a short break, decided to see how good GCC (well, clang) is at compressing uppercase/lowercase tables: https://github.com/tcr/unicode-tables

We can handle this later though, plain ASCII is good for now.

natevw · 2014-10-10T00:58:06Z

Alright, utf8proc is gone. We now have tm_utf8_decode/tm_utf8_encode functions which are used for both UTF-8 and CESU-8 needs. String casing now just crosses its fingers and uses lua.upper/lua.lower.
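
Why a single decoder can serve both encodings is perhaps easiest to see in code. The signatures of tm_utf8_decode/tm_utf8_encode are not shown in this thread, so `decodeAt` below is a hypothetical JavaScript stand-in, not the runtime's C API. Returning U+FFFD for a bad lead or continuation byte is an assumption, and overlong forms are not checked.

```javascript
// Hypothetical decoder: reads one sequence starting at `offset` and returns
// the decoded value plus the number of bytes consumed.
function decodeAt(bytes, offset) {
  const b0 = bytes[offset];
  let length;
  let value;
  if (b0 < 0x80) { length = 1; value = b0; }
  else if (b0 >= 0xc0 && b0 < 0xe0) { length = 2; value = b0 & 0x1f; }
  else if (b0 >= 0xe0 && b0 < 0xf0) { length = 3; value = b0 & 0x0f; }
  else if (b0 >= 0xf0 && b0 < 0xf8) { length = 4; value = b0 & 0x07; }
  else return { value: 0xfffd, length: 1 }; // assumption: bad lead byte -> U+FFFD

  for (let i = 1; i < length; i++) {
    const b = bytes[offset + i];
    // Assumption: a missing or malformed continuation byte also yields U+FFFD.
    if (b === undefined || (b & 0xc0) !== 0x80) return { value: 0xfffd, length: 1 };
    value = (value << 6) | (b & 0x3f);
  }
  return { value, length };
}

// UTF-8 input: the 4-byte form yields the full codepoint.
console.log(decodeAt([0xf0, 0x9f, 0x91, 0x80], 0).value.toString(16)); // 1f440
// CESU-8 input: the 3-byte form yields a surrogate half, i.e. one UCS-2 unit.
console.log(decodeAt([0xed, 0xa0, 0xbd], 0).value.toString(16));       // d83d
```

The only difference between the two uses is what the caller does with the result: CESU-8 callers treat surrogate values as ordinary 16-bit units, while UTF-8 callers would expect never to see them.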

I noticed that our implementation of tm_str_codeat is outdated (it includes unnecessary manual handling of UTF-8 pairs that are no longer possible) and tm_str_fromcode is a bit sketch (implicit buf_len and input codepoint range check). I may still try to clean those up too, but they are "correct" within current usage.

tcr · 2014-10-10T02:21:53Z

Rebased over master and removed utf8proc separately so this PR remains slim.

tcr · 2014-10-10T02:40:42Z

Okay, I've re-read through the patch. I'm happy with its current state and think that we can tackle individual issues in separate issues:

Add string UTF-8 and UCS-2 length as internal properties of Lua string
Optimize tm_str_codeat and tm_str_fromcode
Expose colony_ functions for external (really just internal, but clean) use.
Flag strings as ASCII-only as an optimization.
Speed up iteration/decode with optimized C functions.

The blockers for getting this merged are creating those issues on the issue tracker, and resolving the Travis commit that's broken (it's complaining -Weverything style about signedness). Once that's fixed, we can attempt merge on Rampart.

natevw · 2014-10-10T03:59:39Z

Just pushed a quick fix to the signedness trouble, but unfortunately seems the rebase broke the build somehow:

$ make colony
gyp  colony.gyp --depth=. -f ninja -D enable_ssl=1 -D enable_net=1 -D compiler_path="/Users/natevw/Desktop/Clients/Technical_Machine/runtime/node_modules/colony-compiler/bin/colony-compiler.js" && ninja -C out/Release
ninja: Entering directory `out/Release'
ninja: error: '../../src/colony/modules/events.js', needed by 'gen/dir_builtin.c', missing and no known rule to make it
make: *** [colony] Error 1

UPDATE: removed the offending line from libcolony.gyp and was able to "successfully" build but it's somehow horribly uninitialized when actually running:

$ colony test/suite/string.js 
(error traceback is not a string)
lua_pcall error 2

./test/tap.js:8: attempt to index global 'console' (a nil value)
[T]: src/colony/lua/colony-node.lua:71: attempt to index field 'process' (a nil value)
[T]: src/colony/lua/colony-node.lua:71: attempt to index field 'process' (a nil value)
[T]: src/colony/lua/colony-node.lua:71: attempt to index field 'process' (a nil value)
[T]: src/colony/lua/colony-node.lua:71: attempt to index field 'process' (a nil value)
[T]: src/colony/lua/colony-node.lua:71: attempt to index field 'process' (a nil value)
[T]: src/colony/lua/colony-node.lua:71: attempt to index field 'process' (a nil value)
[T]: src/colony/lua/colony-node.lua:71: attempt to index field 'process' (a nil value)
…

Wish I hadn't deleted my properly merged local branch to "force pull" over your rebase :-(

natevw · 2014-10-10T04:11:26Z

@tcr — I'm not going to keep playing whack-a-mole on this. Unless it's actually working fine and I just somehow have a bad working copy, can you please get npm test passing again on this branch?

tcr · 2014-10-10T05:56:34Z

The last commit should fix one rebase issue, .n vs .length.

Your specific issue: run make update before you make colony. That's a temporary but needed fix, it adds events.js and domain.js from the deps/node-libs folder.

I can hold off on rebasing branches in the future if that's going to be an issue. Sorry, tests passed fine on my machine after the rebase.

tcr · 2014-10-10T22:55:17Z

Final rebase. Tests are not passing on Tessel, bisecting now.

tm-rampart · 2014-10-11T00:13:53Z

Approved by @tcr. Running tests.

tcr · 2014-10-11T00:25:38Z

Landed! Beautiful! On to smaller patches.

Mild overhaul to finish up Buffer encoding work which was started in the ["Strings" pull request](#542) but wasn't really fully implemented or thoroughly checked over. Highlights:

- adds utf16le encoding/decoding (which had been completely absent)
- improves ascii/binary/hex/base64 logic too
- generally cleaner in/out logic

natevw mentioned this pull request Oct 7, 2014

Output string literals with CESU-8 sequences tessel/colony-compiler#32

Merged

tcr reviewed Oct 7, 2014

natevw added a commit that referenced this pull request Oct 10, 2014

back to ASCII-only casemappings for now, see #542 (comment)

96fb8e7

tm-rampart added a commit that referenced this pull request Oct 10, 2014

Merge #555 r=@tcr: Removes utf8proc as part of #542

dfae813

Prevents having a massive code subtraction as part of the PR.

natevw added a commit that referenced this pull request Oct 10, 2014

back to ASCII-only casemappings for now, see #542 (comment)

529d02b

tcr force-pushed the tcr-utf8 branch from ea352d9 to 12889f2 Compare October 10, 2014 02:21

natevw and others added 12 commits October 10, 2014 15:51

avoid accidental globals caught by @tcr in review

ecdfbce

add a few more string indexing test cases to lock in some (correct) behavior that came up in code review

5703ae0

properly, out-of-range string indexing returns undefined rather than `null`

30e0cf4

probably makes sense to explicitly test correct string length at some point too, eh?

b954866

we can use math.huge here now, so avoid extra str.length calls

4ff8963

back to ASCII-only casemappings for now, see #542 (comment)

5901a31

drop stale comment

9bafcb3

no more utf8proc uses

7484f14

fix regression in proper handling of bad UTF-8 lead sequence byte

1ca3a74

Fixes misplaced .length for .n.

72f41fd

Fixes signedness issues.

8f26859

Updates Travis to use "make test" instead of "npm test".

1926149

tcr force-pushed the tcr-utf8 branch from c2fe3c2 to 1926149 Compare October 10, 2014 22:54

Fixes running bytecode through require()

dc1d438

tm-rampart added a commit to tessel/t1-firmware that referenced this pull request Oct 11, 2014

tessel/t1-runtime#542 r=@tcr: Strings now exposed externally as array of UCS-2 codepoints

2a90f0f

tm-rampart merged commit dc1d438 into master Oct 11, 2014

tcr deleted the tcr-utf8 branch October 11, 2014 00:25

tcr restored the tcr-utf8 branch October 11, 2014 00:25

tcr mentioned this pull request Oct 11, 2014

String.fromCharCode doesn't support characters > 0xFF #328

Closed

natevw referenced this pull request Oct 13, 2014

Remove use of mbtowcs in hsregex.

84c055c

natevw mentioned this pull request Nov 18, 2014

Fix up Buffer's handling of all the encodings #645

Merged

Frijol deleted the tcr-utf8 branch August 20, 2015 16:52
