[NRY] Implements Unicode (UCS-2) strings into VM #137

tcr · 2014-06-14T21:06:49Z

Not ready yet.

Conflicts: src/colony/lua/colony-js.lua src/colony/lua_tm.c test/tm/test.c

natevw · 2014-09-12T23:15:18Z

I can't build this branch. I get:

Undefined symbols for architecture x86_64:
  "_tm_ucs2_str_charat", referenced from:
      _l_tm_ucs2_str_charat in liblibcolony.a(libcolony.lua_tm.o)
  "_tm_ucs2_str_length", referenced from:
      _l_tm_ucs2_str_length in liblibcolony.a(libcolony.lua_tm.o)
  "_tm_utf8_char_encode", referenced from:
      _l_tm_utf8_char_encode in liblibcolony.a(libcolony.lua_tm.o)
  "_tm_utf8_str_tolower", referenced from:
      _l_tm_utf8_str_tolower in liblibcolony.a(libcolony.lua_tm.o)
  "_tm_utf8_str_toupper", referenced from:
      _l_tm_utf8_str_toupper in liblibcolony.a(libcolony.lua_tm.o)
ld: symbol(s) not found for architecture x86_64

Any ideas? Both the l_tm and the tm definitions are in the same file, but somehow the latter are getting optimized out?

natevw · 2014-09-15T23:41:55Z

Build fixed by natevw@9189dc6

…ht again

natevw · 2014-09-18T01:09:32Z

I think I have a plan. It is horrible and I like it.

Paragraph number background

Currently, colony stores strings as UTF-8, because Lua. We'd like to keep this, because Lua and default node.js encoding and memory usage. Memory usage is important. (So is Lua AFAICT.)

But JavaScript strings are pretty much UTF-16 arrays, and all our APIs have to act like that. This sucks, because unless we store a UTF-16 copy (which we can't even because all Lua strings share the same metatable) we ± have to convert from UTF-8 to UTF-16 every time any variety of substring/charCode is accessed. Which particularly sucks if the user is for (var i = 0; i < str.length; ++i) something(str[i]);ing in their code.

Paragraph number plan

We need an efficient (speed and storage) way to convert between UTF-16 indexes and UTF-8 indexes. Storing a size_t* buffer the length of the UTF-8 string is right out, it's more memory and missing a metatable to hang it on per-string anyway [maybe there's tricks to weak map instances to values but then it's still] more memory.

Here is my horrible idea:

colony-compiler munges strings during compilation
…it munges them to make sure they have CESU-80 split surrogate pairs
…it munges them to make sure the sequences are always either 1 or 3 bytes [depends on previous, required by next]
…it munges them to prefix a list of where all the multibyte sequences are (these offsets are themselves UTF-8 encoded ;-)

With that, we have to be careful how we use the raw Lua strings, but now e.g. str_mt.__index can do something like this:

function (str, idx16)
   -- *handwavy part* read prefix of str and convert idx to idx8+mbflag
   return (mbflag) ? str.sub(idx8, 3) : str.sub(idx8, 1)
end

…which looks suboptimal (and also Lua doesn't ternary so who does this guy think he's trying to fool) but I think will actually be waaaay more optimal than the current code, which involves all sorts of conversions which are all done by hops in/out of C and also require reading the whole string.

In the presumably common case of ASCII input, the prefix is just a single leading '\0' terminator and idx8 is just 1 (or 2? because Lua?)

Other considerations: a proper UTF-8 encoding must use the shortest representation (i.e. sometimes 2 bytes). If this matters (CESU isn't proper UTF-8 anyway, but still) so when making a Buffer from a string we might want to fix. But we already know where all the 3 bytes are. (This 3 byte thing is sort of the key to everything: otherwise our prefix would need to encode both the position and the length of multibyte sequences and who wants to do that? We don't have to, because we can pad 2-byte values to 3, and assume we won't get longer sequences because we're actually representing UTF-16 with our UTF-8…)

Paragraph number what next

If you've made it this far, and aren't scared, and don't tell me not too, I should get on with actually trying to fix colony UCS-2 support via this strategy. It will require a (hopefully smallish) change to colony-compiler, changing the string and regex logic to respect+leverage the prefix, making sure Buffer to/froms work too.)

natevw · 2014-09-18T22:56:13Z

Based on various accomplishments like such as in "Discussing with @tcr" and also "The Sleeping On It" here's an update on the horrible idea above:

It is horrible. It would likely have a ton of implications that would need to be dealt with, e.g. how to ensure obj.thing is the same as obj['thing'] and whatnot.
There's probably a way to track per-string metadata somewhere. There's certainly a will.

I'm still enamored with the general concept of tracking the offsets (and probably lengths, sans compiler munging) of only the multibyte sequence, or at least a tight/optimized-for-ASCII (table,index,lookup) thing is the way to go.

As a first pass, my plan now is to refactor the code to use a "helper function" for indexing — at first this helper function will be an inefficient brute force ordeal, then [assuming the approach proves useful] I hope to figure out how to do some sort of per-string-hash weakmap metadata association to make it fast.

natevw · 2014-09-18T23:00:28Z

(for later: http://www.lua.org/pil/17.html)

…ed fixup

raffecat · 2014-09-19T01:27:44Z

My admittedly unrefined thoughts on the matter came down to this: most strings in js tend to be used in an opaque manner: as field names, concatenated, read or written as utf-8; in all these cases, using utf-8 internally is fast, compact, and avoids conversions - especially for web content.

There are only a few places that must behave as if the encoding is UTF-16:

length property: this can walk the string on first access and cache UTF-16 length (including would-be surrogate pairs) in the string header, i.e. modify the VM to add a cached-length field.
charAt, charCodeAt, slice, substring: cache the last offset used in the string header, and search from there next time (inline caching) - makes most typical loops efficient enough, even backwards iteration; utf-8 is self-synchronizing.
search methods: the storage encoding doesn't have a great impact here, except for the start/end positions, which can make use of the cached offset.

Originally I thought of switching between 8-bit and UTF-16 internally when the string is created (Lua strings are immutable, and are actually pre-hashed at creation time for table lookups) - i.e. if the raw data contains any code points > 255, internally use an array of 16-bit code points instead of 8-bit data. This would impact every function that touches Lua strings, so I expect it would be a lot more work and would impact the LuaJIT runtime as well.

…that effort anyway

natevw · 2014-09-20T00:12:17Z

Made some good progress on this today. My current branch isn't much more "fixed" yet, but the .length and __index functions are now using an IMO better factored approach that should fingers crossed be nicely applicable everywhere else too.

Interesting things I've discovered along the way:

something in colony-compiler is already sanitizing UTF-8 (didn't look what) — this is good and bad
utf8proc is very strict about UTF-8 correctness, so we can't use it with CESU-8 or other non-sanitized forms

Neither of these is a huge showstopper but may affect the final approach to handling the "surrogate pairs" stuff. Before I worry about that though, I plan to do a first pass at everything assuming that we'll get CESU-8 right as the dust settles.

…casing tests now passing!

…t freed, but must point our caller to the copy Lua made

87c4055

…could significantly increase high-water memory mark, but since str_from_utf8 result is pushed onto lua stack the final value that is held on to will not have any unnecessary padding

Conflicts: src/colony/lua/colony-js.lua

…a bit more encoding savvy

…sions (which should *not* be returning bytestring as if it were a sanitized internal colony string!)

… tests a bit

…ome are for "internal colony string" (was called ucs2) and a few are for special UTF-8 stuff

`time npm test` — before real 1m18.064s user 0m47.821s sys 0m10.795s `time npm test` — after real 1m21.550s user 0m47.743s sys 0m10.626s So, um, phooey?

This reverts commit 57e046d.

natevw · 2014-10-06T23:21:56Z

Closing this PR, going to start make a new one with clearer notes about what it accomplishes.

@natevw

…odepoints This fixes most of colony's String compatibility issues stemming from the mismatch between JavaScript's pre–Unicode 2.0, and Lua's pre-historic, handling of Unicode string representations. ## Background Javascript's string object — like many of its era — was intended represent "Unicode" strings. However, when Unicode 2.0 was introduced it changed the definition of a codepoint (± think "character") so that it no longer fit within the16-bit unsigned integer type which `String` was designed around. (I'd wager that most JS code running out there still does not bother to process characters in the supplementary planes correctly; perhaps justifiably so, in a pre-ES6 world.\*\*) In light of this history, it is *now* fair to say that Javascript's strings are merely immutable buffers storing a block of 16-bit values; they are essentially "raw" UTF-16/UCS-2 encoded data. Lua's strings are basically immutable buffers storing a block of 8-bit values, and (before this patch) colony was using these to store UTF-8 string data. This string representation discrepancy (between JavaScript's UCS-2 and Colony's UTF-8) meant that only ASCII strings were fully compatible between implementations — even characters in the BMP would cause a mismatch in string lengths. For example, `"€5".length` equalled 2 in V8 (`[0x20ac, 0x0035]`), but 4 in colony (`[0xe2, 0x82, 0xac, 0x35]`). There were a few options for remedying this discrepancy. Storing UTF-16 to Lua's string blocks? Keeping UTF-8 internally and splitting ["astral"](https://mathiasbynens.be/notes/javascript-unicode) characters in all the places needed? See the [original PR thread](#137) for some discussion. In the end something of a compromise/hybrid approach was chosen. ## How strings are handled in Colony now This pull request changes Colony's internal string representation to [CESU-8](http://en.wikipedia.org/wiki/CESU-8). This has the advantage of being as compact as UTF-8 for BMP characters, but also keeping surrogate pairs split as UTF-16 does. So we can trivially match JS's ability to extract `"👀"[1]` while also (in theory\*) handling I/O in the default UTF-8 encoding of node.js as a straight memcpy in common cases. This made it reasonably easy to implement the basic string methods/properties — those taking in a UCS-2 index convert it to a CESU-8 index ("JsToLua") before calling the Lua methods. Those needing to return a UCS-2 index can convert the opposite direction ("LuaToJs") from a Lua method result. Outside of string itself, code needed to be audited and in many cases corrected to make sure it distinguished between an "internal colony string" (which should be ± treated as opaque, unless accessed via JS methods) and externally needed representations (usually UTF-8). ## Miscellany This patch **depends on tessel/colony-compiler#32 for pre-converting string literals into the correct internal representation. This patch also adds toLowerCase/toUpperCase methods, which may not work quite right in case of strings that get longer on case change. (Personally I @natevw wonder if we could just implement these ASCII-only; this would basically let us get rid of the utf8proc dependency and its concomitant code tables.) This patch **does not** finish adding/auditing all the encoding handling required of the `Buffer` object. `Buffer.prototype.toString` should be mostly correct, but does not yet handle 'utf16le'. And `new Buffer(str, enc)` is still in pretty bad shape. IMO that work is important but belongs in an additional patch once this lands. This patch **does not** fix RegExp index values, which were completely missing throughout most of this work and so were deemed out of scope. LA LA LA CAN'T SEE THE PULL REQUEST THAT TRIES TO ADD THOSE LA LA IGNORING LA LA LA LA \* Right now no optimizations are done. One simple one was added (for ASCII↔︎CESU-8 conversions) and led to a slight performance *regression* running `npm test` and was backed out. Right now all string "lookups" are O(m) based on target index. This will especially kill code that loops character-by-character through large strings; also note that each access off `.length` is unmitigatedly O(n). \*\* ES6 adds a number of facilities to help with *full* (post-2.0) Unicode support e.g. `String.fromCodePoint` and an extended literal character escaping syntax. This patch focuses on basic correctness and does not add support for these new methods/syntaxes.

@natevw

…odepoints This fixes most of colony's String compatibility issues stemming from the mismatch between JavaScript's pre–Unicode 2.0, and Lua's pre-historic, handling of Unicode string representations. ## Background Javascript's string object — like many of its era — was intended represent "Unicode" strings. However, when Unicode 2.0 was introduced it changed the definition of a codepoint (± think "character") so that it no longer fit within the16-bit unsigned integer type which `String` was designed around. (I'd wager that most JS code running out there still does not bother to process characters in the supplementary planes correctly; perhaps justifiably so, in a pre-ES6 world.\*\*) In light of this history, it is *now* fair to say that Javascript's strings are merely immutable buffers storing a block of 16-bit values; they are essentially "raw" UTF-16/UCS-2 encoded data. Lua's strings are basically immutable buffers storing a block of 8-bit values, and (before this patch) colony was using these to store UTF-8 string data. This string representation discrepancy (between JavaScript's UCS-2 and Colony's UTF-8) meant that only ASCII strings were fully compatible between implementations — even characters in the BMP would cause a mismatch in string lengths. For example, `"€5".length` equalled 2 in V8 (`[0x20ac, 0x0035]`), but 4 in colony (`[0xe2, 0x82, 0xac, 0x35]`). There were a few options for remedying this discrepancy. Storing UTF-16 to Lua's string blocks? Keeping UTF-8 internally and splitting ["astral"](https://mathiasbynens.be/notes/javascript-unicode) characters in all the places needed? See the [original PR thread](#137) for some discussion. In the end something of a compromise/hybrid approach was chosen. ## How strings are handled in Colony now This pull request changes Colony's internal string representation to [CESU-8](http://en.wikipedia.org/wiki/CESU-8). This has the advantage of being as compact as UTF-8 for BMP characters, but also keeping surrogate pairs split as UTF-16 does. So we can trivially match JS's ability to extract `"👀"[1]` while also (in theory\*) handling I/O in the default UTF-8 encoding of node.js as a straight memcpy in common cases. This made it reasonably easy to implement the basic string methods/properties — those taking in a UCS-2 index convert it to a CESU-8 index ("JsToLua") before calling the Lua methods. Those needing to return a UCS-2 index can convert the opposite direction ("LuaToJs") from a Lua method result. Outside of string itself, code needed to be audited and in many cases corrected to make sure it distinguished between an "internal colony string" (which should be ± treated as opaque, unless accessed via JS methods) and externally needed representations (usually UTF-8). ## Miscellany This patch **depends on tessel/colony-compiler#32 for pre-converting string literals into the correct internal representation. This patch also adds toLowerCase/toUpperCase methods, which may not work quite right in case of strings that get longer on case change. (Personally I @natevw wonder if we could just implement these ASCII-only; this would basically let us get rid of the utf8proc dependency and its concomitant code tables.) This patch **does not** finish adding/auditing all the encoding handling required of the `Buffer` object. `Buffer.prototype.toString` should be mostly correct, but does not yet handle 'utf16le'. And `new Buffer(str, enc)` is still in pretty bad shape. IMO that work is important but belongs in an additional patch once this lands. This patch **does not** fix RegExp index values, which were completely missing throughout most of this work and so were deemed out of scope. LA LA LA CAN'T SEE THE PULL REQUEST THAT TRIES TO ADD THOSE LA LA IGNORING LA LA LA LA \* Right now no optimizations are done. One simple one was added (for ASCII↔︎CESU-8 conversions) and led to a slight performance *regression* running `npm test` and was backed out. Right now all string "lookups" are O(m) based on target index. This will especially kill code that loops character-by-character through large strings; also note that each access off `.length` is unmitigatedly O(n). \*\* ES6 adds a number of facilities to help with *full* (post-2.0) Unicode support e.g. `String.fromCodePoint` and an extended literal character escaping syntax. This patch focuses on basic correctness and does not add support for these new methods/syntaxes.

tcr added 10 commits June 2, 2014 17:59

Adds UCS2 length test

224a2c2

Adds UTF8 upper/lower functions and tests.

1a6a276

Adds UTF-8 char methods (toupper, tolower, encode).

b974e00

Renames UCS2 functions to reflect operation on strs.

6bc6922

Implements fromCharCode in terms of UTF-8.

2e71ef9

Updates encoding test to work with npm test

56e667e

Adds string indexing support for ucs2 characters.

2e45fb8

fromCharCode should truncate UTF-16 values

0af8c8f

Length is properly reported as UCS-2

b9bb7b9

Implements toUpperCase and toLowerCase

6e5cbf5

This was referenced Jul 28, 2014

Adds multiple argument support to fromCharCode. Fixes #320 #327

Merged

String.fromCharCode doesn't support characters > 0xFF #328

Closed

tcr added the Pending-UTF8 label Aug 11, 2014

Merge branch 'master' into nvw-utf8

176de88

Conflicts: src/colony/lua/colony-js.lua src/colony/lua_tm.c test/tm/test.c

natevw mentioned this pull request Sep 12, 2014

Support UCS-2 #222

Closed

9 tasks

fix building utf8 branch

9189dc6

natevw added 2 commits September 17, 2014 13:22

a couple important (failing!) string tests

cba0f4f

move str charAt code to __index so they both work [approximately] rig…

bf057e6

…ht again

[confused/broken] first pass at lookup, wrong output and such will ne…

9887b10

…ed fixup

natevw added 4 commits September 19, 2014 15:21

get string length working, yay!

039010c

refactor 16to8 so as to make __index work now too :-)

7ad00c9

fix replacement vs. noncharacter, add note about overall futility of …

b6ab4dd

…that effort anyway

add note re. fun

be45e6e

natevw added 24 commits October 2, 2014 17:05

add note about escape function

8fe9cc6

implement Unicode-capable escape

cdb158d

implement unescape for good measure

e84d32f

clean up tests a bit (comments -> actual tests)

f57dc90

give colony string stack helpers much clearer names

573b2fe

implement tm_str_from_utf8 (albeit with some fudging) — all encoding/…

04d6250

…casing tests now passing!

fix bogus memory management — we cannot return memory we probably jus…

10105a6

…t freed, but must point our caller to the copy Lua made

back away from regression in test/issues/issue-runtime-289.js caused by

79f6dd7

87c4055

increase odds of successful conversion of line noise to CESU-8. this …

53b4af3

…could significantly increase high-water memory mark, but since str_from_utf8 result is pushed onto lua stack the final value that is held on to will not have any unnecessary padding

Merge branch 'master' into tcr-utf8

f6fb42f

Conflicts: src/colony/lua/colony-js.lua

fix that mismerged itoa thing that has been plaguing this branch…

453c176

clarify buffer->string conversion, needed now that Buffer has gotten …

ae16724

…a bit more encoding savvy

fix generating string from utf8 buffer, start clarifying other conver…

7355b84

…sions (which should *not* be returning bytestring as if it were a sanitized internal colony string!)

fix up base64 variable names to reveal its correctness

f905ba2

add poison pill and rough start on antidote

4b98a40

finish implementing correct Buffer to ascii/binary strings, improving…

579b64c

… tests a bit

Merge branch 'master' into tcr-utf8

555fec1

remove unused tm_ucs2_str_length

cb9e98f

clean up ucs2_str_codeat (née charat)

049b815

move char_encode to ucs2 suite (inverse of str_codeat)

e39d4b3

reorganize "final" string handling functions to better clarify that s…

7660e94

…ome are for "internal colony string" (was called ucs2) and a few are for special UTF-8 stuff

"optimize" UTF-8 to/from conversions

57e046d

`time npm test` — before real 1m18.064s user 0m47.821s sys 0m10.795s `time npm test` — after real 1m21.550s user 0m47.743s sys 0m10.626s So, um, phooey?

Revert ""optimize" UTF-8 to/from conversions"

c1c0208

This reverts commit 57e046d.

get rid of now-pointless comment

ee1e867

natevw closed this Oct 6, 2014

natevw mentioned this pull request Oct 7, 2014

Strings now exposed externally as array of UCS-2 codepoints #542

Merged

kevinmehall mentioned this pull request Oct 7, 2014

[WIP] Adds function for encoding flags along with strings. #543

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[NRY] Implements Unicode (UCS-2) strings into VM #137

[NRY] Implements Unicode (UCS-2) strings into VM #137

Uh oh!

tcr commented Jun 14, 2014

Uh oh!

natevw commented Sep 12, 2014

Uh oh!

natevw commented Sep 15, 2014

Uh oh!

natevw commented Sep 18, 2014

Uh oh!

natevw commented Sep 18, 2014

Uh oh!

natevw commented Sep 18, 2014

Uh oh!

raffecat commented Sep 19, 2014

Uh oh!

natevw commented Sep 20, 2014

Uh oh!

natevw commented Oct 6, 2014

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

[NRY] Implements Unicode (UCS-2) strings into VM #137

[NRY] Implements Unicode (UCS-2) strings into VM #137

Uh oh!

Conversation

tcr commented Jun 14, 2014

Uh oh!

natevw commented Sep 12, 2014

Uh oh!

natevw commented Sep 15, 2014

Uh oh!

natevw commented Sep 18, 2014

Paragraph number background

Paragraph number plan

Paragraph number what next

Uh oh!

natevw commented Sep 18, 2014

Uh oh!

natevw commented Sep 18, 2014

Uh oh!

raffecat commented Sep 19, 2014

Uh oh!

natevw commented Sep 20, 2014

Uh oh!

natevw commented Oct 6, 2014

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants