-
-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unicode String #179
base: main
Are you sure you want to change the base?
Unicode String #179
Changes from 2 commits
9518166
0183a7a
56e1a46
151af1f
8101a70
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,97 @@ | ||||||
- Feature Name: unicode-string | ||||||
- Start Date: 2020-07-30 | ||||||
- RFC PR: (leave this empty) | ||||||
- Pony Issue: (leave this empty) | ||||||
|
||||||
# Summary | ||||||
|
||||||
Change the builtin String class to present a sequence of unicode codepoints as 4 byte numbers (U32) instead of a sequence of bytes. All conversions from a sequence of bytes to a String, or from a String to a sequence of bytes will require specifying a decoding or encoding respectively. The default encoding will be UTF-8, and Strings will be represented internally as a sequence of UTF-8 encoded codepoints. Provide encoders/ decoders for UTF-8, UTF-16, UTF-32, ASCII and ISO-8859-1 as part of the stdlib. Change the character literal to represent a single unicode codepoint and produce a UTF-32 codepoint value. | ||||||
|
||||||
# Motivation | ||||||
|
||||||
Unicode is an accepted standard for representing text in all languages used on planet earth. Modern programming languages should have first class support for unicode. | ||||||
|
||||||
# Detailed design | ||||||
|
||||||
Change the builtin String class to present of sequence of unicode codepoints as 4 byte numbers (U32). | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it would make sense to mention endianness here (which I assume will be the one of the machine for which the application is compiled) |
||||||
``` | ||||||
class val String is (Seq[U32] & Comparable[String box] & Stringable) | ||||||
``` | ||||||
Where String functions took parameters of type U8 or returned values of U8, they will take or return values of type U32. All indexes will be interpreted as codepoints. Only functions that manage String allocated memory will continue to refer to bytes. | ||||||
|
||||||
The following additional changes will be made to the String class: | ||||||
1. The size() function will return the number of unicode codepoints in the string. A new byte_size() function will return the number of bytes. | ||||||
1. The truncate() function will only support len parameter values less than the string size. | ||||||
1. The utf32() function will be removed. It is redundant, and returns pair that includes a byte count that is no longer needed. | ||||||
1. The insert_byte() function will be changed to insert_utf32() | ||||||
1. The values() function will return an iterator over the string codepoints. Same as runes(). | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I suggest that we rename There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Perhaps it's unclear to me about the allocations, but it does seem like the array() method has not been removed? If so, then one can still use that method for an allocation-less iteration method (and if necessary that type can be improved to |
||||||
1. A concat_bytes() function will be added to add a sequence of codepoints to the string from an iterator of bytes. | ||||||
|
||||||
Add traits Encoder and Decoder to the builtin package. Any function that produces a String from bytes, or produces bytes from a String must take an Encoder or Decoder as a parameter as is appropriate. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Adding these type names to the builtin package means that no Pony program will be able to use those type names for user-defined types. The words I suggest using the names |
||||||
``` | ||||||
trait val Encoder | ||||||
fun encode(codepoint: U32): (USize, U8, U8, U8, U8) | ||||||
|
||||||
trait val Decoder | ||||||
fun decode(bytes: U32): (U32, U8) | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What is the meaning of the elements of the returned tuple? Should this function be partial considering that not all byte-combinations are valid in every encoding? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe a better signature/approach for encoding/decoding would be to encode a whole String into an Array[U8] and vice versa for decoding. This might seem more work on the implementor of |
||||||
``` | ||||||
|
||||||
The ByteSeq type defined in std_stream.pony will be changed to remove String. | ||||||
``` | ||||||
type ByteSeq is (Array[U8] val) | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There's not much point in having this type alias if it is not a union. The purpose of this type is to define the types for which it is safe to call Given that |
||||||
``` | ||||||
Many functions that accept ByteSeq as a parameter will be changed to accept (String | ByteSeq) as a parameter. | ||||||
|
||||||
A new StringIter interface will be added to std_stream.pony | ||||||
``` | ||||||
interface val StringIter | ||||||
""" | ||||||
An iterable collection of String box. | ||||||
""" | ||||||
fun values(): Iterator[this->String box] | ||||||
``` | ||||||
|
||||||
Change Reader in buffered/reader.pony to add functions to read a codepoint and to read a String of a given number of codepoints. Update function line() to accept a decoder, and to return a pair with the line string, and the number of bytes consumed. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it's worth clarifying here if only |
||||||
|
||||||
Change Writer in buffered/writer.pony to accept Encoder parameters in the write() and writev() functions. | ||||||
|
||||||
Add a FileCharacters class in buffered/file_characters.pony that provides an iterator of characters in a file. The implementation will be similar to the FileLines class. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think you meant to put this in the
Suggested change
|
||||||
|
||||||
Change character literals so that a character literal can only represent a single unicode codepoint. The following would be valid character literals: | ||||||
``` | ||||||
let l1 = 'A' | ||||||
let l2 = '🐎' | ||||||
let l3 = '\x61' \\ only hex values up to 7F will be accepted. | ||||||
let l4 = '\u20AC' \\ Unicode codepoint '€' | ||||||
let l5 = '\U01F3A0 \\ Unicode codepoint '🎠' | ||||||
``` | ||||||
|
||||||
Change the Pony tutorial to reflect the changes to the String class and character literals. Also state clearly that UTF-8 is the only valid encoding for Pony source code. | ||||||
|
||||||
Not Supported: | ||||||
1. lower() and upper() for unicode characters. Should remove lower_in_place() and upper_in_place() because these conversion are not safe. | ||||||
1. The StdStream's (out, err, in) do not support specifying a character encoding. Ideally, the encoding for these streams would be set by the system at run-time based on the default encoding of the underlying system. For now, they will use utf-8 only. | ||||||
|
||||||
# How We Teach This | ||||||
|
||||||
This can be presented as a continuation of existing Pony patterns. | ||||||
|
||||||
The Pony tutorial will need to be updated to reflect these changes. | ||||||
|
||||||
# How We Test This | ||||||
|
||||||
Extend CI coverage to cover these changes and new features. | ||||||
|
||||||
# Drawbacks | ||||||
|
||||||
This is a change to the String API, and as such will break existing programs. String functions returning or taking as parameters U8 values now returning or taking U32 values will probably be the most common source of program breakage. Also, programs that used the existing unicode friendly functions in String will need to change, but they should be greatly simplified. | ||||||
|
||||||
# Alternatives | ||||||
|
||||||
1. Leave the String class as it is. This is likely to result in hidden errors in many programs that work with Strings as they will not work correctly with unicode data if they encounter it. It will also make it difficult to use Pony in settings where ASCII is not sufficient for local language (most non English speaking countries). | ||||||
1. A more complicated implementation with multiple String types capable of storing text data internally using different byte encodings. This approach would improve performance in situations where strings needed to be created from bytes of various encodings. The String type could be matched to the native byte encoding to eliminate any need for conversion. Such an implementation would add complexity to the String API as it would require use of a String trait with multiple implementations. It would also add considerable complexity to the String class implementation. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As I mentioned in our discussion in Zulip, I would prefer to see a version of this RFC that takes the encoding as a type parameter to all encoding-aware operations within the This does not necessarily mean you need multiple String classes or a String trait - just multiple encoding classes and an encoding trait (I believe you already have an encoder/decoder traits as part of this RFC, so that could potentially be used). |
||||||
1. Add support for multiple String types, where only one could be active at a time in the Pony run-time. This would be easier to implement compared to multiple concurrent String types, and would add no complexity to the String API. It would improve performance in locales where non-ASCII characters are more prevalent such as Asia. | ||||||
|
||||||
# Unresolved questions | ||||||
|
||||||
In lexer.cc I have incorporated a utf-8 decoding algorithm for character literals taken from https://bjoern.hoehrmann.de/utf-8/decoder/dfa/. Is this acceptable and how can credit be given? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not 100% clear to me what the internal representation of String will be. Will it remain a pointer and a size only that the bytes pointed to will be actual
U32
codepoints without encoding? Or will the actual underlying data remain encoded? This RFC should also state why this approach was chosen and the other ones dicarded.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get the sense that by "present a sequence..." he may just mean that it implements
Seq[U32]
and would support those APIs and that as he's said below the data would still be represented internally as UTF-8.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So to make sure I understand correctly, "hello" would be internally stored as five bytes:
104, 101, 108, 108, 111
, but if we asked for third character it would return0,0,0,108
(assuming endian)?and, when you stray out of ASCII into codepoints, for example:
Would still have a length of 4, and the third character would be l?