Unicode String #179

Open · wants to merge 5 commits into `main`
`text/0000-unicode-string.md` (100 additions, 0 deletions)
- Feature Name: unicode-string
- Start Date: 2020-07-30
- RFC PR: (leave this empty)
- Pony Issue: (leave this empty)

# Summary

Change the builtin String class to present a sequence of Unicode codepoints as 4-byte numbers (U32) instead of a sequence of bytes. All conversions from a sequence of bytes to a String, or from a String to a sequence of bytes, will require specifying a decoding or encoding respectively. The default encoding will be UTF-8, and Strings will be represented internally as a sequence of UTF-8 encoded codepoints. Provide encoders/decoders for UTF-8, UTF-16, UTF-32, ASCII and ISO-8859-1 as part of the stdlib. Change the character literal to represent a single Unicode codepoint and produce a UTF-32 codepoint value.
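
A sketch of the proposed character-literal semantics (hypothetical; in today's Pony a multi-byte literal yields its bytes packed into an integer rather than a codepoint):

```pony
// Proposed: a character literal is exactly one Unicode codepoint
// and produces a UTF-32 (U32) value.
let horse: U32 = '🐎'     // U+1F40E
let euro: U32 = '\u20AC'  // U+20AC, '€'
```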
> **Contributor:** It is not 100% clear to me what the internal representation of String will be. Will it remain a pointer and a size, only that the bytes pointed to will be actual U32 codepoints without encoding? Or will the actual underlying data remain encoded? This RFC should also state why this approach was chosen and the other ones discarded.

> **Reply:** I get the sense that by "present a sequence..." he may just mean that it implements `Seq[U32]` and would support those APIs, and that, as he's said below, the data would still be represented internally as UTF-8.

> **@redvers** (Mar 4, 2021): So, to make sure I understand correctly: "hello" would be internally stored as five bytes (104, 101, 108, 108, 111), but if we asked for the third character it would return 0, 0, 0, 108 (assuming endianness)?
>
> And, when you stray out of ASCII into codepoints, for example:
>
> "hӧle" => 104, **211, 167**, 108, 101
>
> would it still have a length of 4, and the third character would be `l`?


# Motivation

Unicode is an accepted standard for representing text in all languages used on planet earth. Modern programming languages should have first-class support for Unicode.

# Detailed design

Change the builtin String class to present a sequence of Unicode codepoints as 4-byte numbers (U32).

> **Comment:** I think it would make sense to mention endianness here (which I assume will be that of the machine for which the application is compiled).

```
class val String is (Seq[U32] & Comparable[String box] & Stringable)
```
Where String functions took parameters of type U8 or returned values of type U8, they will now take or return values of type U32. All indexes will be interpreted as codepoint indexes. Only functions that manage a String's allocated memory will continue to refer to bytes.

The following additional changes will be made to the String class:
1. The `size()` function will return the number of Unicode codepoints in the string. A new `byte_size()` function will return the number of bytes.
1. The `truncate()` function will only support `len` parameter values less than the string size. The `len` parameter defines a number of codepoints. Add a `resize()` function to set the size of a string in bytes. This is needed for FFI calls where `_ptr` is populated by the function and the string length in bytes is provided by the function.
1. The `utf32()` function will be removed. It is redundant, and returns a pair that includes a byte count that is no longer needed.
1. The `insert_byte()` function will be changed to `insert_utf32()`.
1. The `values()` function will return an iterator over the string's codepoints, same as `runes()`. A new `bytes()` function will return an iterator over the string encoded as bytes. The `bytes()` function will take a `StringEncoder` parameter.
1. A `concat_bytes()` function will be added to add a sequence of codepoints to the string from an iterator of bytes.
1. Change `fun val array: Array[U8] val` to `fun array: this->Array[U8] box`. This is a superset of what we have now, and allows a readable "byte string" reference to a `String ref`, rather than requiring `val`.
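
Taken together, the changed functions might behave as follows (a sketch of the proposed API; `byte_size()` and a `U32`-yielding `values()` do not exist in today's stdlib):

```pony
actor Main
  new create(env: Env) =>
    let s: String = "hӧle" // 4 codepoints, 5 bytes in UTF-8
    env.out.print(s.size().string())      // proposed: codepoint count, 4
    env.out.print(s.byte_size().string()) // proposed: byte count, 5
    for cp in s.values() do
      // proposed: values() yields U32 codepoints, same as runes()
      env.out.print(cp.string())
    end
```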

Add to `Array[U8]` whatever methods are currently restricted by `A: U8` (e.g. `read_u8`) so that `Array[U8]` has everything that a "byte string" class needs.

Add traits `StringEncoder` and `StringDecoder` to the builtin package. Any function that produces a String from bytes, or produces bytes from a String, must take a `StringEncoder` or `StringDecoder` as a type parameter, as appropriate.
```
trait val StringEncoder
  fun encode(codepoint: U32): (USize, U8, U8, U8, U8)

trait val StringDecoder
  fun decode(bytes: U32): (U32, U8)
```

> **Contributor:** What is the meaning of the elements of the returned tuple? Should this function be partial, considering that not all byte combinations are valid in every encoding?

> **Contributor:** Maybe a better signature/approach for encoding/decoding would be to encode a whole String into an Array[U8], and vice versa for decoding. This might seem like more work for the implementor of Encoder/Decoder, but might allow employing more tricks and be slightly more performant/convenient for some encodings. I also suspect that implementing a special Encoder/Decoder is fairly rare.
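
As an illustration of the proposed trait, an ASCII encoder might look like the following. This is a guess: the RFC does not define the tuple layout, so it is assumed here to be (number of bytes used, followed by up to four encoded bytes):

```pony
// Hypothetical implementation of the proposed StringEncoder trait.
// The tuple is assumed to mean (byte count, b0, b1, b2, b3).
primitive ASCIIEncoder is StringEncoder
  fun encode(codepoint: U32): (USize, U8, U8, U8, U8) =>
    if codepoint < 0x80 then
      (1, codepoint.u8(), 0, 0, 0)
    else
      // The proposed signature has no error channel, so a
      // replacement character is substituted here.
      (1, '?', 0, 0, 0)
    end
```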

The ByteSeq type defined in std_stream.pony will be changed to remove String.
```
type ByteSeq is (Array[U8] val)
```

> **Member:** There's not much point in having this type alias if it is not a union.
>
> The purpose of this type is to define the types for which it is safe to call `cpointer` and `size` to pass to FFI calls. In the standard library it is only safe to do this for standard library types, since we can't trust any user-defined types that implement the `cpointer` and `size` interface to give us a pointer that is valid for the given number of bytes.
>
> Given that `size` is no longer the right method to call (it needs to be `byte_size` instead), I recommend removing this alias altogether and using `(String | Array[U8] val)` in places where it is currently used; in each such place we would need to introduce a match statement that checks which type it is and uses `size` for Array and `byte_size` for String.
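
The match statement the reviewer describes might look like this (a sketch; `byte_size()` is the function proposed earlier in this RFC):

```pony
// Sketch: dispatch on the concrete type to get a byte length,
// as suggested in the review comment above.
fun byte_length(data: (String | Array[U8] val)): USize =>
  match data
  | let s: String => s.byte_size()   // proposed in this RFC
  | let a: Array[U8] val => a.size()
  end
```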
Many functions that accept ByteSeq as a parameter will be changed to accept (String | ByteSeq) as a parameter.

A new StringIter interface will be added to std_stream.pony
```
interface val StringIter
  """
  An iterable collection of String box.
  """
  fun values(): Iterator[this->String box]
```

Change Reader in buffered/reader.pony to add functions to read a single codepoint and to read a String of a given number of codepoints. Update the line() function to accept a decoder and to return a pair containing the line string and the number of bytes consumed.

> **Comment:** I think it's worth clarifying here whether only `\r\n` and `\n` will be considered line breaks, or whether support for Unicode line breaks (like U+2028 and U+2029) will be added.


Change Writer in buffered/writer.pony to accept StringEncoder parameters in the write() and writev() functions.

Add a FileCharacters class in files/file_characters.pony that provides an iterator of characters in a file. The implementation will be similar to the FileLines class in files/file_lines.pony.

> **Comment:** Since the Unicode standard does not define what a character is, I think it's worth clarifying whether this means codepoints, grapheme clusters, or something else.


Change character literals so that a character literal can only represent a single unicode codepoint. The following would be valid character literals:
```
let l1 = 'A'
let l2 = '🐎'
let l3 = '\x61'     // only hex values up to 7F will be accepted
let l4 = '\u20AC'   // Unicode codepoint '€'
let l5 = '\U01F3A0' // Unicode codepoint '🎠'
```
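
Conversely, literals containing more than one codepoint would be rejected (an assumption about diagnostics; the RFC only states that a literal can represent a single codepoint):

```pony
// Under the proposal, these become compile-time errors:
// let bad1 = 'AB'    // two codepoints in one literal
// let bad2 = 'hi🐎'  // multiple codepoints
```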

Change the Pony tutorial to reflect the changes to the String class and character literals. Also state clearly that UTF-8 is the only valid encoding for Pony source code.

Not supported:
1. `lower()` and `upper()` for Unicode characters. `lower_in_place()` and `upper_in_place()` should be removed, because these conversions are not safe (Unicode case conversion can change a string's byte length, so it cannot always be done in place).
1. The StdStreams (`out`, `err`, `in`) do not support specifying a character encoding. Ideally, the encoding for these streams would be set at run-time based on the default encoding of the underlying system; for now, they will use UTF-8 only.

# How We Teach This

This can be presented as a continuation of existing Pony patterns.

The Pony tutorial will need to be updated to reflect these changes.

# How We Test This

Extend CI coverage to cover these changes and new features.

# Drawbacks

This is a change to the String API, and as such will break existing programs. The most common source of breakage will probably be String functions that previously took or returned U8 values and now take or return U32 values. Programs that used the existing Unicode-aware functions in String will also need to change, but they should be greatly simplified.

# Alternatives

1. Leave the String class as it is. This is likely to result in hidden errors in many programs that work with Strings, as they will not work correctly with Unicode data if they encounter it. It will also make it difficult to use Pony in settings where ASCII is not sufficient for the local language (most non-English-speaking countries).
1. A more complicated implementation with multiple String types capable of storing text data internally in different byte encodings. This approach would improve performance in situations where strings need to be created from bytes in various encodings: the String type could be matched to the native byte encoding to eliminate any need for conversion. Such an implementation would add complexity to the String API, as it would require a String trait with multiple implementations, and it would also add considerable complexity to the String class implementation.
> **Member:** As I mentioned in our discussion in Zulip, I would prefer to see a version of this RFC that takes the encoding as a type parameter to all encoding-aware operations within the String class, using UTF-8 as the default (so that everything in this RFC works for the user as you describe by default), but so that non-UTF-8 use cases can also be supported.
>
> This does not necessarily mean you need multiple String classes or a String trait; just multiple encoding classes and an encoding trait (I believe you already have encoder/decoder traits as part of this RFC, so those could potentially be used).

1. Add support for multiple String types, where only one could be active at a time in the Pony run-time. This would be easier to implement than multiple concurrent String types, and would add no complexity to the String API. It would improve performance in locales where non-ASCII characters are more prevalent, such as Asia.

# Unresolved questions

In lexer.cc I have incorporated a UTF-8 decoding algorithm for character literals taken from https://bjoern.hoehrmann.de/utf-8/decoder/dfa/. Is this acceptable, and how can credit be given?