Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode String #179

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
97 changes: 97 additions & 0 deletions text/0000-unicode-string.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,97 @@
- Feature Name: unicode-string
- Start Date: 2020-07-30
- RFC PR: (leave this empty)
- Pony Issue: (leave this empty)

# Summary

Change the builtin String class to present a sequence of unicode codepoints as 4 byte numbers (U32) instead of a sequence of bytes. All conversions from a sequence of bytes to a String, or from a String to a sequence of bytes will require specifying a decoding or encoding respectively. The default encoding will be UTF-8, and Strings will be represented internally as a sequence of UTF-8 encoded codepoints. Provide encoders/ decoders for UTF-8, UTF-16, UTF-32, ASCII and ISO-8859-1 as part of the stdlib. Change the character literal to represent a single unicode codepoint and produce a UTF-32 codepoint value.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not 100% clear to me what the internal representation of String will be. Will it remain a pointer and a size only that the bytes pointed to will be actual U32 codepoints without encoding? Or will the actual underlying data remain encoded? This RFC should also state why this approach was chosen and the other ones dicarded.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get the sense that by "present a sequence..." he may just mean that it implements Seq[U32] and would support those APIs and that as he's said below the data would still be represented internally as UTF-8.

Copy link

@redvers redvers Mar 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So to make sure I understand correctly, "hello" would be internally stored as five bytes: 104, 101, 108, 108, 111, but if we asked for third character it would return 0,0,0,108 (assuming endian)?

and, when you stray out of ASCII into codepoints, for example:

"hӧle" => 104, **211, 167**, 108, 101

Would still have a length of 4, and the third character would be l?


# Motivation

Unicode is an accepted standard for representing text in all languages used on planet earth. Modern programming languages should have first class support for unicode.

# Detailed design

Change the builtin String class to present of sequence of unicode codepoints as 4 byte numbers (U32).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would make sense to mention endianness here (which I assume will be the one of the machine for which the application is compiled)

```
class val String is (Seq[U32] & Comparable[String box] & Stringable)
```
Where String functions took parameters of type U8 or returned values of U8, they will take or return values of type U32. All indexes will be interpreted as codepoints. Only functions that manage String allocated memory will continue to refer to bytes.

The following additional changes will be made to the String class:
1. The size() function will return the number of unicode codepoints in the string. A new byte_size() function will return the number of bytes.
1. The truncate() function will only support len parameter values less than the string size.
1. The utf32() function will be removed. It is redundant, and returns pair that includes a byte count that is no longer needed.
1. The insert_byte() function will be changed to insert_utf32()
1. The values() function will return an iterator over the string codepoints. Same as runes().
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest that we rename values to bytes instead of removing it entirely, so that the user can still iterate over bytes when they need to, without introducing an allocation to convert to Array[U8] val

Copy link

@jasoncarr0 jasoncarr0 Aug 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it's unclear to me about the allocations, but it does seem like the array() method has not been removed? If so, then one can still use that method for an allocation-less iteration method (and if necessary that type can be improved to fun box (): this->Array[U8] box). That said, I believe that bytes() is a much clearer name for that use case

1. A concat_bytes() function will be added to add a sequence of codepoints to the string from an iterator of bytes.

Add traits Encoder and Decoder to the builtin package. Any function that produces a String from bytes, or produces bytes from a String must take an Encoder or Decoder as a parameter as is appropriate.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding these type names to the builtin package means that no Pony program will be able to use those type names for user-defined types. The words Encoder and Decoder are incredibly general, and I can imagine many Pony programs or libraries using them to refer to things other than this string encoding use case.

I suggest using the names StringEncoder and StringDecoder instead.

```
trait val Encoder
fun encode(codepoint: U32): (USize, U8, U8, U8, U8)

trait val Decoder
fun decode(bytes: U32): (U32, U8)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the meaning of the elements of the returned tuple? Should this function be partial considering that not all byte-combinations are valid in every encoding?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a better signature/approach for encoding/decoding would be to encode a whole String into an Array[U8] and vice versa for decoding. This might seem more work on the implementor of Encoder/Decoder, but might allow to employ more tricks and be slightly more performant/convenient for some encodings. I also suspect, that implementing a special Encoder/Decoder is fairly rare.

```

The ByteSeq type defined in std_stream.pony will be changed to remove String.
```
type ByteSeq is (Array[U8] val)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's not much point in having this type alias if it is not a union.

The purpose of this type is to define the types for which it is safe to call cpointer and size to pass to FFI calls. In the standard library it is only safe to do those for standard library types, since we can't trust any user-defined types that implement the cpointer and size interface to give us a pointer that is valid for the given number of bytes.

Given that size is no longer the right method to call (it needs to call byte_size instead, I recommend just removing this union altogether and using (String | Array[U8] val) in places where it is currently used, and in each such place we would need to introduce a match statement that checks which type it is and uses size for Array and byte_size for String.

```
Many functions that accept ByteSeq as a parameter will be changed to accept (String | ByteSeq) as a parameter.

A new StringIter interface will be added to std_stream.pony
```
interface val StringIter
"""
An iterable collection of String box.
"""
fun values(): Iterator[this->String box]
```

Change Reader in buffered/reader.pony to add functions to read a codepoint and to read a String of a given number of codepoints. Update function line() to accept a decoder, and to return a pair with the line string, and the number of bytes consumed.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth clarifying here if only \r\n and \n will be considered line breaks or if support for unicode line breaks (like U+2028, U+2029) will be added


Change Writer in buffered/writer.pony to accept Encoder parameters in the write() and writev() functions.

Add a FileCharacters class in buffered/file_characters.pony that provides an iterator of characters in a file. The implementation will be similar to the FileLines class.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you meant to put this in the files package (where it can live alongside FileLines) rather than the buffered package.

Suggested change
Add a FileCharacters class in buffered/file_characters.pony that provides an iterator of characters in a file. The implementation will be similar to the FileLines class.
Add a FileCharacters class in files/file_characters.pony that provides an iterator of characters in a file. The implementation will be similar to the FileLines class.


Change character literals so that a character literal can only represent a single unicode codepoint. The following would be valid character literals:
```
let l1 = 'A'
let l2 = '🐎'
let l3 = '\x61' \\ only hex values up to 7F will be accepted.
let l4 = '\u20AC' \\ Unicode codepoint '€'
let l5 = '\U01F3A0 \\ Unicode codepoint '🎠'
```

Change the Pony tutorial to reflect the changes to the String class and character literals. Also state clearly that UTF-8 is the only valid encoding for Pony source code.

Not Supported:
1. lower() and upper() for unicode characters. Should remove lower_in_place() and upper_in_place() because these conversion are not safe.
1. The StdStream's (out, err, in) do not support specifying a character encoding. Ideally, the encoding for these streams would be set by the system at run-time based on the default encoding of the underlying system. For now, they will use utf-8 only.

# How We Teach This

This can be presented as a continuation of existing Pony patterns.

The Pony tutorial will need to be updated to reflect these changes.

# How We Test This

Extend CI coverage to cover these changes and new features.

# Drawbacks

This is a change to the String API, and as such will break existing programs. String functions returning or taking as parameters U8 values now returning or taking U32 values will probably be the most common source of program breakage. Also, programs that used the existing unicode friendly functions in String will need to change, but they should be greatly simplified.

# Alternatives

1. Leave the String class as it is. This is likely to result in hidden errors in many programs that work with Strings as they will not work correctly with unicode data if they encounter it. It will also make it difficult to use Pony in settings where ASCII is not sufficient for local language (most non English speaking countries).
1. A more complicated implementation with multiple String types capable of storing text data internally using different byte encodings. This approach would improve performance in situations where strings needed to be created from bytes of various encodings. The String type could be matched to the native byte encoding to eliminate any need for conversion. Such an implementation would add complexity to the String API as it would require use of a String trait with multiple implementations. It would also add considerable complexity to the String class implementation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned in our discussion in Zulip, I would prefer to see a version of this RFC that takes the encoding as a type parameter to all encoding-aware operations within the String class, using UTF-8 as the default (so that everything in this RFC works for the user as you describe by default), but so that non-UTF-8 use cases can also be supported.

This does not necessarily mean you need multiple String classes or a String trait - just multiple encoding classes and an encoding trait (I believe you already have an encoder/decoder traits as part of this RFC, so that could potentially be used).

1. Add support for multiple String types, where only one could be active at a time in the Pony run-time. This would be easier to implement compared to multiple concurrent String types, and would add no complexity to the String API. It would improve performance in locales where non-ASCII characters are more prevalent such as Asia.

# Unresolved questions

In lexer.cc I have incorporated a utf-8 decoding algorithm for character literals taken from https://bjoern.hoehrmann.de/utf-8/decoder/dfa/. Is this acceptable and how can credit be given?