
Unicode String #179

Open · wants to merge 5 commits into base: main
Conversation

rowland66

RFC for changes to the String class to better support Unicode.

Member

@jemc jemc left a comment

Thanks for putting this together. I have left some comments about things I would prefer to see changed.

Also, I suspect that this RFC may not be totally complete yet, in terms of tracking what all of the implications would be. I think it would be valuable for someone to start implementing this change (particularly the method signatures, even if the code for encoding/decoding is not fully robust yet) so that we could examine in more detail all of the things that will be affected by this change.

I also suspect there could be certain implications within this RFC that may be implicitly understood by the author, but may be not clear to the reader, resulting in confusion about exactly how these changes will look. The step I mentioned above of putting together a proof of concept level implementation could also help with this problem of potential misunderstandings of this very complex change.

1. The values() function will return an iterator over the string codepoints. Same as runes().
1. A concat_bytes() function will be added to append a sequence of codepoints to the string, decoded from an iterator of bytes.

Add traits Encoder and Decoder to the builtin package. Any function that produces a String from bytes, or produces bytes from a String, must take an Encoder or Decoder as a parameter as appropriate.
Member

Adding these type names to the builtin package means that no Pony program will be able to use those type names for user-defined types. The words Encoder and Decoder are incredibly general, and I can imagine many Pony programs or libraries using them to refer to things other than this string encoding use case.

I suggest using the names StringEncoder and StringDecoder instead.


The ByteSeq type defined in std_stream.pony will be changed to remove String.
```
type ByteSeq is (Array[U8] val)
```
Member

There's not much point in having this type alias if it is not a union.

The purpose of this type is to define the types for which it is safe to call cpointer and size to pass to FFI calls. In the standard library it is only safe to do those for standard library types, since we can't trust any user-defined types that implement the cpointer and size interface to give us a pointer that is valid for the given number of bytes.

Given that size is no longer the right method to call (it needs to call byte_size instead), I recommend just removing this union altogether and using (String | Array[U8] val) in the places where it is currently used; in each such place we would need to introduce a match statement that checks which type it is and uses size for Array and byte_size for String.


Change Writer in buffered/writer.pony to accept Encoder parameters in the write() and writev() functions.

Add a FileCharacters class in buffered/file_characters.pony that provides an iterator of characters in a file. The implementation will be similar to the FileLines class.
Member

I think you meant to put this in the files package (where it can live alongside FileLines) rather than the buffered package.

Suggested change:
- Add a FileCharacters class in buffered/file_characters.pony that provides an iterator of characters in a file. The implementation will be similar to the FileLines class.
+ Add a FileCharacters class in files/file_characters.pony that provides an iterator of characters in a file. The implementation will be similar to the FileLines class.

1. The truncate() function will only support len parameter values less than the string size.
1. The utf32() function will be removed. It is redundant, and returns a pair that includes a byte count that is no longer needed.
1. The insert_byte() function will be changed to insert_utf32()
1. The values() function will return an iterator over the string codepoints. Same as runes().
Member

I suggest that we rename values to bytes instead of removing it entirely, so that the user can still iterate over bytes when they need to, without introducing an allocation to convert to Array[U8] val

@jasoncarr0 jasoncarr0 Aug 5, 2020

Perhaps I'm unclear about the allocations, but it does seem like the array() method has not been removed? If so, then one can still use that method for allocation-less iteration (and if necessary that type can be improved to fun box (): this->Array[U8] box). That said, I believe that bytes() is a much clearer name for that use case.

# Alternatives

1. Leave the String class as it is. This is likely to result in hidden errors in many programs that work with Strings as they will not work correctly with unicode data if they encounter it. It will also make it difficult to use Pony in settings where ASCII is not sufficient for local language (most non English speaking countries).
1. A more complicated implementation with multiple String types capable of storing text data internally using different byte encodings. This approach would improve performance in situations where strings needed to be created from bytes of various encodings. The String type could be matched to the native byte encoding to eliminate any need for conversion. Such an implementation would add complexity to the String API as it would require use of a String trait with multiple implementations. It would also add considerable complexity to the String class implementation.
Member

As I mentioned in our discussion in Zulip, I would prefer to see a version of this RFC that takes the encoding as a type parameter to all encoding-aware operations within the String class, using UTF-8 as the default (so that everything in this RFC works for the user as you describe by default), but so that non-UTF-8 use cases can also be supported.

This does not necessarily mean you need multiple String classes or a String trait - just multiple encoding classes and an encoding trait (I believe you already have an encoder/decoder traits as part of this RFC, so that could potentially be used).

Contributor

@mfelsche mfelsche left a comment

Thanks for creating this RFC!

I left a few comments.

The main concern here is that this very substantial change to the language is missing some motivational comments. Especially: What will be improved? What will this change make possible that isn't possible yet?

I want to see the costs of implementing this RFC justified.


# Summary

Change the builtin String class to present a sequence of unicode codepoints as 4 byte numbers (U32) instead of a sequence of bytes. All conversions from a sequence of bytes to a String, or from a String to a sequence of bytes, will require specifying a decoding or encoding respectively. The default encoding will be UTF-8, and Strings will be represented internally as a sequence of UTF-8 encoded codepoints. Provide encoders/decoders for UTF-8, UTF-16, UTF-32, ASCII and ISO-8859-1 as part of the stdlib. Change the character literal to represent a single unicode codepoint and produce a UTF-32 codepoint value.
Contributor

It is not 100% clear to me what the internal representation of String will be. Will it remain just a pointer and a size, with the bytes pointed to being actual U32 codepoints without encoding? Or will the actual underlying data remain encoded? This RFC should also state why this approach was chosen and the other ones discarded.


I get the sense that by "present a sequence..." he may just mean that it implements Seq[U32] and would support those APIs and that as he's said below the data would still be represented internally as UTF-8.

@redvers redvers Mar 4, 2021

So to make sure I understand correctly, "hello" would be internally stored as five bytes: 104, 101, 108, 108, 111, but if we asked for the third character it would return 0,0,0,108 (depending on endianness)?

and, when you stray out of ASCII into codepoints, for example:

"hӧle" => 104, **211, 167**, 108, 101

Would still have a length of 4, and the third character would be l?
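The byte/codepoint behavior being described matches standard UTF-8, which can be checked in Python (illustrative only; not Pony code):

```python
# "hӧle": ӧ is U+04E7, which UTF-8 encodes as two bytes (0xD3 0xA7).
s = "h\u04e7le"
b = s.encode("utf-8")
assert list(b) == [104, 211, 167, 108, 101]  # 5 bytes in the buffer
assert len(s) == 4                            # but only 4 codepoints
assert s[2] == "l"                            # the third character is 'l'
```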

```
fun encode(codepoint: U32): (USize, U8, U8, U8, U8)

trait val Decoder
  fun decode(bytes: U32): (U32, U8)
```
Contributor

What is the meaning of the elements of the returned tuple? Should this function be partial considering that not all byte-combinations are valid in every encoding?
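The RFC text does not spell out the tuple semantics; one plausible reading (an assumption, not confirmed here) is that encode returns the number of bytes used followed by up to four bytes, zero-padded. A Python sketch of that reading for UTF-8 (utf8_encode is a hypothetical name):

```python
def utf8_encode(cp):
    """Encode a codepoint as (byte_count, b1, b2, b3, b4), zero-padding
    unused slots -- mirroring the presumed (USize, U8, U8, U8, U8) tuple."""
    b = chr(cp).encode("utf-8")
    padded = list(b) + [0] * (4 - len(b))
    return (len(b), *padded)

assert utf8_encode(ord("A")) == (1, 65, 0, 0, 0)          # 1-byte codepoint
assert utf8_encode(0x4E7) == (2, 0xD3, 0xA7, 0, 0)        # 2-byte codepoint
```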

Contributor

Maybe a better signature/approach for encoding/decoding would be to encode a whole String into an Array[U8], and vice versa for decoding. This might seem like more work for the implementor of Encoder/Decoder, but it might allow employing more tricks and be slightly more performant/convenient for some encodings. I also suspect that implementing a special Encoder/Decoder is fairly rare.

@rowland66
Author

rowland66 commented Aug 4, 2020 via email

@rowland66
Author

rowland66 commented Aug 4, 2020 via email

@jemc
Member

jemc commented Aug 4, 2020

I agree that a sample implementation would help. I have 70% of an
implementation done. How do I go about sharing that with the group? Create
a pull request?

You don't necessarily need to make a pull request - you can just make sure your branch is public on GitHub in your fork of the ponyc repo, then share the link here so we can go inspect your branch as we read the RFC.

@jemc
Member

jemc commented Aug 4, 2020

Regarding some of your other responses - I'm a bit confused currently, but I'll wait and look at your sample implementation so I can be sure I understand your comments correctly before I respond.

@rowland66
Author

rowland66 commented Aug 4, 2020 via email

@jasoncarr0

jasoncarr0 commented Aug 5, 2020

I'm worried about the encoding, and it wasn't clear to me if the proposal would leave that unchanged and merely implement the Seq trait as though it were a sequence of U32 elements, or make the representation more pervasive.

With a representation as a subset of byte sequences we can have Array[U8] val -> String? and fun bytes(): this->Array[U8] box which would remain safe and allocation-less, enabling more interop and smoother interaction with existing code.

@rowland66
Author

rowland66 commented Aug 7, 2020 via email

@SeanTAllen
Member

Rowland has updated the RFC in response to comments.

@jemc @jasoncarr0 @mfelsche can you have another look?

@jemc
Member

jemc commented Aug 18, 2020

Some notes from the post-sync-call on suggestions to improve the performance and flexibility of various use cases, based on some limited review of Rowland's branch so far:

  • add embed _array: Array[U8] as the field within String, instead of the three fields we have now that happen to be the same as Array[U8]

    • this makes it so that string to array conversions can be zero allocation - every string already contains an array that it can share as a box/val. See the next bullet point for more info;
  • have fun val array(): Array[U8] val become fun array(): this->Array[U8] box

    • it is a superset of what we have now, and allows a readable "byte string" reference to a String ref, rather than requiring val.
  • add to Array[U8] whatever methods we need that are restricted by A: U8 (e.g. read_u8) to make Array[U8] have everything that a "byte string" class needs

    • this will mitigate the usability issues that will result from bytewise operations disappearing from String.
  • encoder as a type parameter rather than a runtime parameter

    • also make the method use iftype E <: UTF8StringEncoder instead of if encoder is UTF8StringEncoder
    • these together make it so that there is zero cost for detecting the encoder at runtime - it is known at compile time.

@jasoncarr0

jasoncarr0 commented Aug 18, 2020

* encoder as a type parameter rather than a runtime parameter
  
  * also make the method use `iftype E <: UTF8StringEncoder` instead of `if encoder is UTF8StringEncoder`
  * these together make it so that there is zero cost for detecting the encoder at runtime - it is known at compile time.

Though, we should perhaps verify that there is a runtime cost being paid here without the generics. I suspect given the size and structure of the method that LLVM would have an easy time eliminating it.

@rowland66
Author

rowland66 commented Aug 20, 2020 via email

@jasoncarr0

jasoncarr0 commented Aug 20, 2020

@rowland66 The reason that these changes are being mentioned now is the change to the byte-wise API in String. Efficient access to the underlying array and byte array methods is one way to allow us to access these lost APIs more effectively, without compromising on performance. It is true that I certainly don't know why the existing String has not been implemented that way.

Hence why these suggestions are intended to be part of this RFC (but note that implementation details such as the embed could be implemented in multiple PRs). Anyone else feel free to chime in if I'm not understanding correctly.


# Detailed design

Change the builtin String class to present a sequence of unicode codepoints as 4 byte numbers (U32).


I think it would make sense to mention endianness here (which I assume will be that of the machine for which the application is compiled)

```
fun values(): Iterator[this->String box]
```

Change Reader in buffered/reader.pony to add functions to read a codepoint and to read a String of a given number of codepoints. Update function line() to accept a decoder, and to return a pair with the line string, and the number of bytes consumed.


I think it's worth clarifying here if only \r\n and \n will be considered line breaks or if support for unicode line breaks (like U+2028, U+2029) will be added
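For comparison (illustrative only, not a statement about Pony): Python's splitlines recognizes the Unicode line breaks mentioned here, while splitting on a plain \n does not:

```python
# U+2028 is LINE SEPARATOR, U+2029 is PARAGRAPH SEPARATOR.
text = "a\nb\u2028c\u2029d"
assert text.splitlines() == ["a", "b", "c", "d"]   # Unicode-aware splitting
assert text.split("\n") == ["a", "b\u2028c\u2029d"]  # \n-only splitting
```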


Change Writer in buffered/writer.pony to accept StringEncoder parameters in the write() and writev() functions.

Add a FileCharacters class in files/file_characters.pony that provides an iterator of characters in a file. The implementation will be similar to the FileLines class in files/file_lines.pony.


Since the Unicode standard does not define what a character is, I think it's worth clarifying whether it means codepoints, grapheme clusters or something else

@rowland66
Author

rowland66 commented Oct 21, 2020 via email

@jasoncarr0

jasoncarr0 commented Dec 4, 2020

I'd like to leave a note here that, after looking over the PR (which has not yet implemented this portion), I believe my suggested change for the capability of the array() method causes very confusing behavior and unsafety in some edge cases with the expected implementation.

This can be fixed by using an embedded array, or two separate methods with restricted types and necessary warnings about confusing behavior.

@jasoncarr0

jasoncarr0 commented Dec 8, 2020

We talked about this over Sync. The plan we went forward with is keeping the array method as val only. This will continue the behavior of not copying for UTF8Encoder.

The low-level access can be provided by a method with the suggested type signature:
fun current_byte_buffer(): this->Array[U8] box
which always constructs the underlying byte array without regard to encoding. This will share the buffer with the String, but construct a new read-only Array object pointing to it.

The documentation for the method will indicate that the array will track writes until the String's buffer is reallocated due to length changes.

@rowland66
Author

rowland66 commented Dec 9, 2020 via email

@jasoncarr0

jasoncarr0 commented Dec 9, 2020

I did not understand what "this->Array[U8] box" would mean

this->Array[U8] box is valid syntax for the return type.
What that means can be factored into
this->(Array[U8] box)
i.e. it's the type that the current receiver views box as. Specifically this means that if the receiver is box or ref, it is indeed just box. But if the receiver is val then it means val. This is still useful given the other array method as it doesn't copy for any encoding.

In this case we want to ensure regardless of the receiver that it is read-only (so as to not break the string invariants), but we can be a bit stronger for val than just using box

Also, how does Pony garbage collection work for this type of thing? It seems
that we now have 2 references to the same memory. How does the Pony runtime
know that and not collect the underlying array bytes while they are still
referenced?

As I understand it, the allocation as a whole is still garbage collected (and it needs to be read by the garbage collector for general arrays) and deallocation doesn't occur until neither the String nor Array are pointing to it.

@rowland66
Author

rowland66 commented Dec 11, 2020 via email

@jasoncarr0

jasoncarr0 commented Dec 11, 2020

In this case you can use a bare recover without specifying a capability. I think this is a bit of a bug that complex capabilities cannot be written as the destination cap of a recover block (although it's only needed because of pointers here rather than actual fields). That is:

  fun current_byte_buffer(): this->Array[U8] box =>
    recover Array[U8].from_cpointer(_ptr._unsafe(), _size, _alloc) end

Strictly speaking you should also be able to recover to val and cover it, but it's not actually "correct" for most of the cases

@jasoncarr0

jasoncarr0 commented Dec 11, 2020

Finally got my ponyc env working again. The field access checks for recover blocks are still a tiny bit too strict, they should drop down to tag instead of yelling in this case, so you actually need to do:

  fun current_byte_buffer(): this->Array[U8] box =>
    let ptr = _ptr
    recover Array[U8].from_cpointer(ptr._unsafe(), _size, _alloc) end

But this code builds for me

Base automatically changed from master to main February 8, 2021 22:15
@esoterra

esoterra commented Feb 28, 2021

Hello, just realized this PR existed and am going to carry my feedback over from the issue I originally commented on. I provided a quick overview of some differences between what was suggested there and what is done in Rust and the benefits of those design choices.

My understanding is that this RFC encodes all strings as UTF-8 byte arrays and only allows users to operate on them using their character indices. Assuming that is correct and all reads and mutations of strings are performed on the byte representation, then I am concerned there will be issues with offering efficient string operations.

Specifically, on an arbitrary UTF-8 byte sequence, computing the byte index from the character index is O(n).
To my knowledge, the string representations used by existing languages tend to avoid this issue either by not presenting the user with character indexing (e.g. Rust, see my prior response) or by encoding strings so that indexing by character can be done in constant time (e.g. Python, see this).

What is the intended approach of this RFC and how does it avoid or mitigate this issue?
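The O(n) cost comes from having to skip UTF-8 continuation bytes while scanning; a minimal Python sketch of that scan (byte_index is a hypothetical helper, not part of any proposal here):

```python
def byte_index(utf8, char_index):
    """Linear scan: return the byte offset of the char_index-th codepoint.
    Continuation bytes match the pattern 0b10xxxxxx and are skipped."""
    seen = 0
    for i, b in enumerate(utf8):
        if b & 0xC0 != 0x80:          # start byte of a new codepoint
            if seen == char_index:
                return i
            seen += 1
    raise IndexError(char_index)

data = "h\u04e7le".encode("utf-8")    # 5 bytes, 4 codepoints
assert byte_index(data, 2) == 3       # 'l' starts at byte offset 3
```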

@rowland66
Author

Hi Kyle,

Thank you for taking the time to look at this. I am the author of the RFC, and have worked on the implementation. As you mention, the biggest downside to using a variable-size encoding for unicode is that index operations are slow, and the implementation of this RFC has no particularly good solution. Pony does have an Array class that has many of the same functions as String, so for performance critical code, one option is to use Arrays to hold String bytes. Also, some of the String functions take positive and negative index values. Negative indexes are relative to the end of the String, so in some cases index access using a negative index would be much faster.

How different programming languages deal with Strings seems to be a tradeoff between speed, efficiency and correctness. As I understand it, Rust strings are UTF-8 encoded internally, and there are functions to access the underlying bytes, but they are prone to panics if used incorrectly. So Rust seems to be leaning toward correctness, but provides a loaded gun in case you want to take it and shoot yourself. Python seems to favor efficiency and speed with the ability to choose between 1, 2 and 4 bytes per character strings. I assume that adding certain unicode characters to a 1 or 2 byte per character string results in some type of runtime error. So it is easier to write programs that fail at runtime unless you use 4 byte per character strings, but that is pretty inefficient in most cases. The Pony implementation in this RFC favors correctness (number 1 on the list in the Pony Philosophy).

In the RFC design, we have tried to leave room for alternate String implementations that might be better suited for some situations. The new String design also tries to remain as compatible as possible with the existing Pony String class. The existing Pony String has a mix of functions, some of which assume 1 byte per character and others that support a UTF-8 encoding. This was a confusing API that left a lot of room for programming error.

Hope this makes sense, and I am happy to discuss further if you have some ideas for improvement.

@esoterra

esoterra commented Mar 1, 2021

Hi Rowland,

Thanks for getting back to me quickly, I have a few specific comments below about things you mentioned and have tried to better outline which Rust features would best inform a Pony Unicode String implementation.
Hopefully this will help clarify why I think Rust and Python act as archetypes for their chosen string strategy and why I think they are a good reference for designing a new string API.

RE: Negative Indices

That is an interesting point about negative indices. If reverse unicode segmentation is used, you can potentially traverse fewer bytes.

RE: Rust

they are prone to panics if used incorrectly

This is technically true of the Rust design, but there are panic-free options available and the ones that can panic are meant to be used with something like char_indices which ensure that you are at a valid boundary. The idiomatic ways of using Rust's String API don't meaningfully sacrifice correctness.

Additionally, the fact that it does allow panic using &a[i..j] is not a feature I am recommending Pony replicate.
The rest of the API and its tradeoffs can exist entirely independent of it because a.get(i..j) provides the same functionality.

RE: Python

I assume that adding certain unicode characters to a 1 or 2 byte per character string results in some type of runtime error

This is not correct because strings in python are immutable and you cannot create such a scenario. Python can more correctly be said to emphasize correctness over speed/efficiency in the same way that concatenating Java, JavaScript, or C# strings directly does.

i.e. Python is in the "shared immutable strings" family and is optimized for that case

What would it look like to "follow Rust's example"?

Where Python offers shared immutable strings and is a good example for how to do so, Rust offers owned mutable append-only strings with an efficient and clean (ignoring the aforementioned) API.

In Pony terms, as I understand them, implementing an approach similar to Rust's would involve the following characteristics.

  • Internally represent strings as UTF-8
  • Represent characters using a 4-byte wide type
  • Make strings character iterable
  • Make strings (character, byte-index) iterable
  • Support byte-index based reading operations using partial functions which check whether the indices are valid (this check is quite inexpensive)
  • Treat the string length as its byte length
  • Require users to use the char and char_indices-like iterators for cases where they want to think in terms of characters, explicitly taking on the cost of iterating over the string.
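The char_indices-style iteration in the last bullet can be sketched in Python (char_indices here is a hypothetical helper mirroring Rust's str::char_indices; Python itself does not expose byte offsets):

```python
def char_indices(s):
    """Yield (byte_index, char) pairs over a string, like Rust's
    str::char_indices, by accumulating each character's UTF-8 width."""
    offset = 0
    for ch in s:
        yield (offset, ch)
        offset += len(ch.encode("utf-8"))

# "hӧle": the 2-byte ӧ shifts the byte index of every later character.
assert list(char_indices("h\u04e7le")) == [(0, "h"), (1, "\u04e7"), (3, "l"), (4, "e")]
```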

@jemc
Member

jemc commented Mar 2, 2021

@Kylebrown9 - if I understand your proposal correctly, it sounds like you are recommending something like: Keep byte-based operations as the first-class operations, but also have character-based operations.

This seems similar to what we have in Pony today (perhaps with the differences of: preference for UTF8 rather than UTF32 and perhaps with support for more character operations), so I'd be curious to hear you compare your suggested approach to Pony's current approach, rather than the approach proposed in this PR.

@esoterra

esoterra commented Mar 2, 2021

In Rust, making a string from bytes is effectively partial because not all byte sequences are valid UTF-8 strings.
Additionally, operations on strings preserve validity of the string as UTF-8, if an operation would violate this it instead fails.
This means that all Rust strings can be assumed to be valid UTF-8 and as I understand it is not something Pony guarantees.
Python actually makes a similar guarantee in a different way by keeping bytes and str distinct, converting between them can fail and operations that make new strings guarantee unicode validity.
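The partiality described here can be observed directly in Python, whose bytes-to-str conversion fails on invalid UTF-8 rather than producing a corrupt string:

```python
# A valid UTF-8 sequence decodes; an invalid one raises UnicodeDecodeError.
assert bytes([104, 211, 167, 108, 101]).decode("utf-8") == "h\u04e7le"

failed = False
try:
    bytes([0xFF, 0xFE]).decode("utf-8")   # 0xFF can never appear in UTF-8
except UnicodeDecodeError:
    failed = True
assert failed
```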

As a change to Pony today

  1. Making a string from a sequence of U8 would become partial, only succeeding when the resulting string would be valid utf-8.

  2. Operations that use byte indices (e.g. apply, trim, chop, delete, substring) would become partial, failing if they index somewhere that is not a valid codepoint boundary.

  3. Operations that read a U8 as a character (e.g. apply, pop, shift) could either

    • be made explicitly byte-specific where possible (e.g. pop_byte),
    • be converted to versions that produce U32 or a specific character type,
    • or be removed.
  4. Operations that provide a U8 as a character (e.g. update, unshift, push) could either

    • be made explicitly byte-specific where possible (e.g. unshift_byte), although I wouldn't recommend this most of the time,
    • be converted to versions that take in a U32 or special character type,
    • or be removed.
  5. The runes method would be kept, although I'd be tempted to call it chars, and another iterator would be added that also tells you the byte-index of the current character.

Edit: If this was going to follow the Rust approach, then a different string type(s) could/would be added to handle platform specific strings (see OsString) which may not satisfy these invariants or have other requirements.

@esoterra

esoterra commented Mar 2, 2021

Also, I realize I've ended up proposing and defending a specific prescription for how to do strings in Pony.
That wasn't really my intention on arrival, I just ended up doing so in clarifying that I think it's important
to offer a string API with always-valid unicode strings that offers indexing which matches the stored representation.

The language comparisons and proposal outlines I've made are just intended to demonstrate what
I mean by those statements and ways they can be achieved.

@jasoncarr0

jasoncarr0 commented Mar 2, 2021

For byte reading in this RFC, we have the current_byte_buffer method (EDIT: actually it still seems to be called array here). Do you think that's clear enough for you to find? And does it make sense to you to have byte-reading but not byte-writing?

@esoterra

esoterra commented Mar 2, 2021

The only two things I would ask of a byte-based method on a string is that

  1. it has byte or bytes in the name
  2. and it cannot cause the string to become invalid.

There is an argument to be made that some methods might confuse users or encourage them to ignore the fact
that a string is made of characters and not bytes, and in those cases it may make sense to not include them.
Each method and name should be treated on a case-by-case basis and I don't necessarily have strong views
on what the answer is because I don't know all of the existing idioms and use cases out there that they may impact.

Note: I don't consider a method to be "byte-based" if it does not take in or return U8 values, even if it uses byte-indices.
i.e. substring and chop would not be byte-based as long as they return strings and not bytes

Edit: Forgot to directly answer your question. What you described meets the requirements I would set for it and sounds fine but others may have opinions on those choices.

@rowland66
Author

rowland66 commented Mar 3, 2021

Hi Kyle,

There is a lot to digest here. Let me start with your list from above of String features in Rust and how they compare to Pony post adoption of this RFC.

Internally represent strings as UTF-8

Same

Represent characters using a 4-byte wide type

Same. Pony has no Character type, but String functions return unicode codepoints as U32

Make strings character iterable

Same

Make strings (character, byte-index) iterable

No. As mentioned above, String has an array function that is very efficient assuming you want the String bytes UTF-8 encoded. However, the resulting Array is only safe to use if you know that your String contains only ASCII characters. The resulting Array is also a val, so it cannot be modified. Since the Array shares the memory used by the String, modifying the Array might corrupt the String.

Support byte-index based reading operations using partial functions which check whether the indices are valid (this check is quite inexpensive)

This would add a bunch of partial functions to String, which is undesirable. Byte index values are never safe to change (increment or decrement), so their value is limited to remembering a position in a String. They could be useful if you know that your String is ASCII, but then you can use the array function and work with the String as an array of bytes. There is no right or wrong answer here. It's a decision on whether adding functions that are likely to be misused, resulting in runtime errors, is worth some added utility provided by correct use of the functions.

Treat the string length as its byte length

There is already a byte_size() function that serves this purpose. Knowing a String's size in codepoints is more often what is needed, so I think that the size() behavior is correct.
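The distinction between the two sizes is easy to demonstrate. Python stands in for Pony here purely for illustration; the comments map the two quantities onto the proposed size() and byte_size() names.

```python
# A non-ASCII string: its codepoint count and UTF-8 byte count differ.
s = "héllo"                         # 'é' encodes as two bytes in UTF-8
print(len(s))                       # codepoint count (Pony's size()): 5
print(len(s.encode("utf-8")))       # byte count (Pony's byte_size()): 6
```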

Require users to use the char and char_indices-like iterators for cases where they want to think in terms of characters, explicitly taking on the cost of iterating over the string.

I think that it makes sense that String be a sequence of Characters, and Characters are no longer single bytes. For applications that want to work with bytes, Pony has the Array class. I think that keeping Strings and Arrays of bytes separate and distinct results in less error-prone code, and I value that highly.

Something worth considering is whether Pony should support alternate internal presentations of String data. The intent is to leave that possibility open for the future.

Another thing to consider is optimizations to the current String class. For example, a String could maintain a boolean indicator of whether it contains any multi-byte encoded characters. In cases where it does not, character index based operations could be much faster. This does add additional data to every String instance, so such a feature would need to be carefully considered.
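A minimal sketch of that optimization, in Python for illustration (the class and method names here are hypothetical, not part of the RFC):

```python
class FlaggedString:
    """Cache whether the UTF-8 bytes are all ASCII; if they are,
    codepoint indexing degenerates to a direct O(1) byte read."""

    def __init__(self, data: bytes):
        self.data = data
        self.ascii_only = all(b < 0x80 for b in data)

    def codepoint_at(self, i: int) -> int:
        if self.ascii_only:
            return self.data[i]                    # O(1) fast path
        # Slow path: decode and scan from the start (illustrative only;
        # a real implementation would walk the UTF-8 bytes directly).
        return ord(self.data.decode("utf-8")[i])

print(FlaggedString(b"pony").codepoint_at(1))          # 111 ('o')
print(FlaggedString("é".encode("utf-8")).ascii_only)   # False
```

The cost is one extra boolean per String instance, plus keeping the flag current on every mutation.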

@esoterra

esoterra commented Mar 3, 2021

Hi Rowland,

I think so far it sounds like we're both in agreement about my priority A that
Pony should "offer a string API with always-valid unicode strings".

It seems like we disagree about priority B, that it should use "indexing which matches the stored representation".
My motivation for priority B could be rephrased as "indexing operations should be constant time"
and I think it is borne out by (1) the principle of least surprise and (2) the goal of performance.

1. Principle of Least Surprise

Popular programming languages usually offer constant time indexing for strings and I believe users now expect this.
When users write for-loops that iterate over the index range or attempt to perform a series of random accesses because they don't realize this choice was made, they will encounter a performance cost that vastly exceeds their expectation, ergo surprise.

2. Performance

I think there are also other use cases where code that could have worked well on the existing string design or other options we have now will be meaningfully less performant.
e.g. a compiler which tokenizes a string (which may not be all ascii) and then refers to spans of the string using character indices throughout the rest of the compiler. Whenever they want to read that span, they will take on an extra linear time cost and "remembering" the byte-index they're actually referring to is actually quite useful.
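The lexer scenario can be sketched as follows (Python stands in for Pony; the byte offsets in this example are worked out by hand):

```python
# A token span stored as byte offsets into the UTF-8 buffer can be
# re-read in time proportional to the token, not to its position.
src = "let π = 3".encode("utf-8")    # 'π' (U+03C0) occupies 2 bytes
start, end = 4, 6                    # byte span of the 'π' token
print(src[start:end].decode("utf-8"))    # prints: π
```

With character indices instead, recovering the same span requires decoding from the start of the buffer each time.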

@jasoncarr0

jasoncarr0 commented Mar 3, 2021

Honestly for 1. I'd be surprised to see constant-time indexing for anything other than ascii.
But for 2. Don't substrings cover this use-case for relatively low space cost? Obviously this costs a bit less in e.g. Rust or C than in Pony. It would be nice to have a raw slice type in Pony for this, but otherwise I would just use a string slice in Rust (i.e. a pointer,length pair)

@rowland66

@Kylebrown9 For 1, I agree with @jasoncarr0: why would you expect constant-time indexing for variable-size elements? Also, there is another way to look at the principle of least surprise: I should not be surprised when my code fails because it encounters strings containing multi-byte characters. Having more functions in String that use byte indexes increases the likelihood of these unpleasant surprises.

For 2, I agree that there is a performance penalty in some cases for using Strings. In cases where the penalty is large (a compiler tokenizing text) there is always the option to use Array.

@esoterra

esoterra commented Mar 3, 2021

How many average programming language users are thinking actively about the fact that strings are variable length when using a unicode API that gives character indices? I think they'll tend to expect indexing into this Seq[U32] to be O(1).

As a simple example program that I think will surprise people, imagine writing a palindrome checker.
I think many users would iterate from 0 to n/2 comparing the values at index i and n-i (or -i).
I don't think they're going to expect that that strategy will put them in O(n^2).
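To make the cost concrete, here is that pattern in Python, with a linear-scan codepoint lookup emulating what index-based access over a variable-width encoding must do (the helper names are made up for this sketch):

```python
def codepoint_at(data: bytes, i: int) -> str:
    # Emulates O(n) indexing: a UTF-8 string must scan from the start
    # to find the i-th codepoint, since codepoints vary in width.
    for n, ch in enumerate(data.decode("utf-8")):
        if n == i:
            return ch
    raise IndexError(i)

def is_palindrome(data: bytes, n: int) -> bool:
    # Each of the n/2 comparisons costs O(n), so the check is O(n^2).
    return all(codepoint_at(data, i) == codepoint_at(data, n - 1 - i)
               for i in range(n // 2))

print(is_palindrome("révivér".encode("utf-8"), 7))   # True
```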

@esoterra

esoterra commented Mar 3, 2021

Here's a table I'm putting together to summarize different language's design choices on this matter.
I've tried to fill in as much of the most popular programming languages on GitHub,
feel free to suggest corrections or additions and I'll render them.

| Language | Encodings | Indexing | Notes |
| --- | --- | --- | --- |
| JavaScript | UTF-16 | ? | Surprisingly hard to find info on indexing operations |
| Python | Latin-1, UCS-2, UCS-4 | O(1) | Strings are immutable; encoding selected automatically |
| Java | UTF-16 | O(1) | Caveat: may only return half a character |
| C++ | ASCII / UTF-8? | O(1) | ? |
| Rust | UTF-8 | O(1) | Uses byte-indexes, which must fall on a codepoint boundary |
| Ruby | ASCII, EUC, UTF-8 | ? | Allows users to choose and convert encodings |
| PHP | Latin-1 | O(1) | No default support for Unicode |

@jasoncarr0

Ah, so you're specifically talking about O(1) byte-indexing. I was thinking more of character indexing, and hence I would be surprised to see it be O(1), at least with good character support.

Could you help explain the use case for byte-indexing, particularly if there's something that's not covered by substrings? Or O(1) character-indexing?

@esoterra

esoterra commented Mar 4, 2021

Among those examples there are some that index by character (e.g. Python) and some that index by byte (e.g. Rust).
As far as I can tell, irrespective of that choice, most popular programming languages' indexing is O(1).
I'm suggesting that regardless of how we choose to index, some people will expect it to be O(1) (see palindrome example).

Most cases for byte-indexing (e.g. compiler lexers & error reporting and text editor piece tables) can probably be replaced by allocating a new substring if the cost of doing so is acceptable.
In Rust, character-based work is usually done by s.chars() with things like s.chars().nth(...) and s.chars().rev().nth(...) filling the role that the proposed character indexing would, while being obvious about their cost.
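The same iterator-based idiom is available in most languages; for example, a Python equivalent of Rust's s.chars().nth(i), where the linear cost is visible at the call site (nth_char is a made-up helper name):

```python
from itertools import islice

def nth_char(s: str, i: int) -> str:
    # Walks the iterator i steps: explicitly O(i), like chars().nth(i).
    return next(islice(iter(s), i, i + 1))

print(nth_char("héllo", 1))   # é
```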

I think that offering O(n) character indexing is going to trip up some new users and that some form of O(1) indexing or slicing is probably needed for certain specific high-performance scenarios.
However, I'll be fine with any decision if it is aware of those concerns and is clear about the chosen performance characteristics.

(I've spent quite a lot of time writing these responses and will be dialing that back, but I can clarify things if needed)

@jasoncarr0

Got it, thanks for the thoughtfulness @Kylebrown9

@redvers

redvers commented Mar 4, 2021

One more datapoint - The BEAM (erlang, elixir, lfe et al) uses utf-8 encoded binaries as the native string datatype. There are multiple other formats supported for external compatibility - but that is its primary.

All the string operations work on graphemes, not bytes.

@redvers

redvers commented Mar 4, 2021

(I'm also going to demonstrate my lack of CS education and admit I don't understand how you can have an algorithm that can identify, say, the 10,000th character in a string with O(1) performance when you have the variable-width encoding that UTF-8 provides)

@esoterra

esoterra commented Mar 4, 2021

The only examples on my list above that use UTF-8 and offer O(1) indexing (Rust/C++) use byte-indexing.
i.e. it does not identify the 10,000th character, it identifies the 10,000th byte

@redvers

redvers commented Mar 4, 2021

I think the only other thing I think I should mention should be in the "How We Teach" section.

Since we're counting and indexing on codepoints and not graphemes we should probably mention that it is possible to get legal codepoints out of this interface which are illegal graphemes.

For example, in a grapheme-based implementation:

iex(260)> "👨‍🦰" |> String.length
1
iex(261)> "👨‍🦰" |> String.codepoints
["👨", "‍", "🦰"]

ie: "Man" + "ZWJ" + "Red Hair"

In this RFC implementation, the length would be 3 (as there are three codepoints). Without documenting it - that's arguably unexpected behaviour because most (or many) developers don't know that there isn't a 1:1 mapping between codepoints and characters.
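This behaviour is easy to reproduce in any codepoint-based API; in Python, for instance (whose strings are sequences of codepoints, like the Strings proposed here):

```python
# One grapheme ("man with red hair"), three codepoints:
# U+1F468 (man) + U+200D (zero-width joiner) + U+1F9B0 (red hair)
s = "\U0001F468\u200D\U0001F9B0"
print(len(s))                       # 3, not 1
print([hex(ord(c)) for c in s])     # ['0x1f468', '0x200d', '0x1f9b0']
```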

Having the interface be grapheme-based would mean that our functions returning single characters would have to return Array[U8] again instead of U32. A significantly higher lift for the 99% use case.
