Following the discussion in this issue, there was agreement that the String class has some pain points that I have run into myself.
One issue is that the String class is not encoding-aware and exposes the underlying byte sequence, so a user can start with a valid string in a known encoding (UTF-8, for instance) and end up with an invalid one.
As a user, I would like to have a String class that does not expose the inner byte array. As such, it should have the following properties:
The string should be encoding-aware, either by design (support a single encoding and reject everything else) or by carrying the encoding in the type alongside the actual data.
Hard-coded strings in source code should have a well-defined encoding. I think UTF-8 makes a good default. As such, the compiler should reject any file that is not a strictly valid UTF-8 document (or whatever encoding is chosen, if UTF-8 is not the default).
The string class should not expose functions that let the user alter the inner buffer in such a way that it becomes invalid.
String constructors that take "untrusted values" (a byte array, for example) should be partial and fail on invalid input, to be consistent with the previous point (see the sketch after this list).
The string class should implement Seq[String] instead of Seq[U8], as it is more convenient to represent a character as a string of length 1 than to expose the internal byte array.
Any string function that takes an index should interpret it in terms of characters rather than bytes in the internal array. Anything else is error-prone, since some common encodings (UTF-8, for instance) have variable-width characters.
The null terminator concern should only surface when the user wants to interact with external code over the FFI.
It would be nicer if the standard library could provide encoding conversions (UTF-8 <=> UTF-32 and so on), and even nicer if the common string manipulation functions worked on every string class no matter which encoding is used.
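To make the partial-constructor point concrete, here is a minimal sketch of what it could look like. SafeString and from_bytes are hypothetical names, not part of the current standard library, and the check below only validates the lead/continuation byte structure of UTF-8; a real implementation would also have to reject overlong forms, surrogates, and codepoints above 0x10FFFF:

class val SafeString
  let _inner: String val

  // Partial constructor: errors on byte sequences that are not
  // structurally valid UTF-8, so an instance can only hold valid data.
  new val from_bytes(data: Array[U8] val) ? =>
    var i: USize = 0
    while i < data.size() do
      let b = data(i)?
      let width: USize =
        if b < 0x80 then 1
        elseif (b and 0xE0) == 0xC0 then 2
        elseif (b and 0xF0) == 0xE0 then 3
        elseif (b and 0xF8) == 0xF0 then 4
        else error end
      var j: USize = 1
      while j < width do
        // Every byte after the lead byte must be a continuation byte.
        if (data(i + j)? and 0xC0) != 0x80 then error end
        j = j + 1
      end
      i = i + width
    end
    _inner = String.from_array(data)

  fun string(): String val => _inner

With this, SafeString.from_bytes("🐎".array())? succeeds, while a malformed sequence such as [0xC3; 0x28] makes the constructor fail instead of producing an invalid string.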
Here is an example that illustrates a few of the problems:
use"format"actorMainnewcreate(env: Env) =>
let sentence = " this is a pony -> 🐎"trylet index_of_pony = sentence.find("🐎")?
(let char_u32 : U32, let nb_bytes_of_char : U8) = sentence.utf32(index_of_pony)?
env.out.print(recovervalString.from_utf32(char_u32) end) // correct
env.out.print(sentence.substring(index_of_pony, index_of_pony + 1)) // incorrect
env.out.print(sentence.substring(index_of_pony, index_of_pony + 4)) //correctendvar sentence2 : Stringref = "hi".clone() // so far valid UTF-8let invalid_code_point = U32(0x11FFF) // 0x11FFF is not a valid codepoint look at http://unicode.org/faq/utf_bom.html#gen6
sentence2.push_utf32(invalid_code_point)
env.out.print(recoverval sentence2.clone() end)
try
(let char_u32 : U32, let nb_bytes : U8) = sentence2.utf32(2)?
env.out.print(
"char as codepoint is " + Format.int[U32](char_u32, FormatHex) +
" takes " + nb_bytes.string() + " bytes")
end
As you can see in the first try block, finding a character in a string is error-prone. Naively calling the substring function can return an invalid UTF-8 string even if the original one is valid. And the input string is only valid UTF-8 if the source file itself is encoded in UTF-8. In the second try block, we see that we can push onto a valid UTF-8 string a codepoint that is not defined by the Unicode standard (or so it seems).
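In the meantime, a character-counting substring can be built by hand on top of runes(), the standard library's codepoint iterator. A minimal sketch (CharSub is a hypothetical helper, not stdlib):

primitive CharSub
  // Substring over character positions [from, to): counts codepoints
  // rather than byte offsets, so a multi-byte character is never split.
  fun apply(s: String box, from: USize, to: USize): String iso^ =>
    let out = recover String end
    var i: USize = 0
    for rune in s.runes() do
      if (i >= from) and (i < to) then out.push_utf32(rune) end
      i = i + 1
    end
    consume out

With the example above, CharSub(sentence, 19, 20) returns the pony emoji without the caller having to know that it spans 4 bytes.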
A good reference to use for a safe string API would be Rust's std::string::String and std::ffi::OsString, which allow users to interact with valid UTF-8 strings and platform-specific string representations, respectively.
There are a few differences between Rust's string design and what is proposed above.
Rust strings have a String::chars method, which allows iterating over the char type rather than treating the string as a sequence of strings.
Rust additionally offers a String::char_indices method that lets you see the byte index of each character you iterate over.
Rust does allow users to do some indexing into strings as byte arrays, but doing so risks a panic (example with index) or returns an Option (example with get).
Users have a tendency to assume indexing is a constant-time operation, which is not the case for character indexing into UTF-8, and I believe that is why this ability is only offered through an iterator (e.g. s.chars().nth(...)) in Rust.
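For what it's worth, the iterator-based approach already maps onto Pony today: String.runes() yields the string's codepoints as U32 values, so a Rust-style s.chars().nth(n) lookup could be sketched like this (Runes.nth is a hypothetical helper, not stdlib):

primitive Runes
  // Linear-time lookup of the nth codepoint, mirroring Rust's
  // s.chars().nth(n); partial, so an out-of-range index fails
  // rather than silently returning garbage.
  fun nth(s: String box, n: USize): U32 ? =>
    var i: USize = 0
    for rune in s.runes() do
      if i == n then return rune end
      i = i + 1
    end
    error

Keeping the lookup behind an iterator-shaped helper makes the linear cost visible at the call site, which is exactly the rationale described above for Rust.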