Following the discussion in this issue, there was agreement that the String class has some pain points that I have run into myself.
One issue is that the String class is not encoding-aware and exposes the underlying byte sequence, so a user can start with a valid string in a known encoding (UTF-8, for instance) and end up with an invalid one.
As a user, I would like to have a String class that does not expose the inner byte array. As such, it should have the following properties:
The string should be encoding-aware, either by design (support a single encoding and reject everything else) or by carrying the encoding in the type alongside the actual data.
Hard-coded strings in source code should have a well-defined encoding. I think UTF-8 makes a good default. As such, the compiler should reject any file that is not a strictly valid UTF-8 document (or whatever encoding is chosen, if UTF-8 is not the default).
The string class should not expose functions that let the user alter the inner buffer in such a way that it becomes invalid.
String constructors that take "untrusted values" (a byte array, for example) should be partial and fail on invalid input, to be consistent with the previous point (see the sketch after this list).
The string class should implement Seq[String] instead of Seq[U8], as it is more convenient to represent a character as a string of length 1 than to expose the internal byte array.
Any string function that takes an index should interpret it in terms of characters rather than bytes in the internal array. Anything else is error-prone, since some common encodings (UTF-8, for instance) have variable-width characters.
The null terminator concern should only surface when the user wants to interact with external code over the FFI.
It would be nicer if the standard library could provide encoding conversions (UTF-8 <=> UTF-32 and so on), and even nicer if the common string manipulation functions worked on every string class no matter which encoding is used.
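To make the partial-constructor point concrete, here is a minimal sketch of what it could look like. SafeString and from_bytes are hypothetical names, not part of the current standard library, and the check below only validates the lead/continuation byte structure of UTF-8; a real implementation would also have to reject overlong forms, surrogates, and codepoints above 0x10FFFF:

class val SafeString
  let _inner: String val

  // Partial constructor: errors on byte sequences that are not
  // structurally valid UTF-8, so an instance can only hold valid data.
  new val from_bytes(data: Array[U8] val) ? =>
    var i: USize = 0
    while i < data.size() do
      let b = data(i)?
      let width: USize =
        if b < 0x80 then 1
        elseif (b and 0xE0) == 0xC0 then 2
        elseif (b and 0xF0) == 0xE0 then 3
        elseif (b and 0xF8) == 0xF0 then 4
        else error end
      var j: USize = 1
      while j < width do
        // Every byte after the lead byte must be a continuation byte.
        if (data(i + j)? and 0xC0) != 0x80 then error end
        j = j + 1
      end
      i = i + width
    end
    _inner = String.from_array(data)

  fun string(): String val => _inner

With this, SafeString.from_bytes("🐎".array())? succeeds, while a malformed sequence such as [0xC3; 0x28] makes the constructor fail instead of producing an invalid string.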
Here is an example that illustrates a few of the problems:
use"format"actorMainnewcreate(env: Env) =>
let sentence = " this is a pony -> 🐎"trylet index_of_pony = sentence.find("🐎")?
(let char_u32 : U32, let nb_bytes_of_char : U8) = sentence.utf32(index_of_pony)?
env.out.print(recovervalString.from_utf32(char_u32) end) // correct
env.out.print(sentence.substring(index_of_pony, index_of_pony + 1)) // incorrect
env.out.print(sentence.substring(index_of_pony, index_of_pony + 4)) //correctendvar sentence2 : Stringref = "hi".clone() // so far valid UTF-8let invalid_code_point = U32(0x11FFF) // 0x11FFF is not a valid codepoint look at http://unicode.org/faq/utf_bom.html#gen6
sentence2.push_utf32(invalid_code_point)
env.out.print(recoverval sentence2.clone() end)
try
(let char_u32 : U32, let nb_bytes : U8) = sentence2.utf32(2)?
env.out.print(
"char as codepoint is " + Format.int[U32](char_u32, FormatHex) +
" takes " + nb_bytes.string() + " bytes")
end
As you can see in the first try block, finding a character in a string is error-prone. Naively calling the substring function can return an invalid UTF-8 string even if the original one is valid. And the input string is only valid UTF-8 if the source file itself is encoded in UTF-8. In the second try block, we see that we can push onto a valid UTF-8 string a codepoint that is not defined by the Unicode standard (or so it seems).
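In the meantime, a character-counting substring can be built by hand on top of runes(), the standard library's codepoint iterator. A minimal sketch (CharSub is a hypothetical helper, not stdlib):

primitive CharSub
  // Substring over character positions [from, to): counts codepoints
  // rather than byte offsets, so a multi-byte character is never split.
  fun apply(s: String box, from: USize, to: USize): String iso^ =>
    let out = recover String end
    var i: USize = 0
    for rune in s.runes() do
      if (i >= from) and (i < to) then out.push_utf32(rune) end
      i = i + 1
    end
    consume out

With the example above, CharSub(sentence, 19, 20) returns the pony emoji without the caller having to know that it spans 4 bytes.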
A good reference to use for a safe string API would be Rust's std::string::String and std::ffi::OsString, which allow users to interact with valid UTF-8 strings and platform-specific string representations, respectively.
There are a few differences between Rust's string design and what is proposed above.
Rust strings have a String::chars method, which allows iterating over the char type rather than treating the string as a sequence of strings.
Rust additionally offers a String::char_indices method that lets you see the byte index of each character you iterate over.
Rust does allow users to do some indexing into strings as byte arrays, but doing so risks a panic (example with index) or returns an Option (example with get).
Users have a tendency to assume indexing is a constant-time operation, which is not the case for character indexing into UTF-8, and I believe that is why this ability is only offered through an iterator (e.g. s.chars().nth(...)) in Rust.
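For what it's worth, the iterator-based approach already maps onto Pony today: String.runes() yields the string's codepoints as U32 values, so a Rust-style s.chars().nth(n) lookup could be sketched like this (Runes.nth is a hypothetical helper, not stdlib):

primitive Runes
  // Linear-time lookup of the nth codepoint, mirroring Rust's
  // s.chars().nth(n); partial, so an out-of-range index fails
  // rather than silently returning garbage.
  fun nth(s: String box, n: USize): U32 ? =>
    var i: USize = 0
    for rune in s.runes() do
      if i == n then return rune end
      i = i + 1
    end
    error

Keeping the lookup behind an iterator-shaped helper makes the linear cost visible at the call site, which is exactly the rationale described above for Rust.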