Changes for rfc-179 unicode string #3679

Closed
rowland66 wants to merge 5 commits


@rowland66 rowland66 commented Dec 1, 2020

Implementation of rfc-179 improving unicode support in the String class.

@chalcolith
Member

@rowland66 after #3683 is merged you can rebase on master (or at least cherry-pick the changes to .cirrus.yml) to make the Windows checks run.

@SeanTAllen SeanTAllen added the changelog - changed Automatically add "Changed" CHANGELOG entry on merge label Dec 9, 2020
@ponylang-main
Contributor

Hi @rowland66,

The changelog - changed label was added to this pull request; all PRs with a changelog label need to have release notes included as part of the PR. If you haven't added release notes already, please do.

Release notes are added by creating a uniquely named file in the .release-notes directory. We suggest you call the file 3679.md to match the number of this pull request.

The basic format of the release notes (using markdown) should be:

## Title

End user description of changes, why it's important,
problems it solves etc.

If a breaking change, make sure to include 1 or more
examples what code would look like prior to this change
and how to update it to work after this change.

Thanks.

@SeanTAllen SeanTAllen added the do not merge This PR should not be merged at this time label Dec 9, 2020
@SeanTAllen
Member

I'm marking as "DO NOT MERGE" given the RFC hasn't been accepted yet and I don't want this to get accidentally merged.

@@ -380,7 +380,12 @@ class Array[A] is Seq[A]
Truncate an array to the given length, discarding excess elements. If the
array is already smaller than len, do nothing.
"""
- _size = _size.min(len)
+ if len >= _alloc then
+   _size = len.min(_alloc)
Contributor

if len >= _alloc then len.min(_alloc) is just _alloc.
Or these two len.min(_alloc) lines could be pulled out

Returns the byte array underlying the string. This buffer will contain
bytes of the String codepoints in the default system encoding (UTF-8).
The array will not reflect all changes in the String from which it is
obtained. This is an unsafe function.
Contributor

I'm not sure about "This is an unsafe function." as despite a few weird bits of behavior this doesn't cause any "unsafety" per se and it's a bit scary.

Maybe more like:
"The array may not reflect future changes in the String from which it is obtained.

This function is meant to supply low-level access to the bytes of a string, the array function provides a safer interface."

"""
Increase the size of a string to the give len in bytes. This is an
unsafe operation, and should only be used when string's _ptr has
been manipulated through a FFI call and the string size is known.
Contributor

This didn't get caught in the RFC as it stands, but as I'm reading this, this can easily cause UB/segfaults in pure, isolated Pony code. I think this needs some form of authorization or should require an extra FFI call.

Can these use cases use from_cpointer instead, which takes authorization in the form of a non-null Pointer[U8] ref?

var last: USize = 0
let offset = _offset_to_index(from.isize())

if (to > to.isize().max_value().usize()) then
Contributor

extra parens here?


// use the new size' for alloc if we're not including the last used byte
// from the original data and only include the extra allocated bytes if
// we're including the last byte.
- _alloc = if last == _size then _alloc - offset else size' end
+ _alloc = if to == _size then _alloc - from else size' end

_size = size'
Contributor

Memory safety issue here: the user can set _size to more or less anything with this. I think this needs some extra capping on to and from.
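
The capping asked for in this comment could look roughly like the sketch below. It is illustrative only and not taken from the PR: `TrimBounds` and `clamp` are hypothetical names, and the sketch assumes that `from`, `to` and the buffer size are all plain byte counts.

```pony
primitive TrimBounds
  """
  Hypothetical helper, not part of the PR: clamps a requested byte range
  so that it always lies inside a buffer holding `size` bytes.
  """
  fun clamp(size: USize, from: USize, to: USize): (USize, USize) =>
    // The end of the range can never go past the current size.
    let last = to.min(size)
    // The start can never go past the (already capped) end.
    let first = from.min(last)
    (first, last)

actor Main
  new create(env: Env) =>
    // A request far outside an 8-byte buffer collapses to a safe range.
    (let first, let last) = TrimBounds.clamp(8, 3, 1000)
    env.out.print(first.string() + ".." + last.string()) // prints "3..8"
```

With bounds capped this way, `last - first` can never exceed the original size, so a new `_size` derived from it stays within the allocation.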

end

while (((inc > 0) and (i < limit) and (n <= offset)) or
((inc < 0) and (i >= 0) and (n > offset))) do
Contributor

Seems like an extra set of parens

"""
- Return an iterator over the codepoints in the string.
+ Return an iterator over the codepoint in the string.
Contributor

I think this was unintentional?

end

class _LimittedIterator[A] is Iterator[A]
Contributor

Nit: should be Limited

@@ -66,7 +66,7 @@ class _TestPing is UDPNotify
=>
_h.complete_action("ping receive")

- let s = String .> append(consume data)
+ let s = recover val String.from_iso_array(consume data) end
Contributor

recover is a bit odd here since it's dropping from iso^ to val; perhaps annotate s: String val instead.
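
A minimal sketch of the annotation form suggested here, assuming `String.from_iso_array` returns `String iso^` as it does in the current standard library. The `data` array is built locally purely for illustration; in the test it arrives as a parameter.

```pony
actor Main
  new create(env: Env) =>
    // Stand-in for the `data` parameter received by the UDP notifier.
    let data: Array[U8] iso = recover iso [as U8: 'p'; 'i'; 'n'; 'g'] end

    // An ephemeral iso (iso^) is a subtype of val, so a type annotation
    // is enough; no recover block is needed.
    let s: String val = String.from_iso_array(consume data)
    env.out.print(s)
```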

@@ -106,7 +106,7 @@ class _TestPong is UDPNotify
=>
_h.complete_action("pong receive")

- let s = String .> append(consume data)
+ let s = recover val String.from_iso_array(consume data) end
Contributor

Same here

@rowland66
Author

rowland66 commented Dec 14, 2020 via email

@jasoncarr0
Contributor

jasoncarr0 commented Dec 14, 2020

Thanks for the responses @rowland66. I didn't think to catch that the private methods' safety conditions were being checked by the calling public methods. And I totally get you on that muscle memory haha. I make exactly the same mistake when switching between Java and parens-less languages.

Wrt. resize, I think in general "dangerous" behavior is fine, but (and I think others could speak to other methods for ensuring it) I don't believe there's any flexibility with regard to memory safety in pure Pony code.

@jemc
Member

jemc commented Dec 15, 2020

Discussed the resize/resize_bytes issue in the sync call.

I suggest renaming it truncate_bytes, and ensuring that it can only reduce the size of the buffer and never increase it, to avoid the memory safety issue. This would be at parity with how truncate previously worked, right?
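
A minimal sketch of the shrink-only behavior described in this suggestion. `ByteLen` is a hypothetical stand-in that tracks only a size field; it is not the `String` class and not code from the PR.

```pony
class ByteLen
  """
  Hypothetical stand-in for a string's size bookkeeping.
  """
  var _size: USize

  new create(size': USize) =>
    _size = size'

  fun size(): USize => _size

  fun ref truncate_bytes(len: USize) =>
    // Taking the minimum means a larger `len` is a no-op, so the
    // logical size can shrink but never grow.
    _size = _size.min(len)

actor Main
  new create(env: Env) =>
    let t = ByteLen(8)
    t.truncate_bytes(32) // ignored: would grow the size
    env.out.print(t.size().string()) // prints "8"
    t.truncate_bytes(3)
    env.out.print(t.size().string()) // prints "3"
```

Because the size can only decrease, no call can expose bytes beyond what was already valid, which is the memory safety property at issue in this thread.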

@rowland66
Author

rowland66 commented Dec 16, 2020 via email

@jasoncarr0
Contributor

Oh, if we support resizing up to the length of the buffer, then it is reasonable to name it resize.
The only pitfall is that we may then need to ensure the entire buffer is always initialized for all strings, based on my understanding of the codegen wrt. LLVM's memory model.

@rowland66
Author

rowland66 commented Dec 18, 2020 via email

Base automatically changed from master to main February 8, 2021 23:02
@SeanTAllen
Member

@rowland66 are you still interested in working on this?

@rowland66
Author

rowland66 commented Jan 31, 2022 via email

@SeanTAllen
Member

@ponylang/committer I feel like we should close this as it is rather out of date. Agreed?

@ponylang-main ponylang-main added the discuss during sync Should be discussed during an upcoming sync label Jan 29, 2024
@SeanTAllen SeanTAllen closed this Jan 30, 2024
@ponylang-main ponylang-main removed the discuss during sync Should be discussed during an upcoming sync label Jan 30, 2024
6 participants