Sequence.length() not always accurate (e.g. Left-to-right Isolate/ #225

Closed
tiptenbrink opened this issue Nov 9, 2021 · 4 comments

Comments

@tiptenbrink

Thanks for the great library! It's a lot more robust and easier to work with than some alternatives.

Sequence.length() uses jquast/wcwidth internally. Unfortunately, it is not accurate for all Unicode characters, including LRI (U+2066) and PDI (U+2069). For both, wcwidth returns 1, although these characters should have width zero because they are not printed in the terminal (I'm using GNOME Terminal). This corresponds to jquast/wcwidth#26. One possible fix would be to replace wcwidth with cwcwidth, which is used by curtsies (and bpython) and, as a bonus, has a much faster implementation.

Context

It's possible some terminals show these with width 1, but that would be incorrect behavior, as LRI and PDI are supposed to only affect directionality (for LTR and RTL scripts) and not be displayed. For example, if you want to display individual Hebrew characters not with actual meaning, but as a binary decoding (which is my strange use case), you want to print each character as '⁦א⁩' (there's an LRI and a PDI on the left and right side of the character, respectively), so that when you combine several of them the string is displayed in memory order, e.g. '⁦א⁩⁦ל⁩'. If you printed it normally, you'd get 'אל'. As you can also see in this text, they are invisible.
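As a minimal sketch of the wrapping described above, each character can be surrounded with LRI (U+2066) and PDI (U+2069). The helper name `isolate_ltr` is hypothetical and only for illustration; it is not part of blessed or wcwidth.

```python
# Hypothetical helper, for illustration only (not part of blessed or wcwidth).
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate_ltr(text: str) -> str:
    """Wrap every character in its own LRI ... PDI pair."""
    return "".join(f"{LRI}{ch}{PDI}" for ch in text)


wrapped = isolate_ltr("אל")
# Two visible letters, but six code points once the isolates are added,
# which is why a width function that counts LRI/PDI as 1 overestimates.
print(len("אל"), len(wrapped))
```

The string holds six code points while only two cells should be occupied on screen, so any width calculation that assigns 1 to the isolates would report 6 instead of 2.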

Of course, some editors do display these characters (e.g. IntelliJ) as they can be sneaked in to alter source code (see for example the security issue that prompted the recent 1.56.1 Rust release) and people who view code in the terminal might want some special characters to reveal the presence of those characters as well. But that should not be the default, as in actual display strings the characters should be invisible.

@avylove
Collaborator

avylove commented Nov 13, 2021

@jquast will need to take a look at this when he gets the time. Just wanted to comment so you know it's not being ignored.

@jquast
Owner

jquast commented Nov 15, 2021

I see cwcwidth uses category 'Cf' in the zero-width table and wcwidth does not; that is the problem. I will try to address it in wcwidth in the coming weeks, thanks.
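To illustrate the 'Cf' point, here is a rough sketch using only Python's standard `unicodedata` module. The function `naive_width` is hypothetical and deliberately simplified; it is not the algorithm used by wcwidth or cwcwidth, and treating every 'Cf' code point as zero width is an assumption that real width tables may refine.

```python
import unicodedata

# Both isolates are in general category "Cf" (Format).
print(unicodedata.category("\u2066"))  # LRI
print(unicodedata.category("\u2069"))  # PDI


def naive_width(text: str) -> int:
    """Simplified width estimate (hypothetical, not wcwidth's algorithm).

    Assumption: format characters (Cf) and combining marks (Mn, Me)
    take no cells; wide/fullwidth East Asian characters take two.
    """
    total = 0
    for ch in text:
        if unicodedata.category(ch) in ("Cf", "Mn", "Me"):
            continue  # counted as zero width
        wide = unicodedata.east_asian_width(ch) in ("W", "F")
        total += 2 if wide else 1
    return total


# LRI + alef + PDI + LRI + lamed + PDI: only the two letters are counted.
sample = "\u2066\u05d0\u2069\u2066\u05dc\u2069"
print(naive_width(sample))
```

With 'Cf' counted as zero width, the isolates no longer contribute to the result, which matches what the terminal displays.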

@tiptenbrink
Author

Thanks for looking into it!

@jquast
Owner

jquast commented Oct 30, 2023

I know it's been a long time, but this is resolved in today's release of wcwidth by jquast/wcwidth#91

best wishes

@jquast jquast closed this as completed Oct 30, 2023
3 participants