Sequence.length() not always accurate (e.g. Left-to-right Isolate/ #225

tiptenbrink · 2021-11-09T13:50:11Z

Thanks for the great library! It's a lot more robust and easier to work with than some alternatives.

Sequence.length() uses jquast/wcwidth internally. Unfortunately, wcwidth is not accurate for all Unicode characters, including LRI (U+2066) and PDI (U+2069). For both, wcwidth returns 1, although these characters have zero width because they are not printed in the terminal (I'm using GNOME Terminal). This corresponds to jquast/wcwidth#26. One possible fix would be to replace wcwidth with cwcwidth, which is used by curtsies (and bpython) and, as a bonus, has a much faster implementation.

Context

It's possible that some terminals show these with width 1, but that would be incorrect behavior, as LRI and PDI are supposed to only affect directionality (for LTR and RTL scripts) and not be displayed. For example, if you want to display individual Hebrew characters not with their actual meaning, but as a binary decoding (which is my strange use case), you want to print each character as '⁦א⁩' (there's an LRI and a PDI on the left and right side of the character, respectively), so that if you combine several, the string is displayed in memory order, e.g. '⁦א⁩⁦ל⁩'. If you printed it normally, you'd get 'אל'. As you can also see in this text, they are invisible.
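To make the wrapping described above concrete, here is a minimal sketch using only the Python standard library. The helper name `isolate` is hypothetical and not part of blessed or wcwidth; it only illustrates that each wrapped character adds two format code points that a terminal is not expected to draw.

```python
import unicodedata

LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(ch: str) -> str:
    """Wrap one character in LRI ... PDI (hypothetical helper for illustration)."""
    return f"{LRI}{ch}{PDI}"


# Alef and lamed, each isolated so they appear in memory order.
text = "".join(isolate(ch) for ch in "\u05d0\u05dc")

# Code points that are not in the Unicode 'Cf' (format) category.
visible = [ch for ch in text if unicodedata.category(ch) != "Cf"]

print(len(text))     # number of code points, including LRI/PDI
print(len(visible))  # number of code points left after dropping 'Cf'
```

A width function that counts LRI and PDI as one column each would report three times the expected width for a string built this way.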

Of course, some editors do display these characters (e.g. IntelliJ) as they can be sneaked in to alter source code (see for example the security issue that prompted the recent 1.56.1 Rust release) and people who view code in the terminal might want some special characters to reveal the presence of those characters as well. But that should not be the default, as in actual display strings the characters should be invisible.

avylove · 2021-11-13T13:27:26Z

@jquast will need to take a look at this when he gets the time. Just wanted to comment so you know it's not being ignored.

jquast · 2021-11-15T15:26:08Z

I see that cwcwidth uses category 'Cf' in its zero-width table and wcwidth does not; that is the problem. I will try to address it in wcwidth in the coming weeks. Thanks!
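As a rough illustration of the idea (not the actual wcwidth or cwcwidth implementation), a width function that treats the 'Cf' category as zero width could look like the sketch below. The name `approx_width` and the exact set of zero-width categories are assumptions made for this example; real libraries use generated tables and handle more cases, and some 'Cf' characters (such as the soft hyphen) may be drawn by certain terminals.

```python
import unicodedata

# Assumption for this sketch: format characters and combining marks take no columns.
ZERO_WIDTH_CATEGORIES = {"Cf", "Mn", "Me"}


def approx_width(text: str) -> int:
    """Approximate the number of terminal columns needed to print `text`."""
    total = 0
    for ch in text:
        if unicodedata.category(ch) in ZERO_WIDTH_CATEGORIES:
            continue  # e.g. LRI (U+2066) and PDI (U+2069) contribute nothing
        # East Asian Wide/Fullwidth characters usually occupy two columns.
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total


print(approx_width("\u2066\u05d0\u2069"))  # alef wrapped in LRI/PDI
```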

tiptenbrink · 2021-11-17T12:42:12Z

Thanks for looking into it!

jquast · 2023-10-30T23:03:02Z

I know it's been a long time, but this is resolved in today's release of wcwidth by jquast/wcwidth#91.

best wishes

jquast closed this as completed Oct 30, 2023
