Wrong width for Hindi on macOS, but correct width on Linux #25

tleonhardt · 2018-06-27T04:38:38Z

I tried using wcwidth to calculate the length of the name for the city of Mumbai in Hindi (बॉम्बे हिंदी)

from wcwidth import wcswidth
wcswidth('बॉम्बे हिंदी')
9

On macOS 10.13.5 using Python 3.6.5, I see a visual width of 5 characters and a calculated width of 9 characters.

On Ubuntu 18.04 using Python 3.6.5, I see a visual width of 9 characters and a calculated width of 9 characters.

Thank you by the way for creating a very useful module!

jquast · 2018-06-27T20:42:05Z

Thank you very much for the specific example, definitely a bug to try to address !

tleonhardt · 2018-06-27T20:50:16Z

In case it is important for future troubleshooting, I was using iTerm 2 version 3.1.6 with Bash version 4.4.23

shreevatsa · 2018-11-27T22:35:58Z

Almost surely, the "correct" width on Linux is because of a broken terminal that does not display the characters correctly. On macOS, the terminal uses the system libraries for text rendering, and properly renders the combining characters. On Linux, the terminal does something horrible.

Here's a much simpler test case:

>>> from wcwidth import wcswidth
>>> wcswidth('का')
2

That's U+0915 DEVANAGARI LETTER KA followed by U+093E DEVANAGARI VOWEL SIGN AA.

It is supposed to be displayed as a single grapheme (e.g. you should not be able to place the cursor between them to type, and definitely you should not see a dotted circle), but on Linux terminals I get very weird results, with क in one cell, and ा in another cell (with the dotted circle).

The wcswidth result of "2" is consistent with this weird result, but obviously incorrect.
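
To see why a per-code-point sum gives 2 here, it can help to list the code points with their Unicode general categories. A minimal sketch using only the standard library's unicodedata module (the helper name `describe` is made up for illustration):

```python
import unicodedata


def describe(text):
    """Return (code point, name, general category) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
        for ch in text
    ]


for row in describe("का"):
    print(row)
```

The second code point is a spacing combining mark (category Mc), not a standalone letter, which is why a renderer that shapes text draws both code points as one cluster.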

I imagine it's similarly broken for all Indic scripts. In fact I cannot imagine how something like wcwidth, which only returns integers, is going to work for Indic, Arabic, etc scripts. Searching for [wcwidth indic] brings up some results like mintty/mintty#553 xtermjs/xterm.js#1468 xtermjs/xterm.js#72 and see the mentions of "Indic" at https://www.cl.cam.ac.uk/~mgk25/unicode.html -- looks like a lot of software is broken.

jquast · 2018-11-28T09:12:54Z

Hello new conversationalist :)

I know the details pretty well, but re-reading the last link, the mentions of “Indic” in there paint a pretty dire picture of “no support”. The terminal landscape is very rich today, but it is mostly based on libc wcwidth(3) with little font intelligence. Support for combining characters has improved beyond what was written there in past years, I suppose because of the emoticon situation: folks want silly icons aligned.

I personally saw Hindi as the go-to test characters. I have a combining browser in the repo that you’re welcome to use or modify to view “Indic” language effects.

looks like a lot of software is broken

Well, we can fix it! This library has had some light use in the terminal landscape for Python, and Python is very readable. Folks may already be copying us, and I hope they do, into the next future wcwidth(), so please join if you can help :)

I have a scheme for automatically detecting the Unicode support level by introspection, using the terminal's report cursor position query.
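
The idea of measuring by cursor position query can be sketched roughly as follows. This is not the actual scheme, which is not shown in this thread. It assumes a terminal that answers the standard "report cursor position" query (ESC [ 6 n) with ESC [ row ; col R, and the function names and the injected `write` / `read_reply` callables are hypothetical:

```python
import re

CPR_QUERY = "\x1b[6n"                       # ask the terminal where the cursor is
CPR_REPLY = re.compile(r"\x1b\[(\d+);(\d+)R")


def parse_cpr(reply):
    """Parse a cursor position report into (row, column)."""
    match = CPR_REPLY.search(reply)
    if match is None:
        raise ValueError(f"not a cursor position report: {reply!r}")
    return int(match.group(1)), int(match.group(2))


def measure_width(text, write, read_reply):
    """Return how many columns the terminal advanced when printing text.

    write(s) sends a string to the terminal; read_reply() returns the
    terminal's next reply. Both are supplied by the caller (assumption).
    """
    write("\r" + CPR_QUERY)                 # start from the first column
    _, start_col = parse_cpr(read_reply())
    write(text + CPR_QUERY)                 # print, then ask again
    _, end_col = parse_cpr(read_reply())
    return end_col - start_col
```

With a real terminal, `write` would send to stdout and `read_reply` would read from stdin in raw mode. The text must also fit on one line, since a wrap would make the column difference meaningless.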

Anyway if anyone is interested in taking on the bulk of the work I will transfer knowledge and always accept test-passing PR’s that make sense.

Otherwise I’m FOSS retired and unable to put in the hours, sorry and good luck

jquast · 2020-06-01T17:33:37Z

Wow, this really screws up in iTerm2, which is one of the better terminals for these things, especially when navigating a cursor around the text. In iTerm2 it is displayed in 1 cell, not 2, but wcwidth returns 2 for this example.

Zarainia · 2021-04-26T16:26:06Z

I am having similar problems with various WSL terminals. Gnome Terminal under an X server or WSLg displays á (Latin letter a + combining accent) as width 1, which is what wcwidth returns. When entering WSL from Windows, Windows Terminal and the default bash terminal display it with width 2, while Hyper and Terminus display it as width 1. Yet ucs-detect reports Unicode version 13.0.0 for all of them, despite the different results.

dscrofts · 2022-04-17T01:51:22Z

I too am having issues with this. Here's another specific example:

from wcwidth import wcswidth
wcswidth("चाइनीज")
6

When viewing this using the Kitty terminal, the text only occupies 4 columns, not 6. This seems to be a problem in general with most Hindi text that I've encountered.

Tested on macOS Big Sur 11.6.5 and Kitty 0.24.4

Major
-----

Bugfix zero-width characters, closes #57, #47, #45, #39, #26, #25, #24, #22, #8, wow!

This is mostly achieved by replacing `ZERO_WIDTH_CF` with dynamic parsing by Category codes in bin/update-tables.py and putting those in the zero-wide tables.

Tests
-----

- `verify-table-integrity.py` exercises a "bug" of duplicated tables that has no effect, because wcswidth() first checks for zero-width, and that is preferred in cases of conflict. This PR also resolves that error of duplication.
- new automatic tests for balinese, kr jamo, zero-width emoji, devanagari, tamil, kannada.
- added pytest-benchmark plugin, example use:

        # baseline
        tox -epy312 -- --verbose --benchmark-save=original
        # compare
        tox -epy312 -- --verbose --benchmark-compare=.benchmarks/Linux-CPython-3.12-64bit/0001_original.json
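
The category-based approach described in the PR text above can be approximated with the standard library. This is only a sketch: the exact list of category codes used by bin/update-tables.py is not stated in this thread, so the set below is an assumption, and the function ignores wide (East Asian) characters entirely:

```python
import unicodedata

# Assumed set of general categories treated as zero-width. The real list
# lives in bin/update-tables.py and may differ.
ASSUMED_ZERO_WIDTH_CATEGORIES = {"Mn", "Me", "Mc", "Cf"}


def approx_width(text):
    """Count one cell per character, except assumed zero-width categories."""
    return sum(
        0 if unicodedata.category(ch) in ASSUMED_ZERO_WIDTH_CATEGORIES else 1
        for ch in text
    )


for sample in ("का", "गीत", "चाइनीज"):
    print(sample, approx_width(sample))
```

Under these assumptions "चाइनीज" measures 4, which agrees with the 4 columns reported for Kitty earlier in this thread.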

jquast · 2023-10-30T19:56:19Z

Zero-width characters used with the Hindi language have been resolved in today's release by #91.

I created a testing tool that verifies it, that at least in the case of "Universal Declaration of Human Rights" from https://unicode.org/udhr/ in Hindi, that wcwidth now agrees in measurement of all words with "kitty" and "mlterm" terminals.

dscrofts · 2023-12-01T20:52:27Z

Just found another issue (perhaps a corner-case?) with Hindi:

>>> from wcwidth import wcswidth
>>> wcswidth("गीत")
3

This sequence should only occupy 2 cells.

jquast · 2023-12-01T21:46:12Z

@dscrofts thank you for your persistence, I really do appreciate your help with Hindi!

Can you please check your version of wcwidth is the latest, 0.2.12? This is measured as 2 in my tests.

Just to be sure, here is my test session,

>>> import unicodedata, wcwidth
>>> wcwidth.__version__
'0.2.12'
>>> l='गीत'
>>> wcwidth.wcswidth(l)
2
>>> ', '.join([unicodedata.name(x).title() for x in l])
'Devanagari Letter Ga, Devanagari Vowel Sign Ii, Devanagari Letter Ta'
>>> [unicodedata.category(x) for x in l]
['Lo', 'Mc', 'Lo']

dscrofts · 2023-12-02T01:24:48Z

@jquast Thanks for the quick response!

So I ran your code and get identical output. I think there was an issue copy/pasting the characters in my terminal that led to the wrong output!

After further investigation, it looks like my issue is rather with truncating the text. Specifically, if it is in the middle of a Hindi sequence. I try to left justify the text to a given width, but it seems to be breaking. This might be a good candidate for #93 to implement. I did see there was support for this previously? Any hints as to how I might go about handling/implementing this?

jquast · 2023-12-02T02:10:11Z

it looks like my issue is rather with truncating the text. Specifically, if it is in the middle of a Hindi sequence

Yes, there would be problems with breaking up a sequence that contains combining characters. It sounds like you are not writing a "left justify" function, but maybe a text wrapping function?
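
As a rough illustration of not splitting a sequence, a truncation helper can group each base character with the combining marks that follow it and cut only between groups. This is a simplified sketch using the standard library: the names are hypothetical, it counts every group as one cell (so it ignores wide characters), and it does not handle virama conjuncts or ZWJ sequences:

```python
import unicodedata

MARK_CATEGORIES = {"Mn", "Me", "Mc"}  # combining marks attach to the previous base


def clusters(text):
    """Yield each base character together with its trailing combining marks."""
    cluster = ""
    for ch in text:
        if cluster and unicodedata.category(ch) not in MARK_CATEGORIES:
            yield cluster
            cluster = ""
        cluster += ch
    if cluster:
        yield cluster


def truncate(text, width):
    """Keep at most `width` clusters, never cutting inside a cluster."""
    kept = []
    for cluster in clusters(text):
        if len(kept) >= width:
            break
        kept.append(cluster)
    return "".join(kept)


print(list(clusters("गीत")), truncate("गीत", 1))
```

Slicing the same string with "गीत"[:1] would instead keep only the base letter and drop its vowel sign.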

If I write a "wc_ljust" function modeled after the one in the readme, there is no opportunity for truncation. It only appends spaces to fill to the given width, just as the built-in str.ljust() or string formatting like f'{var:<10}' does not truncate text:

        def wc_ljust(text, length, padding=' '):
            from wcwidth import wcswidth
            return text + padding * max(0, (length - wcswidth(text)))

The Python textwrap module tries to break strings only at whitespace, but the default argument break_long_words=True also allows it to break "long words" into pieces if they are very long. Python's textwrap makes no effort to take combining characters into account; it does not understand wide or zero-width characters at all.
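
A small demonstration of that behaviour, using only the standard library and the two-code-point example from earlier in this thread:

```python
import textwrap

word = "का"  # U+0915 KA followed by U+093E VOWEL SIGN AA

# Default break_long_words=True: the word is cut between code points,
# separating the vowel sign from its base letter.
broken = textwrap.wrap(word, width=1)

# break_long_words=False: the word is kept whole even though it is
# longer than the requested width when counted in code points.
kept = textwrap.wrap(word, width=1, break_long_words=False)

print([len(line) for line in broken], [len(line) for line in kept])
```

This prints `[1, 1] [2]`: with the default setting the vowel sign ends up alone on its own line.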

I did see there was support for this previously?

We have only ever provided the "wc_rjust" example in the readme file for this project.

In issue #93 I am referring to a terminal library of mine, blessed, that has these functions (ljust, rjust, center, and wrap). I think all of these functions would handle Hindi correctly by using break_long_words=False with wrap() to ensure it will not break sequences at combining characters if you just want to try/copy from that -- but please be aware that it will fail with ZWJ emojis and emojis with VS-16 sequences, an example:

>>> import blessed
>>> inp='क़ानून की निग़ाह में सभी समान हैं और सभी बिना भेदभाव के समान क़ानूनी सुरक्षा केस घोषणा का अतिक्रमण करके कोई भी भेद-भाव किया जाया उस प्रकार के भेद-भाव को किस, तो उसके विरुद्ध समान संरक्षण का अधिकार सभी को प्राप्त है ।'
>>> lines=blessed.Terminal().wrap(inp, 4, break_long_words=False)
>>> print('-|-'.join(lines))  # display word break locations with '-|-'
क़ानून-|-की-|-निग़ाह-|-में सभी-|-समान-|-हैं और-|-सभी-|-बिना-|-भेदभाव-|-के-|-समान-|-क़ानूनी-|-सुरक्षा-|-के-|-अधिकारी-|-हैं ।-|-यदि-|-इस-|-घोषणा-|-का-|-अतिक्रमण-|-करके-|-कोई भी-|-भेद-भाव-|-किया-|-जाया-|-उस-|-प्रकार-|-के-|-भेद-भाव-|-को किसी-|-प्रकार-|-से-|-उकसाया-|-जाया,-|-तो-|-उसके-|-विरुद्ध-|-समान-|-संरक्षण-|-का-|-अधिकार-|-सभी को-|-प्राप्त-|-है ।
>>> print(list(map(wcwidth.wcswidth, lines))) # display length of each line
[3, 1, 3, 4, 3, 4, 2, 2, 4, 1, 3, 3, 4, 1, 4, 3, 2, 2, 3, 1, 6, 3, 4, 5, 2, 2, 2, 4, 1, 5, 4, 4, 1, 4, 3, 1, 3, 4, 3, 5, 1, 4, 4, 4, 3]

When words are requested to be wrapped at a width of 4 with break_long_words=False, the blessed.Terminal.wrap() function will not attempt to break those words any shorter, which prevents it from truncating any words, especially at combining marks.

dscrofts · 2023-12-02T02:50:04Z

@jquast you are absolutely correct, I should be using text wrapping instead of ljust. In fact, a combination of both is what I need to have things line up correctly. Funnily enough I am already using blessed in my project, so now I ljust the wrap()'d text and all is working great. Thanks for the help and all your hard work with wcwidth and blessed :)

tleonhardt mentioned this issue Jun 27, 2018

table_display.py example now uses tableformatter instead of tabulate python-cmd2/cmd2#456

Merged

tleonhardt changed the title ~~Wrong results for Hindi~~ Wrong results for Hindi on macOS Jun 27, 2018

tleonhardt changed the title ~~Wrong results for Hindi on macOS~~ Wrong width for Hindi on macOS Jun 27, 2018

tleonhardt changed the title ~~Wrong width for Hindi on macOS~~ Wrong width for Hindi on macOS, but correct width on Linux Jun 27, 2018

jquast added the needs-research label Jun 1, 2020

jquast mentioned this issue Oct 19, 2023

Bugfixes for zero-width characters #91

Merged

jquast closed this as completed Oct 30, 2023
