Wrong width for Hindi on macOS, but correct width on Linux #25

Closed
tleonhardt opened this issue Jun 27, 2018 · 13 comments

@tleonhardt

tleonhardt commented Jun 27, 2018

I tried using wcwidth to calculate the length of the name for the city of Mumbai in Hindi (बॉम्बे हिंदी)

from wcwidth import wcswidth
wcswidth('बॉम्बे हिंदी')
9

On macOS 10.13.5 using Python 3.6.5, I see a visual width of 5 characters and a calculated width of 9 characters.

On Ubuntu 18.04 using Python 3.6.5, I see a visual width of 9 characters and a calculated width of 9 characters.

Thank you by the way for creating a very useful module!

@tleonhardt tleonhardt changed the title Wrong results for Hindi Wrong results for Hindi on macOS Jun 27, 2018
@tleonhardt tleonhardt changed the title Wrong results for Hindi on macOS Wrong width for Hindi on macOS Jun 27, 2018
@tleonhardt tleonhardt changed the title Wrong width for Hindi on macOS Wrong width for Hindi on macOS, but correct width on Linux Jun 27, 2018
@jquast
Owner

jquast commented Jun 27, 2018

Thank you very much for the specific example, definitely a bug to try to address!

@tleonhardt
Author

In case it is important for future troubleshooting, I was using iTerm 2 version 3.1.6 with Bash version 4.4.23

@shreevatsa

Almost surely, the "correct" width on Linux is because of a broken terminal that does not display the characters correctly. On macOS, the terminal uses the system libraries for text rendering, and properly renders the combining characters. On Linux, the terminal does something horrible.

Here's a much simpler test case:

>>> from wcwidth import wcswidth
>>> wcswidth('का')
2

That's U+0915 DEVANAGARI LETTER KA followed by U+093E DEVANAGARI VOWEL SIGN AA.

It is supposed to be displayed as a single grapheme (e.g. you should not be able to place the cursor between them to type, and definitely you should not see a dotted circle), but on Linux terminals I get very weird results, with क in one cell, and ा in another cell (with the dotted circle).

The wcswidth result of "2" is consistent with this weird result, but obviously incorrect.

I imagine it's similarly broken for all Indic scripts. In fact I cannot imagine how something like wcwidth, which only returns integers, is going to work for Indic, Arabic, etc scripts. Searching for [wcwidth indic] brings up some results like mintty/mintty#553 xtermjs/xterm.js#1468 xtermjs/xterm.js#72 and see the mentions of "Indic" at https://www.cl.cam.ac.uk/~mgk25/unicode.html -- looks like a lot of software is broken.

@jquast
Owner

jquast commented Nov 28, 2018

Hello new conversationalist :)

I know the details pretty well, but re-reading the last link, the mentions of "Indic" in there paint a pretty dire picture of "no support". The terminal landscape is very rich today, but it is mostly based on libc wcwidth(3), with little font intelligence. Support for combining characters has gone further than what was written there in past years, improving because of the emoji situation, I suppose: folks want their silly icons aligned.

I personally see Hindi as the go-to test characters. I have a combining-character browser in the repo, which you're welcome to use or modify to view the effects on "Indic" languages.

looks like a lot of software is broken

Well, we can fix it! This library has had some light use in the terminal landscape for Python, and Python is very readable. Folks may be copying from us already, and I hope they do, into the next future wcwidth(). So please join if you can help :)

I have a scheme for automatically detecting a terminal's Unicode support level by introspection, using the terminal's "report cursor position" query.
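The scheme itself is not spelled out in this thread, so the following is only a rough sketch of the general idea, under stated assumptions: print a test string at the start of a line, send the standard "report cursor position" request (`ESC [ 6 n`), and read back the `ESC [ row ; col R` reply to see how many cells the terminal actually advanced. The names `parse_cursor_report` and `measure_rendered_width`, and the `write` / `read_reply` callables, are hypothetical; real code would also need to put the terminal into raw mode to read the reply.

```python
import re

# Standard request for a cursor position report, and the shape of the reply.
CPR_QUERY = "\x1b[6n"
CPR_REPLY = re.compile(r"\x1b\[(\d+);(\d+)R")


def parse_cursor_report(reply):
    """Return (row, col), both 1-based, from a reply, or None if absent."""
    match = CPR_REPLY.search(reply)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def measure_rendered_width(text, write, read_reply):
    """Estimate how many cells the terminal used to display ``text``.

    ``write`` sends a string to the terminal and ``read_reply`` returns the
    terminal's raw response; both are hypothetical hooks supplied by the
    caller. Assumes the text is short enough not to wrap to the next line.
    """
    # Return to column 1, print the text, then ask where the cursor is.
    write("\r" + text + CPR_QUERY)
    position = parse_cursor_report(read_reply())
    if position is None:
        return None
    _row, col = position
    # The cursor started at column 1, so it advanced by (col - 1) cells.
    return col - 1
```

Comparing this measured value against `wcswidth()` for a set of probe strings is one way a program could discover which sequences a given terminal renders differently.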

Anyway, if anyone is interested in taking on the bulk of the work, I will transfer knowledge and always accept test-passing PRs that make sense.

Otherwise I'm FOSS retired and unable to put in the hours, sorry and good luck.

@jquast
Owner

jquast commented Jun 1, 2020

Wow, this really screws up in iTerm2, which is one of the better terminals for these things, especially when navigating a cursor around the text. iTerm2 displays it in 1 cell, not 2, but wcwidth returns 2 for this example.

@Zarainia

I am having similar problems with various WSL terminals. Gnome Terminal under an X server or WSLg displays (Latin letter a + combining accent) as width 1, which is what wcwidth returns. When entering WSL from Windows, Windows Terminal and the default bash terminal display it with width 2, while Hyper and Terminus display it as width 1. Yet ucs-detect reports the Unicode version as 13.0.0 for all of them, despite the results being different.

@dscrofts

I too am having issues with this. Here's another specific example:

from wcwidth import wcswidth
wcswidth("चाइनीज")
6

When viewing this using the Kitty terminal, the text only occupies 4 columns, not 6. This seems to be a problem in general with most Hindi text that I've encountered.

Tested on macOS Big Sur 11.6.5 and Kitty 0.24.4

jquast added a commit that referenced this issue Oct 30, 2023
Major
-----

Bugfix zero-width characters, closes #57, #47, #45, #39, #26, #25, #24, #22, #8, wow !

This is mostly achieved by replacing `ZERO_WIDTH_CF` with dynamic parsing by Category codes in bin/update-tables.py and putting those in the zero-wide tables.

Tests
-----

- `verify-table-integrity.py` exercises a "bug" of duplicated tables that has no effect, because wcswidth() first checks for zero-width, and that is preferred in cases of conflict. This PR also resolves that error of duplication.
- new automatic tests for balinese, kr jamo, zero-width emoji, devanagari, tamil, kannada.  
- added pytest-benchmark plugin, example use:

        # baseline
        tox -epy312 -- --verbose --benchmark-save=original
        # compare
        tox -epy312 -- --verbose --benchmark-compare=.benchmarks/Linux-CPython-3.12-64bit/0001_original.json
@jquast
Owner

jquast commented Oct 30, 2023

Zero-width characters used with the Hindi language have been resolved in today's release by #91.

I created a testing tool that verifies it: at least for the Hindi text of the "Universal Declaration of Human Rights" from https://unicode.org/udhr/, wcwidth now agrees with the "kitty" and "mlterm" terminals on the measurement of every word.

@jquast jquast closed this as completed Oct 30, 2023
@dscrofts

dscrofts commented Dec 1, 2023

Just found another issue (perhaps a corner-case?) with Hindi:

>>> from wcwidth import wcswidth
>>> wcswidth("गीत")
3

This sequence should only occupy 2 cells.

@jquast
Owner

jquast commented Dec 1, 2023

@dscrofts thank you for your persistence, I really do appreciate your help with Hindi!

Can you please check that your version of wcwidth is the latest, 0.2.12? This is measured as 2 in my tests.

Just to be sure, here is my test session:

>>> import unicodedata, wcwidth
>>> wcwidth.__version__
'0.2.12'
>>> l='गीत'
>>> wcwidth.wcswidth(l)
2
>>> ', '.join([unicodedata.name(x).title() for x in l])
'Devanagari Letter Ga, Devanagari Vowel Sign Ii, Devanagari Letter Ta'
>>> [unicodedata.category(x) for x in l]
['Lo', 'Mc', 'Lo']
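
The categories in this session explain the result: the vowel sign has category Mc and is measured as zero-width, so only the two Lo letters occupy cells. As a minimal sketch of that category-based idea using only the standard library: the category set below is an assumption chosen for illustration, not the exact set that bin/update-tables.py uses, and `sketch_width` is a hypothetical helper that ignores wide (East Asian) characters entirely.

```python
import unicodedata

# Assumed set of general categories treated as zero-width in this sketch.
# The real wcwidth tables are generated separately and may differ.
ASSUMED_ZERO_WIDTH_CATEGORIES = {"Mn", "Me", "Mc", "Cf"}


def sketch_width(text):
    """Count one cell per character, except assumed zero-width categories."""
    return sum(
        0 if unicodedata.category(ch) in ASSUMED_ZERO_WIDTH_CATEGORIES else 1
        for ch in text
    )


# GA + VOWEL SIGN II + TA: the Mc vowel sign adds no cell.
print(sketch_width("\u0917\u0940\u0924"))
# KA + VOWEL SIGN AA, the simpler test case from earlier in this thread.
print(sketch_width("\u0915\u093e"))
```

This is not a replacement for `wcwidth.wcswidth()`; it only shows why treating these categories as zero-width changes the measurement of the examples reported here.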

@dscrofts

dscrofts commented Dec 2, 2023

@jquast Thanks for the quick response!

So I ran your code and got identical output. I think there was an issue copy/pasting the characters in my terminal that led to the wrong output!

After further investigation, it looks like my issue is rather with truncating the text. Specifically, if it is in the middle of a Hindi sequence. I try to left justify the text to a given width, but it seems to be breaking. This might be a good candidate for #93 to implement. I did see there was support for this previously? Any hints as to how I might go about handling/implementing this?

@jquast
Owner

jquast commented Dec 2, 2023

it looks like my issue is rather with truncating the text. Specifically, if it is in the middle of a Hindi sequence

Yes, there would be problems with breaking up a sequence that contains combining characters. It sounds like you are not writing a "left justify" function, but maybe a text wrapping function?

If I write a "wc_ljust" function modeled after the "wc_rjust" one in the readme, there is no opportunity for truncation. It only appends padding to fill to the given width; just like the built-in str.ljust() or string formatting such as f'{var:<10}', it does not truncate text:

        def wc_ljust(text, length, padding=' '):
            from wcwidth import wcswidth
            return text + padding * max(0, (length - wcswidth(text)))

The Python textwrap module tries to break strings only at whitespace, but the default argument break_long_words=True also allows it to break words that are longer than the width into pieces. Python's textwrap makes no effort to take combining characters into account; it does not understand wide or zero-width characters at all.
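
A small standard-library illustration of that behaviour, using the three-codepoint word from earlier in this thread (GA, VOWEL SIGN II, TA). The width of 1 is deliberately tiny, just to force the "long word" path:

```python
import textwrap

word = "\u0917\u0940\u0924"  # GA + VOWEL SIGN II + TA

# Default break_long_words=True: textwrap slices by len(), so the vowel
# sign ends up alone on its own line, detached from its base letter.
broken = textwrap.wrap(word, width=1)
print([[hex(ord(ch)) for ch in line] for line in broken])

# break_long_words=False: the word is kept whole even though it is
# longer than the requested width.
intact = textwrap.wrap(word, width=1, break_long_words=False)
print([[hex(ord(ch)) for ch in line] for line in intact])
```

Note that textwrap still measures with len() in both cases, so the widths it aims for are codepoint counts, not terminal cells.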

I did see there was support for this previously?

We have only ever provided the "wc_rjust" example in the readme file for this project.

In issue #93 I am referring to a terminal library of mine, blessed, that has these functions (ljust, rjust, center, and wrap). I think all of these functions would handle Hindi correctly by using break_long_words=False with wrap() to ensure it will not break sequences at combining characters if you just want to try/copy from that -- but please be aware that it will fail with ZWJ emojis and emojis with VS-16 sequences, an example:

>>> import blessed, wcwidth
>>> inp='क़ानून की निग़ाह में सभी समान हैं और सभी बिना भेदभाव के समान क़ानूनी सुरक्षा केस घोषणा का अतिक्रमण करके कोई भी भेद-भाव किया जाया उस प्रकार के भेद-भाव को किस, तो उसके विरुद्ध समान संरक्षण का अधिकार सभी को प्राप्त है ।'
>>> lines=blessed.Terminal().wrap(inp, 4, break_long_words=False)
>>> print('-|-'.join(lines))  # display word break locations with '-|-'
क़ानून-|-की-|-निग़ाह-|-में सभी-|-समान-|-हैं और-|-सभी-|-बिना-|-भेदभाव-|-के-|-समान-|-क़ानूनी-|-सुरक्षा-|-के-|-अधिकारी-|-हैं-|-यदि-|-इस-|-घोषणा-|-का-|-अतिक्रमण-|-करके-|-कोई भी-|-भेद-भाव-|-किया-|-जाया-|-उस-|-प्रकार-|-के-|-भेद-भाव-|-को किसी-|-प्रकार-|-से-|-उकसाया-|-जाया,-|-तो-|-उसके-|-विरुद्ध-|-समान-|-संरक्षण-|-का-|-अधिकार-|-सभी को-|-प्राप्त-|-है
>>> print(list(map(wcwidth.wcswidth, lines)))  # display length of each line
[3, 1, 3, 4, 3, 4, 2, 2, 4, 1, 3, 3, 4, 1, 4, 3, 2, 2, 3, 1, 6, 3, 4, 5, 2, 2, 2, 4, 1, 5, 4, 4, 1, 4, 3, 1, 3, 4, 3, 5, 1, 4, 4, 4, 3]

When asked to wrap at a width of 4 with break_long_words=False, the blessed.Terminal.wrap() function will not attempt to break words that are longer than that, which prevents it from splitting any word, in particular at combining marks.
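
For plain truncation rather than wrapping, one way to avoid cutting between a letter and its combining marks is to count cells while walking the string and stop only before a character that would need a new cell. This is a minimal sketch under assumptions, not something wcwidth or blessed provides in this form: `sketch_truncate` is a hypothetical name, the zero-width category set is assumed for illustration, and wide characters, ZWJ emoji and VS-16 sequences are not handled.

```python
import unicodedata

# Assumed zero-width categories for this sketch only.
ASSUMED_ZERO_WIDTH_CATEGORIES = {"Mn", "Me", "Mc", "Cf"}


def sketch_truncate(text, max_cells):
    """Keep as many leading characters as fit in ``max_cells`` cells.

    Zero-width characters never trigger the cut, so marks that follow
    the last kept letter stay attached to it.
    """
    kept = []
    used = 0
    for ch in text:
        cells = 0 if unicodedata.category(ch) in ASSUMED_ZERO_WIDTH_CATEGORIES else 1
        if used + cells > max_cells:
            break
        kept.append(ch)
        used += cells
    return "".join(kept)


word = "\u0917\u0940\u0924"  # GA + VOWEL SIGN II + TA
# One cell keeps GA together with its vowel sign, instead of GA alone.
print([hex(ord(ch)) for ch in sketch_truncate(word, 1)])
```

The truncated result could then be padded with a function like wc_ljust to line up columns.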

@dscrofts

dscrofts commented Dec 2, 2023

@jquast you are absolutely correct, I should be using text wrapping instead of ljust. In fact, a combination of both is what I need to have things line up correctly. Funnily enough I am already using blessed in my project, so now I ljust the wrap()'d text and all is working great. Thanks for the help and all your hard work with wcwidth and blessed :)
