Wrong width for Hindi on macOS, but correct width on Linux #25
Comments
Thank you very much for the specific example. This is definitely a bug to try to address!
In case it is important for future troubleshooting, I was using iTerm 2 version 3.1.6 with Bash version 4.4.23.
Almost surely, the "correct" width on Linux is because of a broken terminal that does not display the characters correctly. On macOS, the terminal uses the system libraries for text rendering, and properly renders the combining characters. On Linux, the terminal does something horrible. Here's a much simpler test case:
That's U+0915 DEVANAGARI LETTER KA followed by U+093E DEVANAGARI VOWEL SIGN AA. It is supposed to be displayed as a single grapheme (e.g. you should not be able to place the cursor between them to type, and definitely you should not see a dotted circle), but on Linux terminals I get very weird results, with क in one cell, and ा in another cell (with the dotted circle). The wcswidth result of "2" is consistent with this weird result, but obviously incorrect.
I imagine it's similarly broken for all Indic scripts. In fact I cannot imagine how something like wcwidth, which only returns integers, is going to work for Indic, Arabic, etc. scripts. Searching for [wcwidth indic] brings up some results like mintty/mintty#553 xtermjs/xterm.js#1468 xtermjs/xterm.js#72 and see the mentions of "Indic" at https://www.cl.cam.ac.uk/~mgk25/unicode.html -- looks like a lot of software is broken.
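To make the two code points in that test case visible, here is a minimal sketch using only the standard-library `unicodedata` module. The helper name `describe` is made up for illustration; it only lists what Unicode says about each character and says nothing about how a given terminal renders them.

```python
import unicodedata


def describe(text):
    """Return (code point, name, general category) for each character."""
    return [
        (
            f"U+{ord(ch):04X}",
            unicodedata.name(ch, "<unnamed>"),
            unicodedata.category(ch),
        )
        for ch in text
    ]


# KA followed by VOWEL SIGN AA, the pair discussed above.
for row in describe("\u0915\u093e"):
    print(row)
```

The second character has general category `Mc` (spacing combining mark), which is why a width function that counts one cell per code point overestimates this pair.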
Hello, new conversationalist :) I know the details pretty well, but re-reading the last link, the mentions of “Indic” there paint a pretty dire picture of “no support”. The terminal landscape is very rich today, but it is mostly based on libc wcwidth(3) with little font intelligence. Support for combining characters has improved over what has been written here in past years, I suppose because of the emoticon situation: folks want silly icons aligned. I have personally seen Hindi used as the go-to test characters. I have a combining-character browser in the repo that you're welcome to use or modify to view “Indic” language effects.
Well, we can fix it! This library has seen some light use in the terminal landscape for Python, and Python is very readable, so folks may already be copying us, and I hope they do, into the next wcwidth(). So please join if you can help :) I have a scheme for automatically detecting the Unicode support level by introspection, using the terminal's report-cursor-position query. Anyway, if anyone is interested in taking on the bulk of the work, I will transfer knowledge and always accept test-passing PRs that make sense. Otherwise I'm FOSS-retired and unable to put in the hours. Sorry, and good luck.
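The cursor-position idea mentioned above could look roughly like the sketch below. This is not the scheme the maintainer has in mind, only an illustration under stated assumptions: the terminal answers the standard `ESC [ 6 n` query with `ESC [ row ; col R`, the script runs on a Unix tty, and the text does not wrap at the right margin. Both function names are hypothetical.

```python
import re
import sys

# A terminal answers the query ESC [ 6 n with ESC [ row ; col R.
_CPR = re.compile(r"\x1b\[(\d+);(\d+)R")


def parse_cursor_report(response):
    """Return (row, col) from a cursor position report, or None."""
    match = _CPR.search(response)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def measure_rendered_width(text, stream_in=sys.stdin, stream_out=sys.stdout):
    """Hypothetical helper: print text at column 1 and ask where the cursor is.

    Assumes a Unix tty that supports the query and that text fits on one line.
    """
    import termios
    import tty

    fd = stream_in.fileno()
    saved = termios.tcgetattr(fd)
    try:
        tty.setcbreak(fd)  # no echo, no line buffering while reading the reply
        stream_out.write("\r" + text + "\x1b[6n")
        stream_out.flush()
        reply = ""
        while not reply.endswith("R"):
            reply += stream_in.read(1)
    finally:
        termios.tcsetattr(fd, termios.TCSADRAIN, saved)
    position = parse_cursor_report(reply)
    # Columns are 1-based, so the cursor sits at 1 + width after printing.
    return None if position is None else position[1] - 1
```

Comparing the value returned by `measure_rendered_width` with the library's own result for the same string would show whether a given terminal agrees with the tables.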
Wow, this really screws up in iTerm2, which is one of the better terminals for these things, especially when navigating a cursor around the text. iTerm2 displays it in 1 cell, not 2, but wcwidth determines 2 for this example.
I am having similar problems with various WSL terminals. Gnome Terminal under an X-server or WSLg displays |
I too am having issues with this. Here's another specific example:
>>> from wcwidth import wcswidth
>>> wcswidth("चाइनीज")
6
When viewing this using the Kitty terminal, the text only occupies 4 columns, not 6. This seems to be a problem in general with most Hindi text that I've encountered. Tested on macOS Big Sur 11.6.5 and Kitty 0.24.4.
Major
-----

Bugfix zero-width characters, closes #57, #47, #45, #39, #26, #25, #24, #22, #8, wow!

This is mostly achieved by replacing `ZERO_WIDTH_CF` with dynamic parsing by Category codes in bin/update-tables.py and putting those in the zero-wide tables.

Tests
-----

- `verify-table-integrity.py` exercises a "bug" of duplicated tables that has no effect, because wcswidth() first checks for zero-width, and that is preferred in cases of conflict. This PR also resolves that error of duplication.
- new automatic tests for balinese, kr jamo, zero-width emoji, devanagari, tamil, kannada.
- added pytest-benchmark plugin, example use:

      # baseline
      tox -epy312 -- --verbose --benchmark-save=original
      # compare
      tox -epy312 -- --verbose --benchmark-compare=.benchmarks/Linux-CPython-3.12-64bit/0001_original.json
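As a rough illustration of classifying zero-width characters by general category instead of a hand-kept list, here is a minimal sketch built on the standard-library `unicodedata` module. The category set below is an assumption chosen for this example; the set actually used by bin/update-tables.py is not spelled out in this thread. `estimate_width` is a hypothetical name, and it ignores wide East Asian characters and emoji entirely.

```python
import unicodedata

# Assumption: categories treated as occupying no cell of their own.
ZERO_WIDTH_CATEGORIES = frozenset({"Mn", "Me", "Mc", "Cf"})


def is_zero_width(char):
    """True if the character's general category is in the assumed set."""
    return unicodedata.category(char) in ZERO_WIDTH_CATEGORIES


def estimate_width(text):
    """Count one cell per character that is not classified as zero-width."""
    return sum(0 if is_zero_width(ch) else 1 for ch in text)


# The six code points of the word reported earlier in this thread.
print(estimate_width("\u091a\u093e\u0907\u0928\u0940\u091c"))
```

Under this assumption the six-code-point word from the Kitty report comes out as 4, which matches the column count observed there.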
Zero-width characters used with the Hindi language have been resolved in today's release by #91. I created a testing tool that verifies this: at least for the "Universal Declaration of Human Rights" in Hindi from https://unicode.org/udhr/, wcwidth now agrees with the "kitty" and "mlterm" terminals on the measurement of every word.
Just found another issue (perhaps a corner-case?) with Hindi:
>>> from wcwidth import wcswidth
>>> wcswidth("गीत")
3
This sequence should only occupy 2 cells.
@dscrofts thank you for your persistence, I really do appreciate your help with Hindi! Can you please check that your version of wcwidth is the latest? Just to be sure, here is my test session:
>>> import unicodedata, wcwidth
>>> wcwidth.__version__
'0.2.12'
>>> l='गीत'
>>> wcwidth.wcswidth(l)
2
>>> ', '.join([unicodedata.name(x).title() for x in l])
'Devanagari Letter Ga, Devanagari Vowel Sign Ii, Devanagari Letter Ta'
>>> [unicodedata.category(x) for x in l]
['Lo', 'Mc', 'Lo']
@jquast Thanks for the quick response! I ran your code and got identical output. I think there was an issue copy/pasting the characters in my terminal that led to the wrong output! After further investigation, it looks like my issue is rather with truncating the text, specifically when the cut falls in the middle of a Hindi sequence. I try to left justify the text to a given width, but it seems to be breaking. This might be a good candidate for #93 to implement. I did see there was support for this previously? Any hints as to how I might go about handling/implementing this?
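One way to avoid cutting in the middle of a sequence is to truncate only at boundaries between a base character and its trailing combining marks. The sketch below is an assumption-laden illustration, not something wcwidth or blessed provides: both function names are hypothetical, marks are identified by general category only, every cluster is assumed to occupy one cell, and conjuncts formed with a virama are not merged into a single cluster.

```python
import unicodedata

# Assumption: these general categories attach to the preceding character.
_MARK_CATEGORIES = frozenset({"Mn", "Me", "Mc"})


def split_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for char in text:
        if clusters and unicodedata.category(char) in _MARK_CATEGORIES:
            clusters[-1] += char
        else:
            clusters.append(char)
    return clusters


def truncate_clusters(text, max_cells):
    """Keep whole clusters only, assuming each cluster occupies one cell."""
    return "".join(split_clusters(text)[: max(0, max_cells)])
```

For the three-code-point word from the session above, `truncate_clusters(word, 1)` keeps the letter GA together with its vowel sign instead of separating them.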
Yes, there would be problems with breaking up a sequence that contains combining characters. It sounds like you are not writing a "left justify" function, but maybe a text wrapping function? If I write a "wc_ljust" function after the one in the readme, there is no opportunity for truncation. It only appends spaces to fill to the given width, just like the built-in str.ljust() or string formatting like
def wc_ljust(text, length, padding=' '):
    from wcwidth import wcswidth
    # append padding until the text occupies `length` cells; never truncates
    return text + padding * max(0, (length - wcswidth(text)))
The python textwrap module tries to break strings only at whitespace, but the default argument
We have only ever provided the "wc_rjust" example in the readme file for this project. In issue #93 I am referring to a terminal library of mine, blessed, that has these functions (ljust, rjust, center, and wrap). I think all of these functions would handle Hindi correctly by using
>>> import blessed, wcwidth
>>> inp='क़ानून की निग़ाह में सभी समान हैं और सभी बिना भेदभाव के समान क़ानूनी सुरक्षा के अधिकारी हैं । यदि इस घोषणा का अतिक्रमण करके कोई भी भेद-भाव किया जाया उस प्रकार के भेद-भाव को किसी प्रकार से उकसाया जाया, तो उसके विरुद्ध समान संरक्षण का अधिकार सभी को प्राप्त है ।'
>>> lines=blessed.Terminal().wrap(inp, 4, break_long_words=False)
>>> print('-|-'.join(lines)) # display word break locations with '-|-'
क़ानून-|-की-|-निग़ाह-|-में सभी-|-समान-|-हैं और-|-सभी-|-बिना-|-भेदभाव-|-के-|-समान-|-क़ानूनी-|-सुरक्षा-|-के-|-अधिकारी-|-हैं ।-|-यदि-|-इस-|-घोषणा-|-का-|-अतिक्रमण-|-करके-|-कोई भी-|-भेद-भाव-|-किया-|-जाया-|-उस-|-प्रकार-|-के-|-भेद-भाव-|-को किसी-|-प्रकार-|-से-|-उकसाया-|-जाया,-|-तो-|-उसके-|-विरुद्ध-|-समान-|-संरक्षण-|-का-|-अधिकार-|-सभी को-|-प्राप्त-|-है ।
>>> print(list(map(wcwidth.wcswidth, lines))) # display length of each line
[3, 1, 3, 4, 3, 4, 2, 2, 4, 1, 3, 3, 4, 1, 4, 3, 2, 2, 3, 1, 6, 3, 4, 5, 2, 2, 2, 4, 1, 5, 4, 4, 1, 4, 3, 1, 3, 4, 3, 5, 1, 4, 4, 4, 3]
Requesting the words to be broken at width of 4 with
@jquast you are absolutely correct, I should be using text wrapping instead of ljust. In fact, a combination of both is what I need to have things line up correctly. Funnily enough I am already using blessed in my project, so now I ljust the wrap()'d text and all is working great. Thanks for the help and all your hard work with wcwidth and blessed :) |
I tried using wcwidth to calculate the length of the name for the city of Mumbai in Hindi (बॉम्बे हिंदी)
On macOS 10.13.5 using Python 3.6.5, I see a visual width of 5 characters and a calculated width of 9 characters.
On Ubuntu 18.04 using Python 3.6.5, I see a visual width of 9 characters and a calculated width of 9 characters.
Thank you by the way for creating a very useful module!