Multi-codepoint emojis #39

willmcgugan · 2020-06-09T22:17:24Z

Hi,

Can wcwidth help me with multi-codepoint emojis?

For instance, here I want to get the cell width for a "woman_mechanic_dark_skin_tone" emoji, which renders in the terminal as 2 cells, but wcswidth reports a width of 6 because it is adding up all the modifiers.

>>> s="👩\U0001F3FF\u200d🔧"
>>> print(repr(s))
'👩🏿\u200d🔧'
>>> from wcwidth import wcswidth
>>> wcswidth(s)
6
>>> print(s+"\n--")
👩🏿‍🔧
--

I've found support for this kind of emoji to be inconsistent across terminals, so maybe this is a lost cause, but is there some kind of standard for these emoji modifiers?

jquast · 2020-06-10T01:36:17Z

I think wcwidth/wcswidth should help somehow, yes. These didn't exist in the first release of the wcwidth.c that this code is based upon, and when updating for new specs I failed to parse them from the data files or otherwise take them into account. This is a bug/missing feature, thanks!

willmcgugan · 2020-06-10T11:35:10Z

That's great, thanks.

Hope this doesn't complicate things too much. I've been learning about how these emoji are encoded, and all I can say is yuck.

You might know this already... there is a skin tone modifier which changes the skin tone of the preceding emoji and would have zero width. But it can also appear by itself and is rendered as a colored box if not preceded by an emoji taking up 2 cell widths (at least on iterm). That can be followed by a "zero width joiner" character which attaches another codepoint. In my first example that would be a wrench symbol, which makes the emoji a mechanic. All this was gleaned from https://emojipedia.org/
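
For reference, the sequence described above can be taken apart with Python's standard unicodedata module. This is only an illustrative sketch, not part of wcwidth; the variable names are made up for the example.

```python
import unicodedata

# The "woman mechanic, dark skin tone" sequence from the first comment,
# written with explicit escapes so every codepoint is visible.
s = "\U0001F469\U0001F3FF\u200d\U0001F527"

# Pair each codepoint with its official Unicode character name.
parts = [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in s]

for code, name in parts:
    print(code, name)
```

The four entries are the base emoji, the skin tone modifier, the zero width joiner, and the wrench, in that order.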

jquast · 2020-09-07T13:55:09Z

I began to draft some code for this purpose a bit ago, pushed branch https://github.com/jquast/wcwidth/tree/emoji-zwj

I think the hardest parts are done (parsing Unicode data files for emoji ZWJ); still WIP.

tonycpsu · 2021-01-17T20:36:42Z

@jquast any update on this WIP? I was going to see if I could move the ball forward, but when I try your branch, I get:

ModuleNotFoundError: No module named 'wcwidth.table_emoji_zjw'

Looks like the file containing the table wasn't checked in.

jquast · 2021-01-17T20:38:57Z

Try running tox; the tables are made by code generation, which I think is documented. I do hope to resume this issue in the next month or so. Thanks for your interest.

jquast · 2021-01-17T20:40:39Z

bin/update-tables.py

DragonRulerX · 2021-01-21T07:09:21Z

I just pulled wcwidth for the first time today when using tabulate in python.
I decided to dive into that code and found that tabulate relies heavily on this library.
So, I figured I may as well post this here too, in case it helps with visibility of the issue:
astanin/python-tabulate#108

jquast · 2021-01-29T15:37:46Z

I think that wcswidth returning -1 for any non-printable or indeterminable character has caused folks to rely on cheats like sum(max(0, wcwidth(u)) for u in unicode_string). The problem with that is we wouldn't be able to determine multi-codepoint emoji lengths.

The -1 return value is probably not a good idea for Python; it's simply an API compatible with all other wcswidth implementations.

This WIP branch proposes a new API function, wcswidth.width, that just does its best to return the width of a full string, with no -1 return value. If a control character like \n or \t is in there, we just ignore it; downstream libraries will have to do their own checks and measures for that.

As a new function, we remain API compatible, but downstream libraries will want to use the new function for this feature, which I'll probably also try to submit to the top 10 or so.
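
To illustrate the difference between the per-codepoint sum and a sequence-aware measurement, here is a minimal sketch using only the standard library. It is not the code from the WIP branch: cell_width is a rough stand-in for wcwidth() built on unicodedata, and sequence_width and its rules (a codepoint following a zero width joiner, or a skin tone modifier following a wide character, adds no cells) are assumptions made for the example.

```python
import unicodedata

ZWJ = "\u200d"
# Emoji skin tone modifiers, U+1F3FB through U+1F3FF.
SKIN_TONES = range(0x1F3FB, 0x1F3FF + 1)


def cell_width(ch):
    """Rough per-codepoint width; a stand-in for wcwidth(), never negative."""
    # Control, format, and combining characters occupy no cells in this sketch.
    if unicodedata.category(ch) in ("Cc", "Cf", "Mn", "Me"):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def naive_width(text):
    """The per-codepoint sum: every codepoint is measured in isolation."""
    return sum(cell_width(ch) for ch in text)


def sequence_width(text):
    """Hypothetical sequence-aware width: joined codepoints add no cells."""
    total = 0
    previous = ""
    for ch in text:
        # A codepoint directly after a zero width joiner merges into the glyph.
        joined = previous == ZWJ
        # A skin tone modifier directly after a wide character recolors it.
        modifier = (
            ord(ch) in SKIN_TONES
            and previous != ""
            and cell_width(previous) == 2
        )
        if not (joined or modifier):
            total += cell_width(ch)
        previous = ch
    return total


mechanic = "\U0001F469\U0001F3FF\u200d\U0001F527"
print(naive_width(mechanic), sequence_width(mechanic))
```

With the emoji from the first comment, naive_width gives 6 and sequence_width gives 2. A string such as "a\nb" measures 2 under both, since the control character is simply ignored rather than forcing a -1 result.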

DragonRulerX · 2021-02-04T09:26:28Z

I'm a little confused. Are you saying there is a fix for the issue I linked above or that this is still a WIP?
I'm hoping to either patch in the fix myself if there is one or to pull down the new library update when it's available.

tonycpsu · 2021-11-30T05:45:30Z

Any updates here? A lot of downstream projects looking for a fix.

Major
-----

Bugfix zero-width characters, closes #57, #47, #45, #39, #26, #25, #24, #22, #8, wow!

This is mostly achieved by replacing `ZERO_WIDTH_CF` with dynamic parsing by Category codes in bin/update-tables.py and putting those in the zero-wide tables.

Tests
-----

- `verify-table-integrity.py` exercises a "bug" of duplicated tables that has no effect, because wcswidth() first checks for zero-width, and that is preferred in cases of conflict. This PR also resolves that error of duplication.
- new automatic tests for balinese, kr jamo, zero-width emoji, devanagari, tamil, kannada.
- added pytest-benchmark plugin, example use:

      # baseline
      tox -epy312 -- --verbose --benchmark-save=original
      # compare
      tox -epy312 -- --verbose --benchmark-compare=.benchmarks/Linux-CPython-3.12-64bit/0001_original.json

jquast · 2023-10-30T19:11:05Z

Fixed by #91 in today's release.

I also wrote a tool to test terminals for Emoji ZWJ for anyone interested, https://pypi.org/project/ucs-detect/

jquast added the bug label Jun 10, 2020

rybarczykj mentioned this issue Jul 9, 2020

Corrected cursor positioning on inputs with full-width (2 column) characters bpython/bpython#817

Merged

jquast mentioned this issue Aug 22, 2020

Variation selectors are not correctly handled #45

Closed

tonycpsu mentioned this issue Jan 18, 2021

LineBox rline not positioned correctly around Text containing symbols from Unicode block 'Miscellaneous Symbols and Pictographs' urwid/urwid#225

Open

dankamongmen mentioned this issue Feb 5, 2021

combining emoji aren't correctly sized dankamongmen/notcurses#1329

Open

jquast mentioned this issue Feb 25, 2021

unicode double-width character support pypy/pyrepl#34

Open

polm mentioned this issue Jun 26, 2021

Display errors with emoji of various widths saulpw/visidata#758

Closed

JoshuaWise mentioned this issue Sep 23, 2021

Support for multi-codepoint emojis timoxley/wcwidth#8

Open

GalaxySnail mentioned this issue Feb 4, 2023

Devanagari's zero-width characters are not accounted for properly #47

Closed

jquast mentioned this issue Oct 19, 2023

Bugfixes for zero-width characters #91

Merged

jquast closed this as completed Oct 30, 2023
