Multi-codepoint emojis #39

Closed
willmcgugan opened this issue Jun 9, 2020 · 11 comments

@willmcgugan

Hi,

Can wcwidth help me with multi-codepoint emojis?

For instance, here I want to get the cell width for a "woman_mechanic_dark_skin_tone" emoji, which renders in the terminal as 2 cells, but wcswidth reports a width of 6 because it is adding up all the modifiers.

>>> s="👩\U0001F3FF\u200d🔧"
>>> print(repr(s))
'👩🏿\u200d🔧'
>>> from wcwidth import wcswidth
>>> wcswidth(s)
6
>>> print(s+"\n--")
👩🏿‍🔧
--

I've found support for this kind of emoji to be inconsistent across terminals, so maybe this is a lost cause, but is there some kind of standard for these emoji modifiers?

@jquast
Owner

jquast commented Jun 10, 2020

I think wcwidth/wcswidth should help somehow, yes. These didn’t exist in the first release of wcwidth.c, which this code is based upon, and when updating for new specs I simply failed to parse them from the data files or otherwise take them into account. This is a bug/missing feature, thanks!

@jquast jquast added the bug label Jun 10, 2020
@willmcgugan
Author

That's great, thanks.

Hope this doesn't complicate things too much. I've been learning about how these emoji are encoded, and all I can say is yuck.

You might know this already... there is a skin tone modifier which changes the skin tone of the preceding emoji and would have zero width. But it can also appear by itself, and is then rendered as a colored box taking up 2 cells if not preceded by an emoji (at least on iTerm). The modifier can be followed by a "zero width joiner" character which attaches another codepoint. In my first example that would be a wrench symbol, which makes the emoji a mechanic. All this was gleaned from https://emojipedia.org/
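
The behaviour described above can be sketched with the standard library alone. This is a rough illustration, not wcwidth's implementation: `cell_width` and `sketch_width` are hypothetical helpers, and the rules (a skin tone modifier merges into a preceding wide character, and a codepoint after a zero width joiner merges into the preceding wide character) are simplifying assumptions that real terminals may not follow.

```python
import unicodedata

ZWJ = "\u200d"                         # ZERO WIDTH JOINER
SKIN_TONES = range(0x1F3FB, 0x1F400)   # skin tone modifiers U+1F3FB..U+1F3FF

def cell_width(ch):
    """Rough width of one codepoint measured in isolation (hypothetical helper)."""
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0  # combining marks and format characters such as ZWJ
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

def sketch_width(text):
    """Width of a string, merging modifiers and ZWJ-joined codepoints (assumed rules)."""
    total = 0
    prev_wide = False   # the last counted codepoint occupied 2 cells
    after_zwj = False   # the previous codepoint was a ZWJ following a wide character
    for ch in text:
        if ch == ZWJ:
            after_zwj = prev_wide
            continue
        if after_zwj:
            after_zwj = False
            continue    # joined codepoint merges into the previous emoji
        if ord(ch) in SKIN_TONES and prev_wide:
            continue    # modifier recolours the previous emoji
        w = cell_width(ch)
        total += w
        prev_wide = (w == 2)
    return total

s = "👩\U0001F3FF\u200d🔧"
per_codepoint = sum(cell_width(ch) for ch in s)   # adds up every codepoint
merged = sketch_width(s)                          # treats the sequence as one emoji
```

Here `per_codepoint` counts the woman, the modifier and the wrench separately, while `merged` counts only the base emoji. A lone modifier, or one following a narrow character, is still counted as its own 2 cells.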

@jquast
Owner

jquast commented Sep 7, 2020

I began to draft some code for this purpose a bit ago, pushed branch https://github.com/jquast/wcwidth/tree/emoji-zwj

I think the hardest parts are done (parsing Unicode data files for emoji ZWJ); still WIP.

@tonycpsu

@jquast any update on this WIP? I was going to see if I could move the ball forward, but when I try your branch, I get:

ModuleNotFoundError: No module named 'wcwidth.table_emoji_zjw'

Looks like the file containing the table wasn't checked in.

@jquast
Owner

jquast commented Jan 17, 2021

Try running tox; the tables are made by code generation, and I think it is documented. I do hope to resume this issue in the next month or so, thanks for your interest.

@jquast
Owner

jquast commented Jan 17, 2021

bin/update-tables.py

@DragonRulerX

I just pulled wcwidth for the first time today when using tabulate in Python.
I decided to dive into that code and found that tabulate relies heavily on this library.
So, I figured I may as well post this here too, just in case it helps with visibility of the issue:
astanin/python-tabulate#108

@jquast
Owner

jquast commented Jan 29, 2021

I think that wcswidth returning -1 for any non-printable or indeterminable character has caused folks to rely on cheats, like sum(max(0, wcwidth(u)) for u in unicode_string), and the problem with that is we wouldn’t be able to determine multi-codepoint emoji lengths.

The -1 return value is probably not a good idea for Python; it’s simply an API compatible with all other wcswidth implementations.

This WIP branch proposes a new API function, wcwidth.width, that just does its best to return the width of a full string, with no -1 return value. If a control character like \n or \t is in there, we just ignore it; downstream libraries will have to do their own checks and measures for that.

As a new function, we remain API compatible, but downstream libraries will want to use the new function for this feature, which I’ll probably also try to submit to the top 10 or so.
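
To make the trade-off concrete, here is a minimal sketch of the two behaviours using only the standard library. `codepoint_width` is a hypothetical stand-in for a per-codepoint width function (it is not wcwidth's implementation), `strict_width` mimics the -1 convention, and `per_codepoint_cheat` mirrors the sum(max(0, ...)) workaround quoted above.

```python
import unicodedata

def codepoint_width(ch):
    """Hypothetical stand-in: -1 for control characters, 0 for marks/format, else 1 or 2."""
    category = unicodedata.category(ch)
    if category == "Cc":
        return -1
    if category in ("Mn", "Me", "Cf"):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

def strict_width(text):
    """Return -1 as soon as any codepoint has no determinable width."""
    total = 0
    for ch in text:
        w = codepoint_width(ch)
        if w < 0:
            return -1
        total += w
    return total

def per_codepoint_cheat(text):
    """The workaround: clamp -1 to 0 and add up each codepoint on its own."""
    return sum(max(0, codepoint_width(ch)) for ch in text)

tabbed = "a\tb"
mechanic = "👩\u200d🔧"   # woman + zero width joiner + wrench
```

With these definitions, `strict_width(tabbed)` gives up with -1 while `per_codepoint_cheat(tabbed)` ignores the tab. The cheat still measures each codepoint of `mechanic` separately, so a ZWJ sequence is counted as two emoji. That is why a string-level function is needed for this feature.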

@DragonRulerX

I'm a little confused. Are you saying there is a fix for the issue I linked above or that this is still a WIP?
I'm hoping to either patch in the fix myself if there is one or to pull down the new library update when it's available.

@tonycpsu

Any updates here? A lot of downstream projects are looking for a fix.

jquast added a commit that referenced this issue Oct 30, 2023
Major
-----

Bugfix zero-width characters, closes #57, #47, #45, #39, #26, #25, #24, #22, #8, wow!

This is mostly achieved by replacing `ZERO_WIDTH_CF` with dynamic parsing by Category codes in bin/update-tables.py and putting those in the zero-wide tables.

Tests
-----

- `verify-table-integrity.py` exercises a "bug" of duplicated tables that has no effect, because wcswidth() first checks for zero-width, and that is preferred in cases of conflict. This PR also resolves that error of duplication.
- new automatic tests for balinese, kr jamo, zero-width emoji, devanagari, tamil, kannada.  
- added pytest-benchmark plugin, example use:

        # baseline
        tox -epy312 -- --verbose --benchmark-save=original
        # compare
        tox -epy312 -- --verbose --benchmark-compare=.benchmarks/Linux-CPython-3.12-64bit/0001_original.json
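
The idea in the commit message of deriving a zero-width table from category codes can be sketched as follows. This is an illustration only: the category list in `ZERO_WIDTH_CATEGORIES` is an assumption rather than the list used by bin/update-tables.py, the function names are hypothetical, and Python's built-in `unicodedata` is used in place of parsing the Unicode data files.

```python
import bisect
import sys
import unicodedata

# Assumption: illustrative categories, not necessarily the project's actual list.
ZERO_WIDTH_CATEGORIES = ("Mn", "Me", "Cf")

def build_zero_width_table(categories=ZERO_WIDTH_CATEGORIES):
    """Return sorted, inclusive (start, end) codepoint ranges for the given categories."""
    table = []
    start = None
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)) in categories:
            if start is None:
                start = cp            # open a new range
        elif start is not None:
            table.append((start, cp - 1))   # close the current range
            start = None
    if start is not None:
        table.append((start, sys.maxunicode))
    return table

ZERO_WIDTH_TABLE = build_zero_width_table()
_RANGE_STARTS = [start for start, _ in ZERO_WIDTH_TABLE]

def is_zero_width(ch):
    """Binary-search the range table for the codepoint of ch."""
    cp = ord(ch)
    i = bisect.bisect_right(_RANGE_STARTS, cp) - 1
    return i >= 0 and ZERO_WIDTH_TABLE[i][0] <= cp <= ZERO_WIDTH_TABLE[i][1]
```

Building the table scans every codepoint, so it takes a moment. Generating the tables ahead of time, as the project does with its code-generation step, avoids paying that cost at import.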
@jquast
Owner

jquast commented Oct 30, 2023

Fixed by #91 in today's release.

I also wrote a tool to test terminals for Emoji ZWJ for anyone interested, https://pypi.org/project/ucs-detect/

@jquast jquast closed this as completed Oct 30, 2023