Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unicode double-width character support #34

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion pyrepl/historical_reader.py
Original file line number Diff line number Diff line change
Expand Up @@ -256,7 +256,7 @@ def prepare(self):
def get_prompt(self, lineno, cursor_on_line):
if cursor_on_line and self.isearch_direction != ISEARCH_DIRECTION_NONE:
d = 'rf'[self.isearch_direction == ISEARCH_DIRECTION_FORWARDS]
return "(%s-search `%s') "%(d, self.isearch_term)
return u"(%s-search `%s') "%(d, self.isearch_term)
else:
return super(HistoricalReader, self).get_prompt(lineno, cursor_on_line)

Expand Down
66 changes: 46 additions & 20 deletions pyrepl/reader.py
Original file line number Diff line number Diff line change
Expand Up @@ -20,15 +20,27 @@
# CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

from __future__ import unicode_literals
import sys
import unicodedata
from pyrepl import commands
from pyrepl import input
try:
unicode
def decode(x, enc = sys.stdout.encoding):
if not isinstance(x, unicode):
return unicode(x, enc)
return x
except NameError:
unicode = str
unichr = chr
basestring = bytes, str
decode = lambda x, _ = None: x


def width(c):
return 2 if unicodedata.east_asian_width(c) in "FW" else 1
def wlen(s):
return sum(map(width, s))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about using https://pypi.org/project/wcwidth/?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking in from the PyPy side, wcwidth seems like a small package with JIT-freindly structure (no complicated classes or deep function chains). It was dormant from 2016 to 2020, but seems to be active again. The license is MIT, which is fine for PyPy.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that that bit of code could be improved, but what would be the advantage of using wcwidth over python's built-in unicode support? It seems to provide mostly the same functionality except with its own unicode tables that need to be kept up to date.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wcwidth can also return 0 or -1, which the current code doesn't know how to handle. And I'm not sure what the best way to handle those would be tbh.

For improved performance, we could use e.g. @functools.lru_cache.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(note that pyrepl already shows e.g. control characters as ^c etc.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is possible to find the Unicode version of a terminal by somewhat intrusive detection https://github.com/jquast/ucs-detect

According to the table in the README, lxterminal and kitty support the same unicode version (11.0.0).
But as my screenshot shows, they don't display emoji the same way.

(Slightly OT: I'd consider switching to kitty, but unfortunately it doesn't support Japanese input 😿)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I briefly tested using wcwidth: obfusk/pyrepl-wide@eba676a.

  • I don't notice any improvement atm.
  • I had to modify the original prompt length calculation (because that subtracts the length of ansi escape sequences).
  • _my_unctrl() currently turns a zero-width joiner into "\u200d" and I'm not sure returning it unmodified won't break something else.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@blueyed is there a suspicion that this PR is worse in some cases than the current code? If not maybe it can be merged as-is and other alternatives could be parts of future PRs?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even with wcwidth's upcoming width() function (and setting aside the differences between terminals), handling zero-width joiners etc. properly is not easy, since width(s) == sum(width(c) for c in s) is no longer true for all strings. This would presumably require more careful and extensive modifications to the code than the current PR.

I think wcwidth is likely to be quite useful in improving unicode support even more in the future, but for now I think this PR is a good first step. Further improvements get a lot more complicated and should probably be separate, future PRs.

I'd also point out that readline doesn't really seem to handle emoji any better.

AFAIK this PR doesn't make anything worse. Unfortunately (but understandably since I don't really know a good way to test all the curses stuff), the test suite can't rule that out.

I do however think the PR is still a little rough and could use a code review and probably some updates to comments and documentation before merging.

Copy link
Author

@obfusk obfusk Feb 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking in from the PyPy side, wcwidth seems like a small package with JIT-freindly structure (no complicated classes or deep function chains). It was dormant from 2016 to 2020, but seems to be active again. The license is MIT, which is fine for PyPy.

I don't think wcwidth supports python2 though, which I presume is an issue for PyPy. [edit: I was mistaken]



def _make_unctrl_map():
Expand All @@ -39,8 +51,8 @@ def _make_unctrl_map():
for i in range(32):
c = unichr(i)
uc_map[c] = '^' + unichr(ord('A') + i - 1)
uc_map[b'\t'] = ' ' # display TABs as 4 characters
uc_map[b'\177'] = unicode('^?')
uc_map['\t'] = ' ' # display TABs as 4 characters
uc_map['\177'] = unicode('^?')
for i in range(256):
c = unichr(i)
if c not in uc_map:
Expand All @@ -53,7 +65,7 @@ def _my_unctrl(c, u=_make_unctrl_map()):
return u[c]
else:
if unicodedata.category(c).startswith('C'):
return br'\u%04x' % ord(c)
return '\\u%04x' % ord(c)
else:
return c

Expand All @@ -75,7 +87,7 @@ def disp_str(buffer, join=''.join, uc=_my_unctrl):
s = [uc(x) for x in buffer]
b = [] # XXX: bytearray
for x in s:
b.append(1)
b.append(width(x[0]))
b.extend([0] * (len(x) - 1))
return join(s), b

Expand Down Expand Up @@ -280,7 +292,7 @@ def calc_screen(self):
for mline in self.msg.split("\n"):
screen.append(mline)
screeninfo.append((0, []))
self.lxy = p, ln
# self.lxy = p, ln
prompt = self.get_prompt(ln, ll >= p >= 0)
while '\n' in prompt:
pre_prompt, _, prompt = prompt.partition('\n')
Expand All @@ -289,18 +301,29 @@ def calc_screen(self):
p -= ll + 1
prompt, lp = self.process_prompt(prompt)
l, l2 = disp_str(line)
wrapcount = (len(l) + lp) // w
wrapcount = (wlen(l) + lp) // w
if wrapcount == 0:
screen.append(prompt + l)
screeninfo.append((lp, l2 + [1]))
else:
screen.append(prompt + l[:w - lp] + "\\")
screeninfo.append((lp, l2[:w - lp]))
for i in range(-lp + w, -lp + wrapcount * w, w):
screen.append(l[i:i + w] + "\\")
screeninfo.append((0, l2[i:i + w]))
screen.append(l[wrapcount * w - lp:])
screeninfo.append((0, l2[wrapcount * w - lp:] + [1]))
for i in range(wrapcount + 1):
s = lp if i == 0 else 0
r = w - s
j = 0
while j < len(l2):
n = l2[j] or 1
if n > r:
break
else:
r -= n
j += 1
pre = prompt if i == 0 else ""
post = "" if i == wrapcount else "\\"
after = [1] if i == wrapcount else []
screen.append(pre + l[:j] + post)
screeninfo.append((s, l2[:j] + after))
l = l[j:]
l2 = l2[j:]
self.screeninfo = screeninfo
self.cxy = self.pos2xy(self.pos)
if self.msg and self.msg_at_bottom:
Expand All @@ -318,7 +341,7 @@ def process_prompt(self, prompt):
is returned with these control characters removed. """

out_prompt = ''
l = len(prompt)
l = wlen(prompt)
pos = 0
while True:
s = prompt.find('\x01', pos)
Expand Down Expand Up @@ -420,7 +443,7 @@ def get_prompt(self, lineno, cursor_on_line):
# the object on which str() was called. This ensures that even if the
# same object is used e.g. for ps1 and ps2, str() is called only once.
if res not in self._pscache:
self._pscache[res] = str(res)
self._pscache[res] = decode(res)
return self._pscache[res]

def push_input_trans(self, itrans):
Expand All @@ -438,23 +461,26 @@ def pos2xy(self, pos):
if pos == len(self.buffer):
y = len(self.screeninfo) - 1
p, l2 = self.screeninfo[y]
return p + len(l2) - 1, y
return p + sum(l2) + l2.count(0) - 1, y
else:
for p, l2 in self.screeninfo:
l = l2.count(1)
l = len(l2) - l2.count(0)
if l > pos:
break
else:
pos -= l
y += 1
c = 0
i = 0
while c < pos:
c += l2[i]
j = 0
while j < pos:
j += 1 if l2[i] else 0
c += l2[i] or 1
i += 1
while l2[i] == 0:
c += 1
i += 1
return p + i, y
return p + c, y

def insert(self, text):
"""Insert 'text' at the insertion point."""
Expand Down
70 changes: 44 additions & 26 deletions pyrepl/unix_console.py
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,7 @@
import re
import time
import sys
import unicodedata
from fcntl import ioctl
from . import curses
from .fancy_termios import tcgetattr, tcsetattr
Expand All @@ -44,6 +45,13 @@ class InvalidTerminal(RuntimeError):
except NameError:
unicode = str


def width(c):
return 2 if unicodedata.east_asian_width(c) in "FW" else 1
def wlen(s):
return sum(map(width, s))


_error = (termios.error, curses.error, InvalidTerminal)

# there are arguments for changing this to "refresh"
Expand Down Expand Up @@ -247,46 +255,56 @@ def __write_changed_line(self, y, oldline, newline, px):
# structuring this function are equally painful (I'm trying to
# avoid writing code generators these days...)
x = 0
minlen = min(len(oldline), len(newline))
i = 0
minlen = min(wlen(oldline), wlen(newline))
pi = 0
xx = 0
for c in oldline:
xx += width(c)
pi += 1
if xx >= px: break
#
# reuse the oldline as much as possible, but stop as soon as we
# encounter an ESCAPE, because it might be the start of an escape
# sequene
#XXX unicode check!
while x < minlen and oldline[x] == newline[x] and newline[x] != '\x1b':
x += 1
if oldline[x:] == newline[x+1:] and self.ich1:
if (y == self.__posxy[1] and x > self.__posxy[0] and
oldline[px:x] == newline[px+1:x+1]):
while x < minlen and oldline[i] == newline[i] and newline[i] != '\x1b':
x += width(newline[i])
i += 1
if oldline[i:] == newline[i+1:] and self.ich1:
if (y == self.__posxy[1] and x > self.__posxy[0]
and oldline[pi:i] == newline[pi+1:i+1]):
i = pi
x = px
self.__move(x, y)
self.__write_code(self.ich1)
self.__write(newline[x])
self.__posxy = x + 1, y
elif x < minlen and oldline[x + 1:] == newline[x + 1:]:
cw = width(newline[i])
self.__write_code(cw*self.ich1)
self.__write(newline[i])
self.__posxy = x + cw, y
elif (x < minlen and oldline[i+1:] == newline[i+1:]
and width(oldline[i]) == width(newline[i])):
self.__move(x, y)
self.__write(newline[x])
self.__posxy = x + 1, y
elif (self.dch1 and self.ich1 and len(newline) == self.width
and x < len(newline) - 2
and newline[x+1:-1] == oldline[x:-2]):
self.__write(newline[i])
self.__posxy = x + width(newline[i]), y
elif (self.dch1 and self.ich1 and wlen(newline) == self.width
and x < wlen(newline) - 2
and newline[i+1:] == oldline[i:-1]):
cw = width(newline[i])
self.__hide_cursor()
self.__move(self.width - 2, y)
self.__posxy = self.width - 2, y
self.__write_code(self.dch1)
self.__move(self.width - cw, y)
self.__posxy = self.width - cw, y
self.__write_code(cw*self.dch1)
self.__move(x, y)
self.__write_code(self.ich1)
self.__write(newline[x])
self.__posxy = x + 1, y
self.__write_code(cw*self.ich1)
self.__write(newline[i])
self.__posxy = x + cw, y
else:
self.__hide_cursor()
self.__move(x, y)
if len(oldline) > len(newline):
if wlen(oldline) > wlen(newline):
self.__write_code(self._el)
self.__write(newline[x:])
self.__posxy = len(newline), y
self.__write(newline[i:])
self.__posxy = wlen(newline), y

#XXX: check for unicode mess
if '\x1b' in newline:
# ANSI escape characters are present, so we can't assume
# anything about the position of the cursor. Moving the cursor
Expand Down
2 changes: 1 addition & 1 deletion testing/infrastructure.py
Original file line number Diff line number Diff line change
Expand Up @@ -64,7 +64,7 @@ class TestReader(Reader):
__test__ = False

def get_prompt(self, lineno, cursor_on_line):
return ''
return u''

def refresh(self):
Reader.refresh(self)
Expand Down