Unicode: '鿛' is categorized as unassigned codepoint #8748

Closed
g-andrade opened this issue Aug 26, 2024 · 2 comments · Fixed by #8752
Labels
bug Issue is reported as a bug team:VM Assigned to OTP team VM

Comments

@g-andrade
Contributor

Describe the bug

The undocumented function unicode_util:lookup/1, which I'm not supposed to use, categorizes '鿛' as "Other / not assigned" (Cn) instead of the expected "Other letter" (Lo).

Because the function is undocumented, closing this issue right away may be justified, but I thought I should report it, as this may not be the intended internal behaviour.

To Reproduce

% unicode_util:lookup($鿛).
#{category => {other,not_assigned},
  canon => [],ccc => 0,compat => []}

Expected behavior

% unicode_util:lookup($鿛).
#{category => {letter,other},
  canon => [],ccc => 0,compat => []}

Affected versions

OTP 26.2.5.2

@g-andrade g-andrade added the bug Issue is reported as a bug label Aug 26, 2024
@IngelaAndin IngelaAndin added the team:VM Assigned to OTP team VM label Aug 27, 2024
@dgud
Contributor

dgud commented Aug 27, 2024

While looking for clues, I discovered that UnicodeData.txt can contain ranges, so it was an easy fix.

For backward compatibility, ranges in the file UnicodeData.txt are specified by entries for the start and end characters of the range, rather than by the form "X..Y". The start character is indicated by a range identifier, followed by a comma and the string "First", in angle brackets. This entry takes the place of a regular character name in field 1 for that line. The end character is indicated on the next line with the same range identifier, followed by a comma and the string "Last", in angle brackets:
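To make the range convention quoted above concrete, here is a minimal sketch in Erlang. It is not the code from the actual fix: the module name `ucd_ranges`, its functions, and the three sample lines are assumptions made for illustration, and the end point of the CJK range in particular depends on the Unicode version. The idea is that a parser which treats every line as a single code point only ever sees the first and last characters of such a range, so everything in between looks unassigned unless the `First`/`Last` pair is expanded.

```erlang
-module(ucd_ranges).
-export([parse/1, category/2, sample/0]).

%% Illustrative sample lines in UnicodeData.txt format (assumed data;
%% the end of the CJK range varies between Unicode versions).
sample() ->
    ["0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
     "4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;",
     "9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;"].

%% Turn a list of lines into {First, Last, GeneralCategory} tuples.
%% A "<Name, First>" entry is paired with the "<Name, Last>" entry
%% on the line that follows it.
parse(Lines) ->
    parse(Lines, []).

parse([], Acc) ->
    lists:reverse(Acc);
parse([Line | Rest], Acc) ->
    {Code, Name, Cat} = fields(Line),
    case lists:suffix(", First>", Name) of
        true ->
            %% The next line must be the matching "Last" entry
            %% with the same general category.
            [LastLine | Rest1] = Rest,
            {Last, LastName, Cat} = fields(LastLine),
            true = lists:suffix(", Last>", LastName),
            parse(Rest1, [{Code, Last, Cat} | Acc]);
        false ->
            %% A regular entry covers exactly one code point.
            parse(Rest, [{Code, Code, Cat} | Acc])
    end.

%% Extract the code point, name and general category (fields 0, 1, 2).
fields(Line) ->
    [Hex, Name, Cat | _] = string:split(Line, ";", all),
    {list_to_integer(Hex, 16), Name, Cat}.

%% General category of a code point; code points not covered by any
%% entry are reported as "Cn" (unassigned).
category(CP, Ranges) ->
    case [C || {First, Last, C} <- Ranges, First =< CP, CP =< Last] of
        [Found | _] -> Found;
        [] -> "Cn"
    end.
```

With the sample data, `ucd_ranges:category(16#4E2D, ucd_ranges:parse(ucd_ranges:sample()))` yields `"Lo"`, whereas a parser that ignored the `First`/`Last` markers would only know about U+4E00 and U+9FFF and would fall through to `"Cn"` for everything in between.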

@dgud dgud self-assigned this Aug 27, 2024
dgud added a commit that referenced this issue Sep 9, 2024
* dgud/stdlib/unicode-fix/GH-8748/OTP-19210:
  Handle ranges in UnicodeData.txt
@dgud
Contributor

dgud commented Oct 22, 2024

Merged

@dgud dgud closed this as completed Oct 22, 2024