Add basic Unicode support #169

MadCatX · 2024-02-25T18:56:58Z

This patch adds basic Unicode support to the UI. It relies on the FreeType font renderer and is disabled by default, behind the UI_UNICODE flag. Encoding support is currently limited to UTF-8.

What works:

Display of all Unicode characters. Tested against (https://github.com/bits/UTF-8-Unicode-Test-Documents/blob/master/UTF-8_sequence_separated/utf8_sequence_0-0x10ffff_assigned_including-unprintable-asis.txt) and (https://github.com/bits/UTF-8-Unicode-Test-Documents/blob/master/UTF-8_sequence_separated/utf8_sequence_0-0x10ffff_assigned_including-unprintable-replaced.txt)
Handling of invalid UTF-8 input. Invalid characters are replaced with ? without any other adverse effects.
Navigation in text using keyboard and mouse, selection, copy-paste, replacements.
Starting executables from paths with Unicode characters. The path currently needs to be pasted into the textbox.

What could be improved:

Word-skipping in text with Unicode characters is a bit flaky. To fix this, I suppose that the IsCharAlpha() function would have to be much more sophisticated.
Cursor position selection with mouse and invalid UTF-8 input. Sometimes the cursor is positioned a few characters behind the expected position if the expected position lies inside an invalid UTF-8 sequence. Navigation with keyboard seems to work fine.
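
For illustration, the simple classification that a later commit in this PR describes ("Consider A-Z, a-z and everything above code point 255 an alpha character") could look roughly like the sketch below. The name SketchIsAlpha is hypothetical and this is not the library's actual IsCharAlpha() code. It shows why word-skipping stays approximate: accented Latin-1 letters such as U+00E9 are not treated as alpha, while non-letters above 255 (for example U+3000, the ideographic space) are.

```c
#include <stdbool.h>

// Hypothetical helper, not the library's IsCharAlpha(): treat ASCII letters
// and every code point above 255 as "alpha" for word-skipping purposes.
static bool SketchIsAlpha(int codePoint) {
	if (codePoint >= 'A' && codePoint <= 'Z') return true;
	if (codePoint >= 'a' && codePoint <= 'z') return true;
	return codePoint > 255;
}
```

A more sophisticated version would presumably need Unicode character property data, which is beyond a heuristic like this one.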

What does not work:

Textboxes currently do not accept Unicode input from keyboard keystrokes. A workaround is to copy-paste it in.

Notes:

This patch changes the meaning of carets in UIStringSelection: a caret now counts how many glyphs it is offset from the beginning of the string, rather than how many bytes.
This patch also fixes a minor rendering bug when a text selection in UITextbox contains tabs.
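
To make the caret change in the first note concrete, here is a minimal sketch of translating a glyph-counted caret back into a byte offset. SketchGlyphToByteOffset is a hypothetical name, and the sketch assumes well-formed UTF-8 and one code point per glyph, which the patch itself may not assume.

```c
#include <stddef.h>

// Hypothetical helper: convert a caret offset counted in glyphs (code points)
// into a byte offset. Assumes well-formed UTF-8.
static ptrdiff_t SketchGlyphToByteOffset(const char *string, ptrdiff_t bytes, ptrdiff_t glyphs) {
	ptrdiff_t byteIndex = 0;

	while (byteIndex < bytes && glyphs > 0) {
		byteIndex++; // Step over the lead byte.
		// Skip continuation bytes (10xxxxxx) that belong to the same character.
		while (byteIndex < bytes && ((unsigned char) string[byteIndex] & 0xC0) == 0x80) byteIndex++;
		glyphs--;
	}

	return byteIndex; // Clamped to the end of the string.
}
```

For the 4-byte string "a\xC3\xA9z" (a, é, z), glyph offsets 0, 1, 2 and 3 map to byte offsets 0, 1, 3 and 4.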

Comments welcome :)

nakst · 2024-02-27T16:25:29Z

Thanks for making this, I'll review it properly when I have time (I'm a bit busy at the moment).

nakst

Finally I got around to making the review. Sorry about the wait.
I have a lot of comments about style here. I have quite strong opinions about code style, unfortunately. Usually when someone makes a pull request I will make the code style changes myself, but I am too busy to do that at the moment.
I think it's best if, before the UI_UNICODE changes are merged in, you address some of the things you have listed as needing improvement, like cursor positioning with the mouse.
Thank you.

nakst · 2024-03-10T18:52:29Z

luigi2.h

+#error "Unicode support requires Freetype"
+#endif
+
+#define UNICODE_MAX_CODEPOINT 0x10FFFF


All definitions internal to the UI library should be prefixed with _UI, since it is intended to be used as a single-file-header library.

nakst · 2024-03-10T18:53:05Z

luigi2.h

+#define UNICODE_MAX_CODEPOINT 0x10FFFF
+
+bool _Utf8ApplyExtendedByte(int byte, int *codePoint, int shift)
+{


Code style: The bracket should be placed on the same line as the function arguments.

nakst · 2024-03-10T18:53:27Z

luigi2.h

+		return -1;
+	}
+
+	int codePoint = int(first & (0x3F >> numExtraBytes)) << (6 * numExtraBytes);


Code style: Use (int) x to cast.

nakst · 2024-03-10T18:55:13Z

luigi2.h

+
+	int codePoint = int(first & (0x3F >> numExtraBytes)) << (6 * numExtraBytes);
+	for (ptrdiff_t idx = 1; idx < numExtraBytes + 1; idx++) {
+		if (!_Utf8ApplyExtendedByte(cString[idx], &codePoint, numExtraBytes - idx)) {


_Utf8ApplyExtendedByte really doesn't need to be a separate function here, and this doesn't need to be done in a loop. Just write out the bit-operations.

nakst · 2024-03-10T18:56:13Z

luigi2.h

+int Utf8GetCodePoint2(const char *cString, ptrdiff_t bytesLength)
+{
+	ptrdiff_t bytesConsumed;
+	return Utf8GetCodePoint(cString, bytesLength, &bytesConsumed);


Instead of adding a second function, change the semantics of Utf8GetCodePoint so that if NULL is passed for the bytesConsumed pointer, then it does not write to it.
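
As a rough sketch of what the last two comments ask for, a decoder can write out the bit operations for each sequence length and treat a NULL bytesConsumed pointer as "do not report". SketchUtf8Decode is a hypothetical name and this is not the code that was merged. It returns -1 for invalid or truncated input, and for brevity it does not reject overlong encodings, surrogates, or values above 0x10FFFF.

```c
#include <stddef.h>

// Hypothetical decoder. Returns the code point, or -1 on invalid or truncated
// input. bytesConsumed may be NULL, in which case nothing is written to it.
static int SketchUtf8Decode(const char *cString, ptrdiff_t bytesLength, ptrdiff_t *bytesConsumed) {
	const unsigned char *s = (const unsigned char *) cString;
	int codePoint = -1;
	ptrdiff_t consumed = 1; // On error, skip one byte so the caller can resynchronise.

	if (bytesLength <= 0) {
		consumed = 0;
	} else if (s[0] < 0x80) {
		codePoint = s[0];
	} else if ((s[0] & 0xE0) == 0xC0 && bytesLength >= 2 && (s[1] & 0xC0) == 0x80) {
		codePoint = ((s[0] & 0x1F) << 6) | (s[1] & 0x3F);
		consumed = 2;
	} else if ((s[0] & 0xF0) == 0xE0 && bytesLength >= 3
			&& (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80) {
		codePoint = ((s[0] & 0x0F) << 12) | ((s[1] & 0x3F) << 6) | (s[2] & 0x3F);
		consumed = 3;
	} else if ((s[0] & 0xF8) == 0xF0 && bytesLength >= 4
			&& (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80 && (s[3] & 0xC0) == 0x80) {
		codePoint = ((s[0] & 0x07) << 18) | ((s[1] & 0x3F) << 12) | ((s[2] & 0x3F) << 6) | (s[3] & 0x3F);
		consumed = 4;
	}

	if (bytesConsumed) *bytesConsumed = consumed;
	return codePoint;
}
```

A caller that only needs the code point passes NULL as the third argument, which removes the need for a second wrapper function.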

nakst · 2024-03-10T18:56:44Z

luigi2.h

+	}
+
+	ptrdiff_t length = 0;
+	ptrdiff_t byteIdx = 0;


Code style: rename to byteIndex.

nakst · 2024-03-10T18:59:36Z

luigi2.h

 		ti++;
 		if (code->content[byte + code->lines[line].offset] == '\t') while (ti % code->tabSize) ti++;
 		if (column < ti) break;
+
+#ifdef UI_UNICODE


All these #ifdefs get quite messy. It would make more sense to have a few macros that handle the basic string operations and the definition of those macros depend on whether UI_UNICODE is defined.
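
One possible shape for this suggestion, as a sketch only: define a single string-operation macro twice, once per build mode, so that callers contain no #ifdef of their own. The names SKETCH_CHAR_BYTES, SketchUtf8SequenceLength and SketchCountCharacters are hypothetical and are not the macros that ended up in the PR.

```c
#include <stddef.h>

#define UI_UNICODE // Remove this line to get the one-byte-per-character build.

#ifdef UI_UNICODE
// Length in bytes of the UTF-8 sequence introduced by a lead byte. Stray
// continuation bytes and invalid lead bytes count as a single byte.
static ptrdiff_t SketchUtf8SequenceLength(unsigned char lead) {
	if ((lead & 0xE0) == 0xC0) return 2;
	if ((lead & 0xF0) == 0xE0) return 3;
	if ((lead & 0xF8) == 0xF0) return 4;
	return 1;
}
#define SKETCH_CHAR_BYTES(string, index) SketchUtf8SequenceLength((unsigned char) (string)[index])
#else
#define SKETCH_CHAR_BYTES(string, index) ((ptrdiff_t) 1)
#endif

// Shared caller: counts characters in either build without any #ifdef.
static ptrdiff_t SketchCountCharacters(const char *string, ptrdiff_t bytes) {
	ptrdiff_t count = 0;

	for (ptrdiff_t i = 0; i < bytes; count++) {
		ptrdiff_t step = SKETCH_CHAR_BYTES(string, i);
		if (step > bytes - i) step = bytes - i; // Do not run past a truncated sequence.
		i += step;
	}

	return count;
}
```

The same pattern would extend to other basic operations (advance, step back, byte-to-column), keeping the conditional compilation in one place.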

nakst · 2024-03-10T19:00:09Z

luigi2.h


 		last <<= 8;
-		last |= c;
+		last |= (char)c;


Code style: add a space between the (cast) type and the expression.

nakst · 2024-03-10T19:00:56Z

luigi2.h

@@ -3041,6 +3200,48 @@ int _UICodeMessage(UIElement *element, UIMessage message, int di, void *dp) {
 	return 0;
 }

+#if UI_UNICODE
+void UICodeMoveCaret(UICode *code, bool backward, bool word) {


Again, if you have separate defines for the string operations I think you won't need to duplicate this entire function's logic?

MadCatX · 2024-03-14T07:42:25Z

Thanks for reviewing the PR. Don't worry about the coding style change requests, it's your code and your rules. I'll see what I can do in the upcoming days as I'm a bit busy myself ATM.

MadCatX · 2024-04-18T23:03:40Z

My apologies for taking so long to get back to this. I addressed the coding style issues and factored out some parts of functions to macros to avoid the convoluted ifdefs.

I still need to investigate the cursor positioning issue. It appears that I can reproduce it even with Unicode handling switched off. Hopefully I will have more time to work on this now...

MadCatX · 2024-04-23T18:04:00Z

I think I figured out the cursor placement issue. Let me know if you'd like me to revise the code further or do some additional tests.

nakst · 2024-05-02T18:53:43Z

Thanks, I'll try to take a look at the weekend.

nakst

Please see the attached comments. The algorithms are looking good, but there is a lot of duplicated code at the moment which needs to be simplified before these changes can be merged. Thank you!

nakst · 2024-05-06T12:05:27Z

luigi2.h

 		ti++;
-		if (code->content[i + code->lines[line].offset] == '\t') while (ti % code->tabSize) ti++;
+		_UI_ADVANCE_COLUMN(i, code, byte);


This needs to take into account tab characters, as the old code did.
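
For reference, the tab handling in the removed line can be isolated into a small function, sketched below. SketchByteToColumn is a hypothetical name, the sketch treats every byte as one character (the non-Unicode case), and it assumes tabSize is greater than zero.

```c
#include <stddef.h>

// Hypothetical helper mirroring the removed line: compute the visual column
// reached after byteIndex bytes, with each tab advancing to the next multiple
// of tabSize. Treats each byte as one character.
static int SketchByteToColumn(const char *line, ptrdiff_t byteIndex, int tabSize) {
	int column = 0;

	for (ptrdiff_t i = 0; i < byteIndex; i++) {
		column++;
		if (line[i] == '\t') while (column % tabSize) column++;
	}

	return column;
}
```

With a tab size of 4, the caret after the tab in "ab\tc" sits at column 4, and the caret after the tab in "abcd\t" sits at column 8. A replacement macro would need to reproduce this behavior.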

nakst · 2024-05-06T12:07:04Z

luigi2.h


 		last <<= 8;
-		last |= c;
+		last |= (char) c;


Writing out the & 0xFF would make this a bit clearer.
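
One reading of this comment, sketched under assumptions: the type of last is not visible in the quoted hunk, so uint32_t is assumed here, and both function names are hypothetical. On platforms where char is signed, the cast turns values of 0x80 and above into negative numbers, which then set all of the upper bits when OR-ed in. The mask states the intent of keeping only the low byte directly.

```c
#include <stdint.h>

// Cast version: for c >= 0x80 the result depends on whether char is signed.
static uint32_t SketchPushWithCast(uint32_t last, int c) {
	last <<= 8;
	last |= (char) c;
	return last;
}

// Mask version: always appends exactly the low 8 bits of c.
static uint32_t SketchPushWithMask(uint32_t last, int c) {
	last <<= 8;
	last |= (uint32_t) c & 0xFF;
	return last;
}
```

Both versions agree for ASCII input. They can differ only for byte values with the top bit set, which is exactly the range that UTF-8 multi-byte sequences use.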

nakst · 2024-05-06T12:07:54Z

luigi2.h

+	while (bytes) {
+#ifdef UI_UNICODE
+		ptrdiff_t bytesConsumed;
+		int c = Utf8GetCodePoint(string, bytes, &bytesConsumed);


Instead of having #ifdef UI_UNICODE everywhere, please merge this logic into a single macro which can be used here, in place of _UI_ADVANCE_BYTE and in place of _UI_ADVANCE_COLUMN.

nakst · 2024-05-06T12:09:11Z

luigi2.h

+#ifdef UI_UNICODE
+
+#define _UI_TEXTBOX_MOVE_CARET_BACKWARD(textbox) do { \
+	char *prev = Utf8GetPreviousChar(textbox->string, textbox->string + textbox->carets[0]); \


This logic should not be duplicated with the _UI_CODE_MOVE_ functions. There should be a common macro for doing all of this.
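
A common building block that both the textbox and the code view could share is "find the start of the previous character". The sketch below works on byte offsets rather than on pointers as Utf8GetPreviousChar does in the quoted hunk. SketchPreviousCharOffset is a hypothetical name and the sketch is not the PR's implementation.

```c
#include <stddef.h>

// Hypothetical helper: given the caret's byte offset, return the byte offset
// of the start of the previous character.
static ptrdiff_t SketchPreviousCharOffset(const char *string, ptrdiff_t caret) {
	if (caret <= 0) return 0;
	caret--;

	// Back up over at most three continuation bytes (10xxxxxx), since a UTF-8
	// sequence is at most four bytes long.
	for (int i = 0; i < 3 && caret > 0 && ((unsigned char) string[caret] & 0xC0) == 0x80; i++) {
		caret--;
	}

	return caret;
}
```

Because it only takes a string and an offset, one helper like this can serve any element type, so the per-element macros reduce to thin wrappers.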

MadCatX · 2024-05-08T14:02:24Z

Thanks for the comments. I've generalized the cursor positioning and byte<->column translation functions. Hopefully I've also fixed an issue where the code would assert on an attempt to get a code point from an empty string. As far as I could test, things seem to behave correctly.

MadCatX · 2024-06-15T21:20:49Z

I'm sorry for the ping... is there anything else you'd like me to revise?

nakst · 2024-06-16T17:32:07Z

Sorry, I have it on my list of things to do to review these changes. But I have been busy/tired lately, so I haven't done it yet.

nakst · 2024-08-21T12:31:31Z

Thanks for submitting these changes!

Add basic Unicode support

48384e4

Consider A-Z, a-z and everything above code point 255 an alpha character

08468f9

nakst reviewed Mar 10, 2024

MadCatX added 3 commits April 18, 2024 19:39

Address code style issues

e03e159

Move the index advancing logic to a macro

05113b0

Factor out some cursor-positioning code to macros

3173ef2

Finish macrofication of the Unicode functions and fix wrong caret placement after a TAB character

062ba9b

nakst reviewed May 6, 2024

MadCatX added 4 commits May 8, 2024 14:47

Generalize macros that handle cursor positioning and provide translation from column to byte for UITextbox

ac05a97

Generalize translations between bytes and columns

5858f57

Guard against byte position pointing outside the available number of bytes

8330b43

Fix cursor positioning macros for non-unicode build

64bddf5

nakst merged commit 583c3b7 into nakst:master Aug 21, 2024
