Add full support for `HIGHCHARUNICODE` #397

fourls · 2025-07-23T03:39:20Z

Fixes #368 by improving TextLiteralNode::characterEscapeToChar. Note that "ANSI encoding" is interpreted as the system's native encoding, so this assumes that the machine running the Sonar scan has the same native encoding as the machine compiling the code.

fourls · 2025-07-23T03:40:49Z

One thing I forgot to note - I removed support for binary character escapes (e.g. #%11101110) since as far as I can tell Delphi does not recognise them.

zaneduffield · 2025-07-23T03:46:09Z

The compiler doesn't use the system encoding directly. It uses the configured codepage, which may be 0, meaning the system encoding.

zaneduffield · 2025-07-23T03:47:50Z

Actually... let me check that.

zaneduffield · 2025-07-23T04:04:32Z

It's a bit subtle, but I was right that the compiler does use the configured codepage for this kind of thing.

The subtlety is that if the type of the variable being assigned to is AnsiChar (or AnsiString), then the compile-time codepage is irrelevant, because the bytes are stored as written and interpreted at runtime. However, if the type of the variable being assigned to is WideChar (or WideString), then any ANSI bytes are converted to Unicode at compile time using the configured codepage.

I'm not sure how exactly this affects the change here, but I would expect the configured codepage to become involved at some point.
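To illustrate why the configured codepage matters for the WideChar case, here is a minimal Java sketch. The class and method names are hypothetical and not part of this PR; it only shows that the same ANSI byte converts to different UTF-16 characters depending on which codepage is used for the conversion.

```java
import java.nio.charset.Charset;

// Hypothetical illustration, not code from this PR.
public final class CompileTimeCodepageSketch {
  private CompileTimeCodepageSketch() {}

  /** Converts a single ANSI byte to UTF-16 using the given codepage. */
  public static char toWideChar(int ansiByte, Charset configuredCodepage) {
    byte[] bytes = {(byte) ansiByte};
    return new String(bytes, configuredCodepage).charAt(0);
  }

  public static void main(String[] args) {
    // Byte 0x80 is U+20AC (euro sign) in windows-1252...
    System.out.printf("U+%04X%n", (int) toWideChar(0x80, Charset.forName("windows-1252")));
    // ...but U+0402 (Cyrillic capital DJE) in windows-1251.
    System.out.printf("U+%04X%n", (int) toWideChar(0x80, Charset.forName("windows-1251")));
  }
}
```

If the scan machine and the build use different codepages, the analyzer would see a different character than the compiled program does.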

cirras · 2025-07-25T03:25:47Z

delphi-frontend/src/main/java/au/com/integradev/delphi/antlr/ast/node/TextLiteralNodeImpl.java

+  private char characterEscapeToChar(String image) {
    image = image.substring(1);
    int radix = 10;

-    switch (image.charAt(0)) {
-      case '$':
-        radix = 16;
-        image = image.substring(1);
-        break;
-      case '%':
-        radix = 2;
-        image = image.substring(1);
-        break;
-      default:
-        // do nothing
+    if (image.charAt(0) == '$') {
+      radix = 16;
+      image = image.substring(1);
    }


Let's keep the handling for binary character escapes even though they're currently not syntactically valid.
Our grammar allows it, and it's arguably a bug that the compiler doesn't currently recognize them. I think it's likely a future Delphi version will fix that.
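For reference, a standalone sketch of escape parsing that keeps all three radixes could look like the following. The class and method names are hypothetical; the switch mirrors the code being removed in the diff above, with the final parse step assumed to be a plain `Integer.parseInt`.

```java
// Hypothetical illustration, not code from this PR.
public final class CharacterEscapeSketch {
  private CharacterEscapeSketch() {}

  /** Parses an escape such as "#65", "#$41" or "#%1000001" into its numeric value. */
  public static int parseEscape(String image) {
    // Drop the leading '#'.
    String digits = image.substring(1);
    int radix = 10;

    switch (digits.charAt(0)) {
      case '$':
        // Hexadecimal escape.
        radix = 16;
        digits = digits.substring(1);
        break;
      case '%':
        // Binary escape: accepted by the grammar, even if the compiler rejects it today.
        radix = 2;
        digits = digits.substring(1);
        break;
      default:
        // Decimal escape: no prefix to strip.
    }

    return Integer.parseInt(digits, radix);
  }
}
```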

cirras · 2025-07-25T03:33:29Z

delphi-frontend/src/main/java/au/com/integradev/delphi/antlr/ast/node/TextLiteralNodeImpl.java

+    if (isHighCharUnicode() || character > 255) {
+      // With HIGHCHARUNICODE ON, all escapes are interpreted as UTF-16.
+      // Escapes above 255 are always interpreted as UTF-16.
+      return character;
+    } else {
+      // With HIGHCHARUNICODE OFF, escapes between 0-255 are interpreted in the system code page.


According to the HIGHCHARUNICODE documentation, only the #128-#255 range is affected.

This wouldn't seem to matter if you're only thinking about single-byte ANSI codepages that are supersets of ASCII.
However, there are multi-byte ANSI codepages that aren't supersets of ASCII (I think?), and it seems like this aspect of the behavior would matter for interpreting those.

With that being said, I've looked at the Shift_JIS character table and found that the first 127 characters are still ASCII.
Maybe it really doesn't matter and I'm just getting muddled up with the fact that there are codepages that aren't binary supersets of ASCII.

Even so, we should probably follow what the documentation says.
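A rough sketch of what following the documentation could look like: only values in #128-#255 go through the ANSI codepage, and only when HIGHCHARUNICODE is off. The class name, method name, and parameters are hypothetical, and the sketch assumes a single-byte codepage (in a multi-byte codepage, a lone lead byte would decode to the replacement character).

```java
import java.nio.charset.Charset;

// Hypothetical illustration, not code from this PR.
public final class HighCharUnicodeSketch {
  private HighCharUnicodeSketch() {}

  /**
   * Interprets a numeric character escape.
   *
   * @param value the parsed escape value, assumed to fit in a UTF-16 code unit
   * @param highCharUnicode whether HIGHCHARUNICODE is ON at the escape
   * @param ansiCharset the codepage used for the #128-#255 range (an assumption of this sketch)
   */
  public static char interpret(int value, boolean highCharUnicode, Charset ansiCharset) {
    if (highCharUnicode || value < 128 || value > 255) {
      // Outside #128-#255, or with the switch ON, the value is taken as UTF-16 directly.
      return (char) value;
    }
    // With the switch OFF, #128-#255 is an ANSI byte decoded through the codepage.
    byte[] bytes = {(byte) value};
    return new String(bytes, ansiCharset).charAt(0);
  }
}
```

Restricting the ANSI path to #128-#255 keeps #0-#127 independent of the codepage, which matches the documented range regardless of whether any codepage differs from ASCII there.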

cirras · 2025-07-25T03:40:36Z

...-frontend/src/test/java/au/com/integradev/delphi/antlr/ast/node/TextLiteralNodeImplTest.java

-    TextLiteralNodeImpl node = new TextLiteralNodeImpl(DelphiLexer.TkTextLiteral);
+  @ParameterizedTest(name = "HIGHCHARUNICODE = {0}")
+  @ValueSource(booleans = {true, false})
+  void testGetImageWithCharacterEscapes(boolean highCharUnicode) {


The switch also affects the type of the expression - not just the way the string is interpreted.

cirras · 2025-07-25T03:41:50Z

delphi-frontend/src/main/java/au/com/integradev/delphi/antlr/ast/node/TextLiteralNodeImpl.java

+        .isActiveSwitch(SwitchKind.HIGHCHARUNICODE, getTokenIndex());
+  }
+
+  public Charset getAnsiCharset() {


I think we should probably have a CharsetUtils or something in frontend.
I'd also call these nativeCharset.

cirras · 2025-07-25T03:52:45Z

delphi-frontend/src/main/java/au/com/integradev/delphi/antlr/ast/node/TextLiteralNodeImpl.java

+  }
+
+  public Charset getAnsiCharset() {
+    return Charset.forName(System.getProperty("native.encoding"));


Assuming that the compiler is actually using the configured codepage to interpret these escapes, we should:

- read DCC_Codepage from dproj files
- emit a warning if conflicting DCC_Codepage values are found
- expose an analyzer property to override DCC_Codepage and ignore it altogether
- fall back to the system's native encoding if none of these are provided
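The resolution order above could be sketched roughly as follows. Everything here is hypothetical: the class, the method signatures, the codepage-to-charset name mapping, and the choice to fall back to the native encoding when dproj files conflict are all assumptions made for illustration, not decisions from this PR.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical illustration, not code from this PR.
public final class CodepageResolverSketch {
  private CodepageResolverSketch() {}

  public static Charset resolve(
      Optional<Integer> override,
      Set<Integer> dprojCodepages,
      Charset nativeCharset,
      Consumer<String> warn) {
    // 1. An analyzer property overrides DCC_Codepage altogether.
    if (override.isPresent()) {
      return toCharset(override.get(), nativeCharset);
    }
    // 2. Conflicting values from dproj files produce a warning.
    //    Falling back to the native encoding here is an assumption of this sketch.
    if (dprojCodepages.size() > 1) {
      warn.accept("Conflicting DCC_Codepage values found: " + dprojCodepages);
      return nativeCharset;
    }
    // 3. A single DCC_Codepage value is used as-is.
    if (dprojCodepages.size() == 1) {
      return toCharset(dprojCodepages.iterator().next(), nativeCharset);
    }
    // 4. Nothing configured: use the system's native encoding.
    return nativeCharset;
  }

  /** Maps a Windows codepage number to a Java charset (assumed name mapping). */
  static Charset toCharset(int codepage, Charset nativeCharset) {
    if (codepage == 0) {
      // Codepage 0 means the system encoding.
      return nativeCharset;
    }
    if (codepage == 65001) {
      return StandardCharsets.UTF_8;
    }
    for (String name : List.of("windows-" + codepage, "Cp" + codepage)) {
      if (Charset.isSupported(name)) {
        return Charset.forName(name);
      }
    }
    // Unknown codepage: fall back rather than fail.
    return nativeCharset;
  }
}
```

Passing the native charset in as a parameter, instead of reading `native.encoding` inside the method, keeps the resolution logic testable on any machine.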

Add full support for HIGHCHARUNICODE

caa6e91

fourls requested a review from cirras July 23, 2025 03:39

cirras requested changes Jul 25, 2025
