fourls
Collaborator

@fourls fourls commented Jul 23, 2025

Fixes #368 by improving TextLiteralNode::characterEscapeToChar. Note that "ANSI encoding" is interpreted here as the system native encoding, so this assumes that the system running the Sonar scan has the same native encoding as the system compiling the code.

@fourls fourls requested a review from cirras July 23, 2025 03:39
@fourls
Collaborator Author

fourls commented Jul 23, 2025

One thing I forgot to note - I removed support for binary character escapes (e.g. #%11101110) since as far as I can tell Delphi does not recognise them.

@zaneduffield
Collaborator

The system encoding isn't what the compiler is using. The compiler is using the configured codepage, which may be 0, which means the system encoding.

@zaneduffield
Collaborator

Actually... let me check that.

@zaneduffield
Collaborator

zaneduffield commented Jul 23, 2025

It's a bit subtle, but I was right that the compiler does use the configured codepage for this kind of thing.

The reason it's subtle is that if the variable being assigned to is an AnsiChar (or AnsiString), then the codepage at compile time is irrelevant, because the bytes are stored as written and interpreted at runtime. However, if the variable being assigned to is a WideChar (or WideString), then any ANSI bytes are converted to Unicode at compile time using the configured codepage.

I'm not sure how exactly this affects the change here, but I would expect the configured codepage to become involved at some point.
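
To make the codepage dependence concrete, here is a minimal Java sketch that decodes the same byte under two different Windows code pages. The class and method names are hypothetical and not taken from the PR, and it only illustrates the compile-time conversion by analogy; it assumes the JVM provides the windows-1252 and windows-1251 charsets.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; not part of the PR.
final class CodepageDemo {
  // Decode a single ANSI byte into a UTF-16 char using the named code page.
  static char decode(int value, String charsetName) {
    byte[] bytes = {(byte) value};
    return new String(bytes, Charset.forName(charsetName)).charAt(0);
  }

  public static void main(String[] args) {
    // The same byte, 0xE9, maps to different characters per code page.
    System.out.printf("U+%04X%n", (int) decode(0xE9, "windows-1252")); // U+00E9
    System.out.printf("U+%04X%n", (int) decode(0xE9, "windows-1251")); // U+0439
  }
}
```

If the analyzer picks a different code page than the one the compiler was configured with, it would compute a different character for the same escape.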

Comment on lines +184 to 191
   private char characterEscapeToChar(String image) {
     image = image.substring(1);
     int radix = 10;

-    switch (image.charAt(0)) {
-      case '$':
-        radix = 16;
-        image = image.substring(1);
-        break;
-      case '%':
-        radix = 2;
-        image = image.substring(1);
-        break;
-      default:
-        // do nothing
+    if (image.charAt(0) == '$') {
+      radix = 16;
+      image = image.substring(1);
     }
Collaborator

Let's keep the handling for binary character escapes even though they're currently not syntactically valid.
Our grammar allows it, and it's arguably a bug that the compiler doesn't currently recognize them. I think it's likely a future Delphi version will fix that.
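
For reference, a minimal standalone sketch of escape parsing that keeps the binary (`%`) prefix alongside decimal and hexadecimal. The class and method names are hypothetical; this is not the PR's code, just an illustration of the radix handling being discussed.

```java
// Hypothetical helper; names are illustrative, not the PR's code.
final class CharacterEscapes {
  // Parses the numeric value of an escape such as "#65", "#$41" or "#%1000001".
  static int parseValue(String image) {
    String digits = image.substring(1); // drop the leading '#'
    int radix = 10;
    switch (digits.charAt(0)) {
      case '$': // hexadecimal escape
        radix = 16;
        digits = digits.substring(1);
        break;
      case '%': // binary escape (accepted by the grammar, per the comment above)
        radix = 2;
        digits = digits.substring(1);
        break;
      default: // decimal escape, nothing to strip
        break;
    }
    return Integer.parseInt(digits, radix);
  }
}
```

All three of `#65`, `#$41` and `#%1000001` parse to the same value, 65.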

Comment on lines +196 to +201
    if (isHighCharUnicode() || character > 255) {
      // With HIGHCHARUNICODE ON, all escapes are interpreted as UTF-16.
      // Escapes above 255 are always interpreted as UTF-16.
      return character;
    } else {
      // With HIGHCHARUNICODE OFF, escapes between 0-255 are interpreted in the system code page.
Collaborator

According to the HIGHCHARUNICODE documentation, only the #128-#255 range is affected.

This wouldn't seem to matter if you're only thinking about single-byte ANSI codepages that are supersets of ASCII.
However, there are multi-byte ANSI codepages that aren't supersets of ASCII (I think?), and it seems like this aspect of the behavior would matter for interpreting those.

Collaborator

With that being said, I've looked at the Shift_JIS character table and found that the first 127 characters are still ASCII.
Maybe it really doesn't matter and I'm just getting muddled up with the fact that there are codepages that aren't binary supersets of ASCII.

Even so, we should probably follow what the documentation says.
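
A rough sketch of what following the documentation might look like: only escapes in the #128-#255 range go through the ANSI charset, and only when HIGHCHARUNICODE is off. The class name, method name and parameters are hypothetical, and the sketch only handles single-byte code pages (a lone lead byte of a multi-byte code page would not decode to a meaningful character).

```java
import java.nio.charset.Charset;

// Hypothetical sketch; not the PR's implementation.
final class HighCharUnicodeSketch {
  static char escapeToChar(int value, boolean highCharUnicode, Charset ansiCharset) {
    // Per the reviewer's reading of the docs, only #128-#255 is affected by the switch.
    boolean affected = value >= 128 && value <= 255;
    if (highCharUnicode || !affected) {
      // Interpreted directly as a UTF-16 code unit.
      return (char) value;
    }
    // HIGHCHARUNICODE OFF and value in 128..255: decode the byte in the ANSI code page.
    byte[] bytes = {(byte) value};
    return new String(bytes, ansiCharset).charAt(0);
  }
}
```

With windows-1252 as the ANSI charset, `#$80` becomes U+20AC when the switch is off and stays U+0080 when it is on; values below 128 or above 255 pass through unchanged either way.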

TextLiteralNodeImpl node = new TextLiteralNodeImpl(DelphiLexer.TkTextLiteral);
@ParameterizedTest(name = "HIGHCHARUNICODE = {0}")
@ValueSource(booleans = {true, false})
void testGetImageWithCharacterEscapes(boolean highCharUnicode) {
Collaborator

The switch also affects the type of the expression - not just the way the string is interpreted.

.isActiveSwitch(SwitchKind.HIGHCHARUNICODE, getTokenIndex());
}

public Charset getAnsiCharset() {
Collaborator

I think we should probably have a CharsetUtils or something in frontend.
I'd also call these nativeCharset.
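
Something along these lines, as a minimal sketch. The `CharsetUtils` and `nativeCharset` names come from the suggestion above; the fallback to `Charset.defaultCharset()` is an assumption added here for JVMs where the `native.encoding` property (available since JDK 17) is missing or names an unsupported charset.

```java
import java.nio.charset.Charset;

// Sketch of the suggested utility; the fallback behaviour is an assumption.
final class CharsetUtils {
  private CharsetUtils() {
    // Utility class
  }

  static Charset nativeCharset() {
    String name = System.getProperty("native.encoding");
    if (name != null) {
      try {
        return Charset.forName(name);
      } catch (IllegalArgumentException e) {
        // Illegal or unsupported charset name: fall through to the default.
      }
    }
    return Charset.defaultCharset();
  }
}
```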

}

public Charset getAnsiCharset() {
return Charset.forName(System.getProperty("native.encoding"));
Collaborator

Assuming that the compiler is actually using the configured codepage to interpret these escapes, we should:

  • read DCC_Codepage from dproj files
  • emit a warning if conflicting DCC_Codepage values are found
  • expose an analyzer property to override DCC_Codepage and ignore it altogether
  • fall back to the system's native encoding if none of these are provided
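
The resolution order above could be sketched roughly as follows. Everything here is hypothetical: the class and method names, the choice to fall back to the native charset when dproj values conflict, and the `"Cp" + number` alias lookup, which works for common Windows code pages in the JDK but is not guaranteed for every code page. Code page 0 is treated as "system encoding", as noted earlier in the thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.OptionalInt;
import java.util.Set;

// Hypothetical sketch of the proposed resolution order; not the PR's code.
final class AnsiCharsetResolver {
  static Charset resolve(
      OptionalInt override, Set<Integer> dprojCodepages, Charset nativeCharset) {
    // 1. An explicit analyzer property wins and DCC_Codepage is ignored.
    if (override.isPresent()) {
      return fromCodepage(override.getAsInt(), nativeCharset);
    }
    // 2. A single, unambiguous DCC_Codepage value from the dproj files.
    if (dprojCodepages.size() == 1) {
      return fromCodepage(dprojCodepages.iterator().next(), nativeCharset);
    }
    // 3. Conflicting values: warn (assumption: then fall back to native).
    if (dprojCodepages.size() > 1) {
      System.err.println("Conflicting DCC_Codepage values found: " + dprojCodepages);
    }
    // 4. Nothing usable configured: use the system's native encoding.
    return nativeCharset;
  }

  static Charset fromCodepage(int codepage, Charset nativeCharset) {
    if (codepage == 0) {
      return nativeCharset; // 0 means "use the system encoding"
    }
    if (codepage == 65001) {
      return StandardCharsets.UTF_8; // Windows code page number for UTF-8
    }
    try {
      return Charset.forName("Cp" + codepage);
    } catch (IllegalArgumentException e) {
      return nativeCharset; // unknown to this JVM; assumption: fall back
    }
  }
}
```

Passing the native charset in as a parameter keeps the sketch independent of how that charset is looked up.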

Development

Successfully merging this pull request may close these issues.

Type inference for character literals (and HIGHCHARUNICODE handling)