-
Notifications
You must be signed in to change notification settings - Fork 25
Add full support for HIGHCHARUNICODE
#397
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
One thing I forgot to note - I removed support for binary character escapes (e.g. |
The system encoding isn't what the compiler is using. The compiler is using the configured codepage, which may be 0, which means the system encoding. |
Actually... let me check that. |
It's a bit subtle, but I was right that the compiler does use the configured codepage for this kind of thing. The reason it's subtle, is that if the type of the variable being assigned to is I'm not sure how exactly this affects the change here, but I would expect the configured codepage to become involved at some point. |
private char characterEscapeToChar(String image) { | ||
image = image.substring(1); | ||
int radix = 10; | ||
|
||
switch (image.charAt(0)) { | ||
case '$': | ||
radix = 16; | ||
image = image.substring(1); | ||
break; | ||
case '%': | ||
radix = 2; | ||
image = image.substring(1); | ||
break; | ||
default: | ||
// do nothing | ||
if (image.charAt(0) == '$') { | ||
radix = 16; | ||
image = image.substring(1); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep the handling for binary character escapes even though they're currently not syntactically valid.
Our grammar allows it, and it's arguably a bug that the compiler doesn't currently recognize them. I think it's likely a future Delphi version will fix that.
if (isHighCharUnicode() || character > 255) { | ||
// With HIGHCHARUNICODE ON, all escapes are interpreted as UTF-16. | ||
// Escapes above 255 are always interpreted as UTF-16. | ||
return character; | ||
} else { | ||
// With HIGHCHARUNICODE OFF, escapes between 0-255 are interpreted in the system code page. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the HIGHCHARUNICODE
documentation, only the #128
-#255
range is affected.
This wouldn't seem to matter, if you're only thinking about single-byte ANSI codepages that are supersets of ASCII.
However, there's multi-byte ANSI codepages that aren't supersets of ASCII (I think?), and it seems like this aspect of the behavior would matter for interpreting those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With that being said, I've looked at the Shift_JIS character table and found that the first 127 characters are still ASCII.
Maybe it really doesn't matter and I'm just getting muddled up with the fact that there are codepages that aren't binary supersets of ASCII.
Even so, we should probably follow what the documentation says.
TextLiteralNodeImpl node = new TextLiteralNodeImpl(DelphiLexer.TkTextLiteral); | ||
@ParameterizedTest(name = "HIGHCHARUNICODE = {0}") | ||
@ValueSource(booleans = {true, false}) | ||
void testGetImageWithCharacterEscapes(boolean highCharUnicode) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The switch also affects the type of the expression - not just the way the string is interpreted.
.isActiveSwitch(SwitchKind.HIGHCHARUNICODE, getTokenIndex()); | ||
} | ||
|
||
public Charset getAnsiCharset() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should probably have a CharsetUtils
or something in frontend.
I'd also call these nativeCharset
.
} | ||
|
||
public Charset getAnsiCharset() { | ||
return Charset.forName(System.getProperty("native.encoding")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming that the compiler is actually using the configured codepage to interpret these escapes, we should:
- read
DCC_Codepage
from dproj files - emit a warning if conflicting
DCC_Codepage
values are found - expose an analyzer property to override
DCC_Codepage
and ignore it altogether - fall back to the system's native encoding if none of these are provided
Fixes #368 by improving
TextLiteralNode::characterEscapeToChar
. Note that the interpretation of "ANSI encoding" is the system native encoding, so it is assumed that the system the Sonar scan runs on has the same native encoding as the system compiling the code.