Commit a85f11e

Clarify UNICODE_ESCAPE valid token value
This clarifies the UNICODE_ESCAPE rule to state that the hex value must be a valid Unicode scalar value. This resolves the problem that a string like `"\u{ffffff}"` is not a valid token, but the grammar did not reflect that.

I don't see a practical way to define this with character ranges; the resulting expression is huge.

Note that this restriction means that the UNICODE_ESCAPE rule will not match an invalid value, and that in all the places where UNICODE_ESCAPE is used, the preceding character must *not* be `\`, which forces those rules to fail their match. In turn, the only rules that contain UNICODE_ESCAPE have `'` or `"` characters, which won't match any other rule in the grammar, forcing them to fail the parse.

If all those assumptions seem too fragile, then we can consider adding the [cut operator](rust-lang#2104) just after the `\u` so that the interpretation is clear that a failure to match the part from the opening brace is an immediate parse failure.
1 parent 11f84ce commit a85f11e

1 file changed: +4 -2 lines changed

src/tokens.md

Lines changed: 4 additions & 2 deletions
@@ -157,9 +157,11 @@ ASCII_ESCAPE ->
 | `\n` | `\r` | `\t` | `\\` | `\0`

 UNICODE_ESCAPE ->
-`\u{` ( HEX_DIGIT `_`* ){1..6} `}`
+`\u{` ( HEX_DIGIT `_`* ){1..6} _valid hex char value_ `}`[^valid-hex-char]
 ```

+[^valid-hex-char]: See [lex.token.literal.char-escape.unicode].
+
 r[lex.token.literal.char.intro]
 A _character literal_ is a single Unicode character enclosed within two `U+0027` (single-quote) characters, with the exception of `U+0027` itself, which must be _escaped_ by a preceding `U+005C` character (`\`).

@@ -196,7 +198,7 @@ r[lex.token.literal.char-escape.ascii]
 * A _7-bit code point escape_ starts with `U+0078` (`x`) and is followed by exactly two _hex digits_ with value up to `0x7F`. It denotes the ASCII character with value equal to the provided hex value. Higher values are not permitted because it is ambiguous whether they mean Unicode code points or byte values.

 r[lex.token.literal.char-escape.unicode]
-* A _24-bit code point escape_ starts with `U+0075` (`u`) and is followed by up to six _hex digits_ surrounded by braces `U+007B` (`{`) and `U+007D` (`}`). It denotes the Unicode code point equal to the provided hex value.
+* A _24-bit code point escape_ starts with `U+0075` (`u`) and is followed by up to six _hex digits_ surrounded by braces `U+007B` (`{`) and `U+007D` (`}`). It denotes the Unicode code point equal to the provided hex value. The value must be a valid Unicode scalar value.

 r[lex.token.literal.char-escape.whitespace]
 * A _whitespace escape_ is one of the characters `U+006E` (`n`), `U+0072` (`r`), or `U+0074` (`t`), denoting the Unicode values `U+000A` (LF), `U+000D` (CR) or `U+0009` (HT) respectively.
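
To illustrate the added "valid Unicode scalar value" restriction, here is a minimal Rust sketch of how the text between the braces of a `\u{...}` escape could be checked. The function name `decode_unicode_escape_body` is hypothetical and is not taken from the compiler or the Reference. The sketch assumes the `( HEX_DIGIT `_`* ){1..6}` shape from the grammar and relies on the standard library's `char::from_u32`, which returns `None` for surrogates and for values above `0x10FFFF`.

```rust
/// Hypothetical helper, for illustration only: decodes the text between the
/// braces of a `\u{...}` escape. Returns `None` if the text does not have the
/// shape `( HEX_DIGIT `_`* ){1..6}` or if the resulting value is not a valid
/// Unicode scalar value.
fn decode_unicode_escape_body(body: &str) -> Option<char> {
    let mut value: u32 = 0;
    let mut digits = 0;

    for (i, c) in body.chars().enumerate() {
        if c == '_' {
            // Underscores may only follow a hex digit, so not at the start.
            if i == 0 {
                return None;
            }
            continue;
        }
        // `to_digit(16)` yields `None` for anything that is not a hex digit.
        let d = c.to_digit(16)?;
        digits += 1;
        if digits > 6 {
            return None;
        }
        // At most six hex digits, so this cannot overflow a `u32`.
        value = value * 16 + d;
    }

    if digits == 0 {
        return None;
    }

    // Rejects surrogates (0xD800..=0xDFFF) and values above 0x10FFFF.
    char::from_u32(value)
}
```

Under these assumptions, `decode_unicode_escape_body("ffffff")` and `decode_unicode_escape_body("D800")` both return `None`, matching the commit's point that `"\u{ffffff}"` is not a valid token, while `decode_unicode_escape_body("1F600")` returns `Some('\u{1F600}')`.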
