Clarify UNICODE_ESCAPE valid token value

ehuss · ehuss · commit a85f11e1791d · 2025-12-20T12:03:40.000-08:00
This clarifies the UNICODE_ESCAPE rule that the hex value must be a valid Unicode scalar value. This resolves the problem that a string like `"\u{ffffff}"` is not a valid token, but the grammar did not reflect that. I don't see a practical way to define this with character ranges. The resulting expression is huge. Note that this restriction means that the UNICODE_ESCAPE rule will not match an invalid value, and that all the places where UNICODE_ESCAPE is used, the preceding character must *not* be `\`, which forces those rules to fail their match. In turn the only rules that contain UNICODE_ESCAPE have `'` or `"` characters, which won't match any other rule in the grammar, forcing them to fail the parse. If all those assumptions seem too fragile, then we can consider adding the [cut operator](rust-lang#2104) just after the `\u` so that the interpretation is clear that a failure to match the part from the opening brace is an immediate parse failure.
diff --git a/src/tokens.md b/src/tokens.md
@@ -157,9 +157,11 @@ ASCII_ESCAPE ->
     | `\n` | `\r` | `\t` | `\\` | `\0`
 
 UNICODE_ESCAPE ->
-    `\u{` ( HEX_DIGIT `_`* ){1..6} `}`
+    `\u{` ( HEX_DIGIT `_`* ){1..6} _valid hex char value_ `}`[^valid-hex-char]
 ```
 
+[^valid-hex-char]: See [lex.token.literal.char-escape.unicode].
+
 r[lex.token.literal.char.intro]
 A _character literal_ is a single Unicode character enclosed within two `U+0027` (single-quote) characters, with the exception of `U+0027` itself, which must be _escaped_ by a preceding `U+005C` character (`\`).
 
@@ -196,7 +198,7 @@ r[lex.token.literal.char-escape.ascii]
 * A _7-bit code point escape_ starts with `U+0078` (`x`) and is followed by exactly two _hex digits_ with value up to `0x7F`. It denotes the ASCII character with value equal to the provided hex value. Higher values are not permitted because it is ambiguous whether they mean Unicode code points or byte values.
 
 r[lex.token.literal.char-escape.unicode]
-* A _24-bit code point escape_ starts with `U+0075` (`u`) and is followed by up to six _hex digits_ surrounded by braces `U+007B` (`{`) and `U+007D` (`}`). It denotes the Unicode code point equal to the provided hex value.
+* A _24-bit code point escape_ starts with `U+0075` (`u`) and is followed by up to six _hex digits_ surrounded by braces `U+007B` (`{`) and `U+007D` (`}`). It denotes the Unicode code point equal to the provided hex value. The value must be a valid Unicode scalar value.
 
 r[lex.token.literal.char-escape.whitespace]
 * A _whitespace escape_ is one of the characters `U+006E` (`n`), `U+0072` (`r`), or `U+0074` (`t`), denoting the Unicode values `U+000A` (LF), `U+000D` (CR) or `U+0009` (HT) respectively.