Use the offset at EOF for LexErrorKind::PrematureEnd. #340
It looks like this might be a misstatement; I'm just beginning to analyze the results of a fuzzing run where this appears to have failed.
@@ -1485,6 +1485,12 @@ A:
&src,
)
.expect_error_at_line_col(&src, YaccGrammarErrorKind::PrematureEnd, 1, 17);
let src = "// 🦀".to_string();
While this is a Yacc test rather than a Lex one, it appears to be a nice way to produce this error that doesn't involve invalid code gen.
I added it here because I'm not certain whether accepting multibyte whitespace is actually intended, or something we just accidentally accept.
If that is the case: I don't believe lex accepts comments, but if we added them we could reproduce PrematureEnd with a comment like this.
I didn't know there was such a thing as multibyte whitespace! Since we do aim to support unicode properly, I think the right thing to do is to deal with multibyte whitespace properly, but I can be persuaded otherwise.
Okay, I believe that the extent of that might be just changing the
But I'll have to see what the callers think of that change.
Contrary to the above, I believe it would be worthwhile to just follow https://www.unicode.org/reports/tr31/ which is intended for parsing/lexing of Unicode. In addition to e.g. 'is_whitespace' we would need to identify the SpaceSeparator. For Pattern_White_Space this SpaceSeparator looks to be just the typical ASCII space character. Pattern_White_Space does still contain some multibyte characters, e.g. LineSeparator and ParagraphSeparator, so we'd still encounter this issue with those. There is an issue in unicode-xid, pointing to some rustc code which might be helpful. Anyhow, it seems to me this would be the relevant Unicode guideline.
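To make the difference between the two definitions concrete, here is a minimal sketch of a `Pattern_White_Space` predicate next to the standard library's `char::is_whitespace`. The names `is_pattern_white_space` and `disagreements` are hypothetical and are not taken from this PR; the code point list is the `Pattern_White_Space` set referenced by UAX #31.

```rust
/// Hypothetical helper (not the PR's code): true for the 11 code points
/// that carry the Unicode `Pattern_White_Space` property.
fn is_pattern_white_space(c: char) -> bool {
    matches!(
        c,
        '\u{0009}'..='\u{000D}'       // tab, line feed, vertical tab, form feed, carriage return
            | '\u{0020}'              // ASCII space
            | '\u{0085}'              // next line (NEL)
            | '\u{200E}' | '\u{200F}' // left-to-right / right-to-left mark
            | '\u{2028}' | '\u{2029}' // line separator, paragraph separator
    )
}

/// Returns the candidates on which `is_pattern_white_space` and
/// `char::is_whitespace` give different answers.
fn disagreements(candidates: &[char]) -> Vec<char> {
    candidates
        .iter()
        .copied()
        .filter(|&c| is_pattern_white_space(c) != c.is_whitespace())
        .collect()
}
```

For example, `disagreements(&[' ', '\u{00A0}', '\u{200E}', '\u{2028}'])` reports the no-break space (whitespace to `char::is_whitespace` only) and the left-to-right mark (`Pattern_White_Space` only), while the ASCII space and the line separator are accepted by both.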
Unicode makes my head hurt, so I'm happy to go with what you think best (provided it doesn't break anything obvious, and we have some new tests to make sure we don't introduce new oddities)!
So, I think one of the reasons it makes sense not to use Anyhow, editor support for all these non-ASCII characters is pretty terrible, I'm really not sure how I would actually feel about people using them. But even if we decide to not use the The regexes could be a bit nicer, but I don't think there is a
lrlex/src/lib/parser.rs
Outdated
@@ -1311,4 +1320,16 @@ a\[\]a 'aboxa'
}
LRNonStreamingLexerDef::<DefaultLexeme<u8>, u8>::from_str(&src).ok();
}

/// Test with various /// [Pattern_White_Space](https://unicode.org/reports/tr31/) separators.
I'm not sure about the inner `///`?
Fixed in 7635044
I had to look up Ogham -- I had no idea it existed! It feels to me like this PR probably does enough as-is to make the situation meaningfully better. If anyone does try to write lex/yacc files in the same manner as 4th-6th century Irish, we can revisit this. [Via the wonders of Wikipedia I now know that I can find such writing in Cornwall...]
I think this is ready for squashing? If so, please go ahead.
Before I squash, I would like to understand why in that last testcase I added, the first one gets
Please squash.
Previously, we used hard-coded ASCII whitespace characters while also using Rust standard library functions that use Unicode whitespace. This adds a series of regular expressions, and helper functions for use with Rust standard library functions such as `trim_matches()`, so that accepted whitespace characters are treated uniformly. The whitespace characters now accepted are those specified in `Pattern_White_Space` from https://unicode.org/reports/tr31/
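As a rough illustration of why the helpers matter, the sketch below trims with an explicit `Pattern_White_Space` set instead of `str::trim()`, which uses the broader Unicode `White_Space` property. The constant `PATTERN_WHITE_SPACE` and the function `trim_pattern_white_space` are hypothetical names for this example and are not the helpers added by the PR.

```rust
/// Hypothetical constant (not the PR's code): the `Pattern_White_Space` code points.
const PATTERN_WHITE_SPACE: &[char] = &[
    '\u{0009}', '\u{000A}', '\u{000B}', '\u{000C}', '\u{000D}', // tab, LF, VT, FF, CR
    '\u{0020}', // ASCII space
    '\u{0085}', // next line (NEL)
    '\u{200E}', '\u{200F}', // left-to-right / right-to-left mark
    '\u{2028}', '\u{2029}', // line separator, paragraph separator
];

/// Trims only `Pattern_White_Space` characters from both ends of `s`.
/// `str::trim_matches` accepts a `&[char]` as its pattern.
fn trim_pattern_white_space(s: &str) -> &str {
    s.trim_matches(PATTERN_WHITE_SPACE)
}
```

With this set, a trailing no-break space (U+00A0) is left in place, whereas `str::trim()` would strip it; conversely, a leading left-to-right mark (U+200E) is stripped here but kept by `str::trim()`. Using one set for both the regular expressions and the trimming calls is what keeps the two code paths in agreement.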
54825ad to 38e92e9
Squashed.
bors r+
Build succeeded:
Previously this used the character before EOF. While there
is currently no way to produce this error such that it ends
in a multibyte character, this makes the behavior match yacc.
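The following sketch shows why the choice of offset matters once the input can end in a multibyte character, using the `// 🦀` source from the test above. It is an illustration only: `eof_offset` and `last_char_offset` are hypothetical names, not functions from lrlex or cfgrammar.

```rust
/// Hypothetical helper: the byte offset at end of input.
/// `str::len()` is always a valid char boundary.
fn eof_offset(src: &str) -> usize {
    src.len()
}

/// Hypothetical helper: the byte offset where the last character starts,
/// or `None` for empty input.
fn last_char_offset(src: &str) -> Option<usize> {
    src.char_indices().last().map(|(i, _)| i)
}

/// True if naively subtracting one byte from the EOF offset would land
/// inside a multibyte character (or if the input is empty).
fn eof_minus_one_splits_char(src: &str) -> bool {
    match src.len().checked_sub(1) {
        Some(i) => !src.is_char_boundary(i),
        None => true,
    }
}
```

For `"// 🦀"` the crab occupies four bytes, so the EOF offset is 7 and the last character starts at byte 3; byte 6 lies inside the crab and is not a char boundary. Reporting the EOF offset avoids having to reason about the width of the final character at all.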