Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use the offset at EOF for LexErrorKind::PrematureEnd. #340

Merged
merged 2 commits into from
Aug 15, 2022

Conversation

ratmice
Copy link
Collaborator

@ratmice ratmice commented Aug 12, 2022

Previously this used the character before EOF. While there
is currently no way to produce this error such that it ends
in a multibyte character. This makes the behavior match yacc.

@ratmice
Copy link
Collaborator Author

ratmice commented Aug 13, 2022

While there is currently no way to produce this error such that it ends in a multibyte character.

It looks like this might be a misstatement, just beginning to analyze the results of a fuzzing run where this appears to have failed
and I believe it might be possible to end in a multibyte whitespace character presumably because of is_whitespace().

@@ -1485,6 +1485,12 @@ A:
&src,
)
.expect_error_at_line_col(&src, YaccGrammarErrorKind::PrematureEnd, 1, 17);
let src = "// 🦀".to_string();
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this is a Yacc test rather than a Lex one, It appears to be a nice way to produce this error that doesn't involve invalid code gen.

I added it here, because i'm not certain that accepting multibyte whitespace is actually intended, or something we just accidentally accept.
If that is the case, (I don't believe lex accepts comments), but if we added them we could reproduce PrematureEnd with a comment like this.

@ratmice
Copy link
Collaborator Author

ratmice commented Aug 14, 2022

@ltratt mind chiming in on whether we should be accepting multibyte whitespace characters in fbcc378 or not?

@ltratt ltratt self-assigned this Aug 14, 2022
@ltratt
Copy link
Member

ltratt commented Aug 14, 2022

I didn't know there was such a thing as multibyte whitespace! Since we do aim to support unicode properly, I think the right thing to do is to deal with multibyte whitespace properly, but I can be persuaded otherwise.

@ratmice
Copy link
Collaborator Author

ratmice commented Aug 14, 2022

Okay, I believe that the extent of that might be just changing the parse_ws function to use

if !c.is_whitespace() {
 break 
}

But I'll have to see what the callers think of that change.

@ratmice
Copy link
Collaborator Author

ratmice commented Aug 14, 2022

Contrary to the above, I believe it would be worthwhile to just follow https://www.unicode.org/reports/tr31/ which is intended for parsing/lexing of unicode
and the Pattern_Whitespace property defined in, https://www.unicode.org/Public/13.0.0/ucd/PropList.txt

In addition to e.g. 'is_whitespace' we would need to identify the SpaceSeparator Zs class subset of Whitespace or Pattern_Whitespace.

For Pattern_Whitespace this SpaceSeparator looks to be just the typical ASCII space character. Pattern_Whitespace does contain some multibyte characters still, e.g. LineSeparator and ParagraphSeparator so we'd still encounter this issue with those.

There is an issue in unicode-xid, pointing to some rustc code which might be helpful,
(Or I could try and solve it there, if you feel like having it as a dependency would be ok, it looks like we actually already depend on it for the serde feature).
unicode-rs/unicode-xid#1

Anyhow it seems to me, this would be the relevant unicode guidelines

@ltratt
Copy link
Member

ltratt commented Aug 14, 2022

Unicode makes my head hurt, so I'm happy to go with what you think best (provided it doesn't break anything obvious, and we have some new tests to make sure we don't introduce new oddities)!

@ratmice
Copy link
Collaborator Author

ratmice commented Aug 14, 2022

So, I think one of the reasons it makes sense not to use is_whitespace() is things like Ogham Space Mark,
which is to be read right to left (or bottom to top I would guess?), so it is kind of strange to allow it in something that isn't really right-to-left etc.

Anyhow, editor support for all these non-ASCII characters is pretty terrible, i'm really not sure how I would actually feel about people using them. But even if we decide to not use the \p{Pattern_white_Space} things, it at least should be fairly simple now to change them to whatever we decide on and handle it uniformly across the code base.

The regex's could be a bit nicer, but I don't think there is a \p{gc=Space_Separator} etc in rusts regex, I at least couldn't find anything... 🤷

@@ -1311,4 +1320,16 @@ a\[\]a 'aboxa'
}
LRNonStreamingLexerDef::<DefaultLexeme<u8>, u8>::from_str(&src).ok();
}

/// Test with various /// [Pattern_White_Space](https://unicode.org/reports/tr31/) separators.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about the inner ///?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 7635044

@ltratt
Copy link
Member

ltratt commented Aug 14, 2022

I had to look up Ogham -- I had no idea it existed!

It feels to me like this PR probably does enough as-is to make the situation meaningfully better. If anyone does try to write lex/yacc files in the same manner as 4th-6th century Irish, we can revisit this. [Via the wonders of wikipedia I now know that I can find such writing in Cornwall...]

@ltratt
Copy link
Member

ltratt commented Aug 14, 2022

I think this is ready for squashing? If so, please go ahead.

@ratmice
Copy link
Collaborator Author

ratmice commented Aug 14, 2022

Before I squash,

I would like to understand why in that last testcase I added, the first one gets InvalidStartStateName, while the second one gets UnknownDeclaration, I think it would be ideal if they both got InvalidStartStateName. but i'll have to look at it tomorrow.

@ltratt
Copy link
Member

ltratt commented Aug 15, 2022

Please squash.

Previously, we would use hard-coded ASCII whitespace characters,
while also using rust std library functions that use unicode whitespace.

This adds a series of regular expressions, and helper functions
for use with the rust std library functions `trim_matches()`, etc.
So that accepted whitespace characters are treated uniformly.
The subset of whitespace characters now accepted are those specified in
`Pattern_White_Space` from https://unicode.org/reports/tr31/
@ratmice
Copy link
Collaborator Author

ratmice commented Aug 15, 2022

Squashed.

@ltratt
Copy link
Member

ltratt commented Aug 15, 2022

bors r+

@bors
Copy link
Contributor

bors bot commented Aug 15, 2022

Build succeeded:

@bors bors bot merged commit cb99d23 into softdevteam:master Aug 15, 2022
@ratmice ratmice deleted the lex_premature_end_offset branch August 15, 2022 10:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants