Use the offset at EOF for LexErrorKind::PrematureEnd. #340

ratmice · 2022-08-12T21:55:25Z

Previously this used the character before EOF. While there
is currently no way to produce this error such that it ends
in a multibyte character, this makes the behavior match yacc.

ratmice · 2022-08-13T23:13:52Z

> While there is currently no way to produce this error such that it ends in a multibyte character.

It looks like this might be a misstatement. I'm just beginning to analyze the results of a fuzzing run where this appears to have failed,
and I believe it might be possible to end in a multibyte whitespace character, presumably because of is_whitespace().
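
To illustrate why the choice of offset matters once the input can end in a multibyte character, here is a minimal sketch. The helper names `last_char_offset` and `eof_offset` are hypothetical and not taken from the codebase, and the thread does not show how the previous offset was computed; the sketch only contrasts "start of the last character" with "offset at EOF".

```rust
/// Byte offset at which the final character of `src` starts, if any.
/// (Hypothetical helper, for illustration only.)
fn last_char_offset(src: &str) -> Option<usize> {
    src.char_indices().last().map(|(i, _)| i)
}

/// Byte offset at EOF, i.e. one past the final byte of `src`.
/// (Hypothetical helper, for illustration only.)
fn eof_offset(src: &str) -> usize {
    src.len()
}

fn main() {
    // Ends in U+2028 LINE SEPARATOR, which is three bytes in UTF-8.
    let src = "%%\u{2028}";
    let last = last_char_offset(src).unwrap();
    let eof = eof_offset(src);
    println!("last char starts at {}, EOF is at {}", last, eof);

    // The EOF offset is always a valid char boundary, so slicing there is safe.
    assert_eq!(&src[eof..], "");
    // Stepping one byte past the start of the last char lands inside it,
    // which is where byte arithmetic that assumes single-byte characters goes wrong.
    assert!(!src.is_char_boundary(last + 1));
}
```

Using the EOF offset avoids any dependence on how wide the final character is.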

ratmice · 2022-08-14T00:55:39Z

cfgrammar/src/lib/yacc/parser.rs

@@ -1485,6 +1485,12 @@ A:
            &src,
        )
        .expect_error_at_line_col(&src, YaccGrammarErrorKind::PrematureEnd, 1, 17);
+        let src = "// 🦀".to_string();


While this is a Yacc test rather than a Lex one, it appears to be a nice way to produce this error that doesn't involve invalid code gen.

I added it here because I'm not certain whether accepting multibyte whitespace is actually intended or something we just accidentally accept.
If that is the case, we could reproduce PrematureEnd with a comment like this in lex too, although we would have to add comments first (I don't believe lex accepts them).

ratmice · 2022-08-14T00:59:07Z

@ltratt mind chiming in on whether we should be accepting multibyte whitespace characters in fbcc378 or not?

ltratt · 2022-08-14T07:49:08Z

I didn't know there was such a thing as multibyte whitespace! Since we do aim to support unicode properly, I think the right thing to do is to deal with multibyte whitespace properly, but I can be persuaded otherwise.

ratmice · 2022-08-14T08:04:00Z

Okay, I believe that the extent of that might be just changing the parse_ws function to use

if !c.is_whitespace() {
    break;
}

But I'll have to see what the callers think of that change.
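
As a rough sketch of what that change implies for callers: once the loop accepts any Unicode whitespace, the byte offset has to advance by each character's encoded width rather than by one. The function `skip_ws` below is a hypothetical stand-in; the real `parse_ws` signature is not shown in this thread.

```rust
/// Hypothetical stand-in for a `parse_ws`-style helper: returns the byte
/// offset of the first non-whitespace character at or after `i`.
/// `i` must lie on a char boundary of `src`.
fn skip_ws(src: &str, mut i: usize) -> usize {
    for c in src[i..].chars() {
        if !c.is_whitespace() {
            break;
        }
        // Advance by the encoded width, not by 1, so `i` stays on a char boundary.
        i += c.len_utf8();
    }
    i
}

fn main() {
    // ASCII space (1 byte), LINE SEPARATOR (3 bytes), tab (1 byte), then 'A'.
    let src = " \u{2028}\tA";
    let i = skip_ws(src, 0);
    assert_eq!(&src[i..], "A");
    println!("first non-whitespace byte offset: {}", i);
}
```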

ratmice · 2022-08-14T10:03:28Z

Contrary to the above, I believe it would be worthwhile to just follow https://www.unicode.org/reports/tr31/, which is intended for parsing/lexing of Unicode,
and the Pattern_White_Space property defined in https://www.unicode.org/Public/13.0.0/ucd/PropList.txt

In addition to e.g. 'is_whitespace' we would need to identify the Space_Separator (Zs) subset of White_Space or Pattern_White_Space.

For Pattern_White_Space this Space_Separator subset looks to be just the typical ASCII space character. Pattern_White_Space does still contain some multibyte characters, e.g. Line_Separator and Paragraph_Separator, so we'd still encounter this issue with those.
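
For reference, the Pattern_White_Space set is small enough to write out by hand. The sketch below lists its eleven code points as given in PropList.txt and checks them against `char::is_whitespace`; the constant and function names are made up for illustration and are not the ones used in this PR.

```rust
/// The code points with the Unicode `Pattern_White_Space` property.
/// (Illustrative constant; the name is an assumption.)
const PATTERN_WHITE_SPACE: [char; 11] = [
    '\u{0009}', '\u{000A}', '\u{000B}', '\u{000C}', '\u{000D}', // TAB..CR
    '\u{0020}', // SPACE, the only Space_Separator (Zs) member
    '\u{0085}', // NEXT LINE
    '\u{200E}', '\u{200F}', // LEFT-TO-RIGHT MARK, RIGHT-TO-LEFT MARK
    '\u{2028}', // LINE SEPARATOR
    '\u{2029}', // PARAGRAPH SEPARATOR
];

/// Hypothetical predicate over the set above.
fn is_pattern_white_space(c: char) -> bool {
    PATTERN_WHITE_SPACE.contains(&c)
}

fn main() {
    for c in PATTERN_WHITE_SPACE.iter() {
        println!("U+{:04X}: {} byte(s) in UTF-8", *c as u32, c.len_utf8());
    }
    // OGHAM SPACE MARK is Unicode whitespace but not Pattern_White_Space.
    assert!('\u{1680}'.is_whitespace());
    assert!(!is_pattern_white_space('\u{1680}'));
    // The directional marks go the other way round.
    assert!(is_pattern_white_space('\u{200E}'));
    assert!(!'\u{200E}'.is_whitespace());
}
```

Note that the two properties overlap without either containing the other, so switching from `is_whitespace()` both removes and adds accepted characters.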

There is an issue in unicode-xid (unicode-rs/unicode-xid#1) pointing to some rustc code which might be helpful.
Or I could try to solve it there, if you feel that having it as a dependency would be OK; it looks like we actually already depend on it for the serde feature.

Anyhow, it seems to me that these would be the relevant Unicode guidelines.

ltratt · 2022-08-14T13:54:46Z

Unicode makes my head hurt, so I'm happy to go with what you think best (provided it doesn't break anything obvious, and we have some new tests to make sure we don't introduce new oddities)!

ratmice · 2022-08-14T21:35:40Z

So, I think one of the reasons it makes sense not to use is_whitespace() is things like Ogham Space Mark,
which is to be read right to left (or bottom to top I would guess?), so it is kind of strange to allow it in something that isn't really right-to-left etc.

Anyhow, editor support for all these non-ASCII characters is pretty terrible, and I'm really not sure how I would actually feel about people using them. But even if we decide not to use the \p{Pattern_White_Space} things, it should at least be fairly simple now to change them to whatever we decide on and handle it uniformly across the code base.

The regexes could be a bit nicer, but I don't think there is a \p{gc=Space_Separator} etc. in Rust's regex; at least I couldn't find anything... 🤷

ltratt · 2022-08-14T22:01:23Z

lrlex/src/lib/parser.rs

@@ -1311,4 +1320,16 @@ a\[\]a 'aboxa'
        }
        LRNonStreamingLexerDef::<DefaultLexeme<u8>, u8>::from_str(&src).ok();
    }
+
+    /// Test with various /// [Pattern_White_Space](https://unicode.org/reports/tr31/) separators.


I'm not sure about the inner ///?

Fixed in 7635044

ltratt · 2022-08-14T22:04:07Z

I had to look up Ogham -- I had no idea it existed!

It feels to me like this PR probably does enough as-is to make the situation meaningfully better. If anyone does try to write lex/yacc files in the same manner as 4th-6th century Irish, we can revisit this. [Via the wonders of wikipedia I now know that I can find such writing in Cornwall...]

ltratt · 2022-08-14T22:04:40Z

I think this is ready for squashing? If so, please go ahead.

lrlex/src/lib/parser.rs

ratmice · 2022-08-14T22:32:53Z

Before I squash,

I would like to understand why, in that last testcase I added, the first one gets InvalidStartStateName while the second one gets UnknownDeclaration. I think it would be ideal if they both got InvalidStartStateName, but I'll have to look at it tomorrow.

ltratt · 2022-08-15T10:02:06Z

Please squash.

Previously, we would use hard-coded ASCII whitespace characters, while also using Rust std library functions that use Unicode whitespace. This adds a series of regular expressions and helper functions for use with the Rust std library functions `trim_matches()`, etc., so that accepted whitespace characters are treated uniformly. The whitespace characters now accepted are those specified in `Pattern_White_Space` from https://unicode.org/reports/tr31/
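
A minimal sketch of the helper-function half of that approach, assuming a predicate named `is_pattern_white_space` (the actual helper names in the PR are not shown in this thread). Because `fn(char) -> bool` can be used as a pattern, the same predicate can be passed to `trim_matches()`, `trim_start_matches()`, `split()` and friends.

```rust
/// Hypothetical predicate for the `Pattern_White_Space` code points.
fn is_pattern_white_space(c: char) -> bool {
    matches!(
        c,
        '\u{0009}'..='\u{000D}'
            | '\u{0020}'
            | '\u{0085}'
            | '\u{200E}'
            | '\u{200F}'
            | '\u{2028}'
            | '\u{2029}'
    )
}

fn main() {
    // Trims LINE SEPARATOR, tab, space and NEXT LINE from both ends.
    let line = "\u{2028}\t%start \u{0085}";
    assert_eq!(line.trim_matches(is_pattern_white_space), "%start");

    // Unlike `str::trim()`, OGHAM SPACE MARK is left in place.
    let ogham = "\u{1680}x";
    assert_eq!(ogham.trim_start_matches(is_pattern_white_space), ogham);
    assert_eq!(ogham.trim(), "x");

    // The same predicate also works as a separator.
    let parts: Vec<&str> = "a\u{2028}b".split(is_pattern_white_space).collect();
    println!("{:?}", parts);
}
```

Routing every trim and split through one predicate is what keeps the accepted set uniform; changing the set later means editing a single function.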

ratmice · 2022-08-15T10:34:35Z

Squashed.

ltratt · 2022-08-15T10:39:33Z

bors r+

bors · 2022-08-15T10:43:18Z

Build succeeded:

buildbot/buildbot-build-script

ltratt self-assigned this Aug 14, 2022

ltratt reviewed Aug 14, 2022

ratmice added 2 commits August 15, 2022 03:17

Test src slicing from spans, and multibyte PrematureEnd for lex.

85c511c

ratmice force-pushed the lex_premature_end_offset branch from 54825ad to 38e92e9 Compare August 15, 2022 10:33

bors bot merged commit cb99d23 into softdevteam:master Aug 15, 2022

ratmice deleted the lex_premature_end_offset branch August 15, 2022 10:49
