Canonical Rust grammar distinct from parser (tracking issue for RFC #1331) #30942
Comments
I’m curious what kind of testing can check that a grammar matches the not-grammar-based parser. |
Well, it can show the presence of bugs, but not the absence of them. https://github.com/rust-lang/rust/blob/master/mk/grammar.mk is what we used to do. Just vestigial at this point :) |
In intellij-rust, we have regression tests for the parser, which consist of a Rust file and a serialized AST. Then we check that the parser produces the expected AST. We also have some negative "you shall not parse" tests. The same technique can be applied to the reference grammar, if we ensure that the serialized AST is sufficiently abstract and doesn't leak grammar "implementation details". The question here is what tool should be used for the reference grammar? I think it has two goals:
One option is to continue to use Bison. In my opinion, it doesn't fulfill the second goal: the grammar is difficult to read because of the large amount of duplication and low-level details. Regular expression extensions would be really nice to have in the canonical grammar. Another, more extreme option is to build a custom parser generator (LALRPOP?) :) It would have the nice side effect of making Rust more suitable for efficiently implementing programming languages, a realm currently occupied by C/C++. |
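The regression-test scheme described above (a source file plus its expected serialized AST, plus negative cases) might look roughly like this in Rust. This is only a sketch: `parse_to_sexpr` is a hypothetical toy parser standing in for whichever parser is under test, and the S-expression serialization is an assumption, not the format intellij-rust uses.

```rust
/// Hypothetical toy parser: accepts `INT ("+" INT)*` (whitespace-separated)
/// and serializes the left-nested tree as an S-expression.
fn parse_to_sexpr(src: &str) -> Result<String, String> {
    let mut tokens = src.split_whitespace();
    let first = tokens.next().ok_or("empty input")?;
    let mut tree = number(first)?;
    while let Some(op) = tokens.next() {
        if op != "+" {
            return Err(format!("unexpected token `{}`", op));
        }
        let rhs = tokens.next().ok_or("missing operand")?;
        tree = format!("(+ {} {})", tree, number(rhs)?);
    }
    Ok(tree)
}

fn number(tok: &str) -> Result<String, String> {
    tok.parse::<u64>()
        .map(|n| n.to_string())
        .map_err(|_| format!("not a number: `{}`", tok))
}

/// Each case pairs a source text with `Some(expected_ast)` (must parse to
/// exactly that) or `None` (a "you shall not parse" case).
/// Returns one message per failing case.
fn run_cases(cases: &[(&str, Option<&str>)]) -> Vec<String> {
    let mut failures = Vec::new();
    for (src, expected) in cases {
        match (parse_to_sexpr(src), expected) {
            (Ok(got), Some(want)) if got == *want => {}
            (Err(_), None) => {}
            (got, want) => {
                failures.push(format!("{:?}: got {:?}, want {:?}", src, got, want))
            }
        }
    }
    failures
}
```

The same list of cases could be run against both the production parser and a parser generated from the reference grammar, as long as both serialize to the same abstract form.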
I have opinions but not much time. I've been interested in porting the Rust grammar to LALRPOP -- I think it'd be much cleaner, as LALRPOP has a number of features that should allow us to avoid some of the abuse of precedence declarations and the like. (I found the existing grammar not very useful in evaluating syntactic changes because of those precedence declarations, for example.) (As an aside, this would probably be a good stress test for LALRPOP; I assume we'd have to fix various bugs or add new features to scale it up that large.) |
So this morning I started porting the yacc grammar in the repository to LALRPOP, just to see what would happen. I made some progress. But I was wondering: is that grammar the most up to date version, or is the one in the repo more up to date? Does anyone know what the differences are between them? |
I was thinking about working on that as well, but I unfortunately don't have too much
|
The (parser-lalr) grammar originally comes from https://github.com/bleibig/rust-grammar, which hasn’t been updated in a while. |
I will do this. I am right now just working through some LALRPOP issues that arose (I'm not sure if it's a bug or if LALRPOP just needs optimization; I suspect both :) but once I get that sorted out a bit I will open up a repo and post the link here. Hopefully tomorrow. It seems @TyOverby may be interested as well, and no doubt others. For example, I was talking to @jorendorff, who developed https://github.com/jorendorff/rust-grammar/, as well. |
OK, I put up a definite work in progress here: https://github.com/nikomatsakis/rustypop I also described some conventions that I am aiming towards, but have not yet achieved. This is hard work to parallelize in the early stages, but if you're interested in hacking on it, let me know over IRC or what have you (or just open some PRs). I'll probably alternate between working on it and improving LALRPOP (for example, I am now highly motivated to get a better printout on shift-reduce errors). =) |
If we can generate a parser from the official grammar, and if we can run |
@glaebhoerl: Well, there's also that the entire formalism the official grammar is based on is called "generative grammars" - using them to create working parsers came after using them to create exemplars, which is essentially a matter of transforming the grammar into a tree and executing a depth-first search. For example:
We form the following tree:
With that, we then walk the tree, generating each possible string covered by the grammar. One can do this efficiently, without actually generating the tree, by assigning each optional subrule a bit, treating the collection of those bits as a number, and counting. Enabling a subrule will sometimes reveal another optional; in that case, push the current "number" onto a queue, and when you've finished counting the "basic optionals", pop items off the queue and count the newly-revealed optionals again, adding them to the queue if they reveal more. That ensures one will generate every rule, in their shortest exemplars, before continuing on to the next shortest, etc. |
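A minimal Rust sketch of the counting scheme described above, under some loud assumptions: the grammar is a single rule made only of literals and (possibly nested) optional groups, the `Elem` type and `enumerate` function are hypothetical names, and a form is assumed to have fewer than 32 optionals at any one level. Each top-level optional gets one bit of `mask`; forms that reveal nested optionals are queued and counted later.

```rust
use std::collections::VecDeque;

/// One element of a rule body: a literal token or an optional group.
#[derive(Clone)]
enum Elem {
    Lit(&'static str),
    Opt(Vec<Elem>),
}

/// Enumerates every string of the rule, finishing the "basic optionals"
/// before descending into optionals revealed by enabling an outer one.
fn enumerate(rule: Vec<Elem>) -> Vec<String> {
    let mut out = Vec::new();
    let mut queue = VecDeque::new();
    queue.push_back(rule);
    while let Some(form) = queue.pop_front() {
        // Count the optionals visible at this level; each one gets a bit.
        let n_opts = form.iter().filter(|e| matches!(e, Elem::Opt(_))).count();
        for mask in 0u32..(1u32 << n_opts) {
            let mut next = Vec::new();
            let mut bit = 0u32;
            for elem in &form {
                match elem {
                    Elem::Lit(_) => next.push(elem.clone()),
                    Elem::Opt(body) => {
                        // Bit set: splice the optional's body in; otherwise drop it.
                        if (mask & (1u32 << bit)) != 0 {
                            next.extend(body.iter().cloned());
                        }
                        bit += 1;
                    }
                }
            }
            if next.iter().any(|e| matches!(e, Elem::Opt(_))) {
                // Newly revealed optionals: count over them later.
                queue.push_back(next);
            } else {
                let words: Vec<&str> = next
                    .iter()
                    .filter_map(|e| match e {
                        Elem::Lit(s) => Some(*s),
                        Elem::Opt(_) => None,
                    })
                    .collect();
                out.push(words.join(" "));
            }
        }
    }
    out
}
```

For the rule `a [b [c]] d [e]`, this yields the six possible strings, with the ones that need no nested optional coming first. The order is by nesting depth, which only approximates "shortest first"; alternation and recursion are left out of the sketch entirely.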
OK, so there's probably something horribly wrong with it, but the rustypop crate now builds without any shift/reduce conflicts. I haven't actually tried RUNNING the code it produces, of course, and I have all empty actions, so it will only yield a "true/false" result. Plus I need to adapt the rustc tokenizer. But it seems like progress. :) Enough progress that it may be possible to start parallelizing the work (debugging shift/reduce conflicts is kind of a serial task...). Of course, I also expect that as soon as we try using this grammar we'll find that I did some bone-headed things that resolved all conflicts by just not parsing anything at all, or something like that. Anyway, I plan to write up a blog post about the approach I took, since it made use of a number of LALRPOP features to try and pare down duplication. I also took the liberty of ripping out various bits of obsolete syntax from the existing |
Oh, I should mention that it generates a 500MB |
Hmm, now I see some more conflicts. So maybe premature. But still, getting close I think. :) |
Another tool which can be used for the canonical grammar is antlr4. I have not used it myself, but I think it should be mentioned in this thread. |
I believe we used to use antlr4 before the yacc-based grammar got merged. |
Looks like only the lexer was implemented in antlr4: https://github.com/rust-lang/rust/tree/29bd9a06efd2f8c8a7b1102e2203cc0e6ae2dcba/src/grammar |
I used antlr4 to make https://github.com/jorendorff/rust-grammar and it has a few drawbacks:
|
@jorendorff I suggest using a macro preprocessor or build step to reduce some of the code duplication. |
cc #15880 (we'll want a bot for this) |
Hey, and what about lexer / parser split? Perhaps we should create a canonical lexical structure grammar before jumping onto the grammar for the whole language? cc @dns2utf8 |
@matklad I’d consider formalising the lexer an inherent part of grammar formalisation. That being said, unlike the grammar, which has been extended significantly over time, the lexer has stayed fairly constant (i.e. it is a different problem space) and the reference is still pretty good at capturing the lexical structure of the language. |
The point is that lexer can be formalized before the rest of the grammar, so it is a good independent first step. Having an executable semi declarative specification of the lexer would help for technical reasons:
It can be better though! There are some corner cases like |
I would like to create a complete grammar first. I am currently at RustFest.eu if somebody is here too and would like to talk a little about it. |
Update grammar to parse current rust syntax Mainly addressing rust-lang#32723. This PR updates the bison grammar so that it can parse the current rust syntax, except for feature-gated syntax additions. It has been tested with all the tests in run-pass. The grammar in this repo doesn't have build logic anymore, but you can test it out in https://github.com/bleibig/rust-grammar, which has all of what's in this PR. If you are interested in having build logic and grammar tests again, I can look into implementing that as well. I'm aware that things are somewhat undecided as to what an official rust grammar should be from the discussion in rust-lang#30942. With this PR we can go back to having an up-to-date flex/bison based grammar, but the rustypop grammar looks interesting as well.
Hello. Every open source jungle is difficult in the beginning. What I miss is a grammar for Rust aimed at users. There are fragments in some documents, but nothing cohesive. I am 70 years old, have used Rust for a few months, and have written a tool, https://github.com/willy610/bnf2railroad, which reads a grammar in EBNF style and produces so-called railroad views. The views are available as raw TTY output, HTML with anchors, and SVG. I have always thought that railroad views are a good complement to other documentation such as examples and small projects. While learning, I worked through the grammar files contained in the Rust documentation. They are somewhat incomplete and do not have the quality that other languages offer; much is undefined. The tool ships with grammars for Pascal, Smalltalk, Lua, JSON, rege and fragments of Rust. The working method could be:
I have also used the tool to generate better help information for the tool itself (the parameters of the program). Perhaps the tool could be useful internally for developers of Rust, for example for specifications, bug reports, etc. |
@harpocrates I'm curious, what is the status of your rust parser? @matklad and I have been talking about trying to get a better effort going here. The rough plan is to start with your grammar (probably converted to LALRPOP and then perhaps clean it up). Also, would you be interested in being involved in any such effort? |
@nikomatsakis my rust parser is complete. I would be interested in helping out with any effort to convert to LALRPOP. What would be the best way to proceed? Note that converting to LALRPOP may not be completely trivial. My grammar is currently written for Happy (a Haskell parser generator) and makes use of a handful of Happy features, namely:
|
@harpocrates looking at your parser, I was curious, how have you tested it? I see a few tests in the repository, but I was wondering if you tested against e.g. the sources of the Rust compiler, or crates.io, or something like that. |
LALRPOP supports parameterized productions. It does not support pushing back tokens, but there is an alternative way to handle The trickier thing will be precedence, since LALRPOP does not support those sorts of declarations. We ought to be able to refactor the grammar though (I actually converted the existing The first thing I had planned to do, in any case, is to write a tool to convert your grammar into LALRPOP syntax, and see how LALRPOP feels about it. The next priority would be a testing harness. It's rather late here so I'll have to write more later but I'd love to collaborate, particularly since my personal time is quite limited. Separately, @matklad and I have been talking about extending LALRPOP with a way to generate default actions that will fire off events suitable for building a generic parse tree, roughly as described in their RFC. |
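To illustrate the kind of refactoring mentioned above, where precedence is expressed by the shape of the grammar instead of by precedence declarations, here is a small hand-written Rust sketch. It is not LALRPOP syntax and not part of any of the grammars discussed; the three-level `Expr` / `Term` / `Factor` layering and all names are assumptions chosen for illustration. Tokens are assumed to be whitespace-separated.

```rust
// Precedence by layering, no declarations needed:
//   Expr   := Term   (("+" | "-") Term)*
//   Term   := Factor (("*" | "/") Factor)*
//   Factor := NUMBER | "(" Expr ")"

struct Parser<'a> {
    tokens: Vec<&'a str>,
    pos: usize,
}

impl<'a> Parser<'a> {
    fn new(src: &'a str) -> Self {
        Parser { tokens: src.split_whitespace().collect(), pos: 0 }
    }

    fn peek(&self) -> Option<&'a str> {
        self.tokens.get(self.pos).copied()
    }

    fn bump(&mut self) -> Option<&'a str> {
        let tok = self.peek();
        if tok.is_some() {
            self.pos += 1;
        }
        tok
    }

    // Lowest-precedence level: additive operators, left-associative.
    fn expr(&mut self) -> Option<i64> {
        let mut acc = self.term()?;
        while let Some(op) = self.peek() {
            match op {
                "+" => { self.bump(); acc += self.term()?; }
                "-" => { self.bump(); acc -= self.term()?; }
                _ => break,
            }
        }
        Some(acc)
    }

    // Middle level: multiplicative operators bind tighter than additive ones.
    fn term(&mut self) -> Option<i64> {
        let mut acc = self.factor()?;
        while let Some(op) = self.peek() {
            match op {
                "*" => { self.bump(); acc *= self.factor()?; }
                "/" => {
                    self.bump();
                    let divisor = self.factor()?;
                    if divisor == 0 {
                        return None;
                    }
                    acc /= divisor;
                }
                _ => break,
            }
        }
        Some(acc)
    }

    // Highest level: numbers and parenthesized expressions.
    fn factor(&mut self) -> Option<i64> {
        match self.bump()? {
            "(" => {
                let value = self.expr()?;
                if self.bump()? == ")" { Some(value) } else { None }
            }
            tok => tok.parse().ok(),
        }
    }
}

/// Evaluates the whole input, or returns `None` on any syntax error.
fn eval(src: &str) -> Option<i64> {
    let mut parser = Parser::new(src);
    let value = parser.expr()?;
    if parser.pos == parser.tokens.len() { Some(value) } else { None }
}
```

Each binary-operator level gets its own nonterminal, so `1 + 2 * 3` groups as `1 + (2 * 3)` purely because `Term` sits below `Expr`. The cost is one nonterminal per precedence level, which is the duplication a macro or parameterized production can help reduce.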
Hej Niko
The status of my product is stable, and no bigger issues remain other than proper Unicode/UTF processing.
Today only the ASCII subset is supported.
I really want to contribute (and can!?), as I have a lot of time. I am retired but work at the computer
at least 8 hours a day.
As of writing I have a very tiny and expensive connection to the internet, just via phone, as the
WLAN supplier in this big city of Gothenburg has had trouble with my connection for more than a week. Crazy.
So please point me to the git repositories of the most complete LALRPOP project, so I don’t have to surf around too much.
I would also very much like some kind of use case or scenarios for how the function/tool
should be used.
Looking forward to the work and the result!
Kindly, Willy
|
Hej
I copied and pasted the grammar rules for Rust from https://doc.rust-lang.org/grammar.html
I did the same for Lua; for Smalltalk I read from 'Smalltalk-80 The language’,
and for Pascal I read Niklaus Wirth's ’Algorithms + Data Structures = Programs’.
I have not looked into the Rust compiler source or any projects on crates.io.
My vision with this work was to support writers of language documentation
by giving them tools to work with, and reference, subsets of proper, versioned grammar rules,
included as both BNF rules and railroad graphs.
I assume that there are no explicit language rules in the compiler sources,
but that the rules appear in some pre-step that generates part of the compiler.
That is why I didn’t dig into the compiler source.
So from some versioned preprocessing rule definitions there might be
a possibility to extract rule snippets to be converted to BNF and railroad graphs.
But I will look into both the source of the Rust compiler and some crates.io projects,
and LALRPOP too.
Kindly, Willy
|
You may also be interested in following rust-lang/reference#221. The people working on the reference have been making great progress, and the up-to-date grammar (at https://brauliobz.github.io/rust-reference/grammar.html) appears to be getting closer to complete. |
Sorry for disappearing! After writing those messages, I got totally overwhelmed (and this week I'm actually on vacation, so I'll probably be slow to reply again.) I'm a bit confused about which project you are referring to -- do you mean the railroad diagrams you referenced here? If so, that seems like a cool visualization technique, but presumably it requires an EBNF grammar to start? At the moment, that last part is what I am most interested in obtaining; note though that if we had a working LALRPOP grammar, it "desugars down" to a plain CFG internally, so we ought to be able to use your tool to visualize it.
That's great! Is it presently being tested? And, if so, how? |
Hej Niko
The project I’m referencing is my project #30942 (comment)
My focus is to keep the ability to generate railroad views from EBNF-style grammars,
and to have the verification function there (unused rules, missing rules, etc.).
At the moment I’m investigating some Rust grammar approaches:
1. The old yacc grammar from the Rust source
#30942 (comment)
2. Some md files in the Rust nursery
https://github.com/rust-lang-nursery/reference/tree/master/src
3. The work by Jason Orendorff. I think he said he wrote the grammar in order to understand Rust
when writing the book Programming Rust. What a grammar and what a book!!
https://github.com/jorendorff/rust-grammar/blob/master/Rust.g4
4. Niko's older(?) rustypop
https://github.com/nikomatsakis/rustypop/blob/master/src/rusty.lalrpop
5. And also LALRPOP itself
https://github.com/lalrpop/lalrpop
A. In most sources I struggled with the character set descriptions and all the regular expressions.
So I introduced, in my EBNF, 'Character Set Expressions' with set, union, difference and range.
It will show up soon.
B. I think Markdown markup is insufficient for a BNF.
Perhaps plain HTML snippets could be generated from the rules.
They can be styled to look like program source: color, font, etc.
C. EBNF decorated with types is a must. I’m not there yet.
D. Actions in the rules, like {} in yacc and => in LALRPOP, could
perhaps be specified in the EBNF as a (parametric) reference to other source,
like
pub Expr: Box<Expr> = {
Expr ExprOp Factor @action(ActionName,Expr),
Factor,
};
So the @, or a similar unused sign, could be an element of the syntax.
Clean the grammar of actions and relate them using ’@' instead.
The intersection of EBNF and LALRPOP could be ...
E. Or, perhaps best of all, have a tool that exports from LALRPOP to EBNF (with types)
and then continue generating railroad and markdown files from that source.
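The 'Character Set Expressions' idea from point A (set, union, difference and range) could be modelled along these lines. This is a guess at the shape of such expressions, not the design used in bnf2railroad; the `CharSet` type and the two example sets are hypothetical.

```rust
/// A character set expression built from the four operations named above.
enum CharSet {
    /// Inclusive range, e.g. 'a'..='z'.
    Range(char, char),
    /// Explicit set of characters.
    Set(Vec<char>),
    Union(Box<CharSet>, Box<CharSet>),
    /// Everything in the first set that is not in the second.
    Difference(Box<CharSet>, Box<CharSet>),
}

impl CharSet {
    fn contains(&self, c: char) -> bool {
        match self {
            CharSet::Range(lo, hi) => *lo <= c && c <= *hi,
            CharSet::Set(chars) => chars.contains(&c),
            CharSet::Union(a, b) => a.contains(c) || b.contains(c),
            CharSet::Difference(a, b) => a.contains(c) && !b.contains(c),
        }
    }
}

/// Hypothetical example: ASCII letters plus underscore.
fn ident_start() -> CharSet {
    CharSet::Union(
        Box::new(CharSet::Union(
            Box::new(CharSet::Range('a', 'z')),
            Box::new(CharSet::Range('A', 'Z')),
        )),
        Box::new(CharSet::Set(vec!['_'])),
    )
}

/// Hypothetical example: the digits 1 to 9, written as a difference.
fn nonzero_digit() -> CharSet {
    CharSet::Difference(
        Box::new(CharSet::Range('0', '9')),
        Box::new(CharSet::Set(vec!['0'])),
    )
}
```

Because the expression is a plain tree, the same value can be used both to test membership and to print a readable description in a railroad view.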
I’m really sorry for working mostly off road.
I will probably not update the above bnf2railroad, but
add a heavily refactored one as another project.
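The verification function mentioned at the top of this comment (unused rules, missing rules) can be sketched in Rust as follows. The representation is an assumption: each rule name maps to the list of rule names its body references, and `check_grammar` is a hypothetical name. "Unused" here simply means "never referenced by any rule and not the start rule", which is weaker than full reachability from the start rule.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Returns `(missing, unused)`:
/// - `missing`: names referenced in some rule body but never defined;
/// - `unused`: defined rules, other than `start`, that no rule references.
fn check_grammar(
    rules: &BTreeMap<&str, Vec<&str>>,
    start: &str,
) -> (BTreeSet<String>, BTreeSet<String>) {
    let mut missing = BTreeSet::new();
    let mut referenced = BTreeSet::new();
    for refs in rules.values() {
        for name in refs {
            referenced.insert(*name);
            if !rules.contains_key(name) {
                missing.insert(name.to_string());
            }
        }
    }
    let unused = rules
        .keys()
        .filter(|name| **name != start && !referenced.contains(*name))
        .map(|name| name.to_string())
        .collect();
    (missing, unused)
}
```

Run against a grammar extracted from the reference, a check like this would flag the "much is undefined" cases as `missing` entries.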
|
The grammar that's within the reference (the markdown files) is way overdue for
an update, and you are not the first to notice it. In fact, this RFC
happened specifically because of it.
Replacing the grammar definitions in the reference markdown files with
something else (such as railroad diagrams) would be super awesome, but before we
can do that we need a complete grammar definition anyway.
|
I don't know, I don't think there is anything formal set up. I think @brauliobz is doing most of the work, and he once mentioned that he was using Antlr4 to test. |
@nikomatsakis I'm sorry for the long overdue response.
I have a script that automatically scrapes large files from repos under the
I should have more free time in the coming months and I'd be happy to help on whatever you guys need the most help on (porting over a concrete grammar, generally improving LALRPOP, etc.) |
Triage: we have a |
Triage: we have a grammar WG working on the grammar now, and their work will end up in the reference. Is this issue still worth keeping open? |
@steveklabnik excellent. Is there a way to keep up with how the grammar WG's work is going? |
@steveklabnik Since there is now a dedicated repository and nobody has responded since January 8, I guess we can close this issue now right? |
Yep! |
This is the tracking issue for rust-lang/rfcs#1331, which specifies a procedure for creating a canonical grammar apart from the compiler. This is a multi-phase process. I think the first step, honestly, is just to lay out a firm plan of how to proceed -- what kind of automatic testing to use and so forth. The RFC provides a general plan but it needs to be made more concrete. Once this issue has an owner, I (or they) can update this summary and try to keep it up-to-date.