-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[.NET] Really slow parsing #63
Comments
Original comment by Laurent Wouters (Bitbucket: 557058:675792b6-d731-4823-9f7d-c6dfcb2df2b5, ). I've taken a look at the grammar and one possible explanation for large input file is the use of * and + operators in syntactic rules (not in terminals). As mentioned in https://cenotelie.fr/projects/hime/reference/lang-rules.html in section Repetition Operators, the parser generator will transform rules with those operators and inject variables to obtain a canonical form compatible with parsing algorithms. At runtime, those generated variables are recognized and the parser has logic to reconstruct what would be expected by the user according to the original grammar. For example
is re-written as
However, the produced AST will still look like a root labelled 'a' with a single child node labelled 'x' and one or more nodes labelled 'y'. This form is reconstructed at runtime and although the logic is quite performant for small cases, it can be expensive for huge input where the number of 'y' children in this case is very big. If this is indeed your case, one solution is to manually rewrite the grammar in the same manner to remove the use of the offending + (or *) operator. You will avoid the pathological case described above but the downside is the AST will have another form, perhaps slightly less readable. In your grammar, perhaps the base option would be to try to replace
with a recursive definition like
The AST then won't look flat like all global_expr at the root as before, but it should be way faster for a huge file with all expressions at the file's root (grammatically speaking). |
Original comment by David Aramant (Bitbucket: 557058:90c5a420-b317-4f31-bffe-8f5c9effcd9f, GitHub: davidaramant). Looks like you're right. Getting rid of the + for global expressions reduced the time to 7s. Interestingly when I removed the * for assignment expressions it increased to 8s, but I'll play with it some more. |
I just encountered the same issue when playing with the library for a simple rust based tree parser for the listops dataset. Applying the fix dropped the parsing time for the training set (90e3 lines) from 30 seconds to basically instant. However, the recursive form isn't as elegant as you noted. If you have time, could you link to the sections of the code which is doing the _genvar transformation/reconstruction? If I had to guess I would expect the slowdown to come from (either) the insertion and/or the reconstruction involving copies/clones, I would like to check it out and see if I can find a fix at some point. Thank you for open sourcing this very nice tool btw! |
Hi, yes this is a general problem. Here are links to the relevant locations in the implementation :
Basically, when constructing an AST with the use of a
we have the final rules:
Then, at parse-time, given the input
then we read the second
then we read the third
and finally at the end we reduce
This process is indeed not efficient when the |
Original report by David Aramant (Bitbucket: 557058:90c5a420-b317-4f31-bffe-8f5c9effcd9f, GitHub: davidaramant).
I've been trying out various parsing solutions for a hobby project and I'm having some troubles with Hime. I've got a parser working, but it's incredibly slow compared to other frameworks (>16s for Hime and <1s for ANTLR on a large input file). I'm not sure if there's a problem with the grammar that's causing (I'm fairly new to parser frameworks) this or some other issue.
The Hime grammar file is here. For reference, the spec for the format I'm parsing helpfully includes a grammar.
The text was updated successfully, but these errors were encountered: