Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate if ECJ's parser be simplified by incorporating overlooked last minute spec revisions #1045

Open
srikanth-sankaran opened this issue May 10, 2023 · 9 comments
Assignees

Comments

@srikanth-sankaran
Copy link
Contributor

Per observations from @stephan-herrmann here: #125 (comment)
we may have inadvertently overlooked last minute changes to Java 8 spec that may have allowed us to simplify the Parser/Scanner in ECJ.

See

https://bugs.eclipse.org/bugs/show_bug.cgi?id=561934#c4
https://bugs.eclipse.org/bugs/show_bug.cgi?id=561934#c10

@trancexpress
Copy link
Contributor

Thanks for looking into this @srikanth-sankaran !

@stephan-herrmann
Copy link
Contributor

In an attempt to clarify what can/cannot be done I simplified the Java grammar to a minimum that still shows a remaining conflict:

mini2.txt
Rename this file to mini.g and pass it to jikespg, it will show the dreaded

*** Shift/reduce conflict on "MULTIPLY" with rule 2

This grammar is not LALR(1).

In all the rest of the Java grammar, productions are chained in a way that anything on the RHS always has a reduced set of possibilities, ensuring that the right-most non-terminal will never branch back to the current LHS.

In the reduced version Cast ::= LParent Identifier RParent Lambda violates that rule, because Lambda may end with a generic Expr.

As a result this expression is ambiguous:

(Type) () -> a * b

The parser cannot decide which way to interpret (denoted by resulting AST):

  • Mul ( Cast ( Lambda (Identifier)), Identifier )
  • Cast ( Lambda ( Mul (Identifier, Identifier) ) )

IOW, this mini grammar (and transitively the Java grammar) cannot decide if the trailing *b binds to the inner expression a or to the entire expression so far (Type) () -> a.

I couldn't yet find documentation of all of those options of jikespg, if something there allows to define precedence to the "inner" rule?

Semantically, multiplying a cast with some factor normally doesn't make sense, but when casting to a numerical boxing type this may still do something.

Casting a lambda, however, will never produce a numerical type, as none of those are functional interfaces, not sure if this can be turned into something useful at the syntax level ...

@stephan-herrmann
Copy link
Contributor

stephan-herrmann commented Dec 17, 2023

The following change would fix the mini grammar, to be LALR(1):

--- mini.g
+++ mini.g
@@ -37,6 +37,7 @@
 
 Expr -> Mul
 Expr -> Lambda
+Expr -> CastL
 
 Mul -> Unary
 Mul ::= Mul MULTIPLY Unary
@@ -45,7 +46,7 @@
 Unary -> Cast
 
 Cast ::= LParen Identifier RParen Primary
-Cast ::= LParen Identifier RParen Lambda
+CastL ::= LParen Identifier RParen Lambda
 
 Primary -> Identifier

Explanation: I pulled the casted-lambda rule out of the entire tree of productions applying some kind of arithmetic, assuming that you simply cannot "compose" a (casted) lambda with any other value.

Would this illegally enable use of casted-lambda in a context where it's illegal? I think that could be compensated by either of

  • find a more suitable nonterminal to hook into
  • detect illegal use later during compilation

Indeed also the actual java.g can be disambiguated using this trick 🎉 😄 ✨

PS: I could imagine that such change would even be worth pushing upstream, i.e., to JLS.

@stephan-herrmann
Copy link
Contributor

stephan-herrmann commented Dec 17, 2023

Java is funny.

A casted lambda expression can actually be used as an operand in certain expressions.

Need to check if we break anything regarding:

  • instanceof
  • != and == comparisons
  • ternaries
  • assignment expressions

E.g.:

  	if (((Supplier<String>) () -> a) instanceof Supplier)
	    System.out.print("yes");
  	if (((Supplier<String>) () -> a) != null)
	    System.out.print("yes");

This is accepted by javac and prints "yesyes" :)

From experiments in this area I learned that instanceof normally binds to the inner expression, to bind it to the lambda parens are needed. This could mean that these rules will never directly branch into (casted) lambda but need ( Expression ) to get there, in which case the parens will disambiguate.

Actually, ecj barfs on the above:

1. ERROR in /tmp/X.java (at line 13)
        if (((Supplier<String>) () -> "a") != null)
                                ^^^^^^^^^
Missing code implementation in the compiler

Let's hope that grammar simplification will avoid this unspeakable error message, and causes no further havoc. In fact I wonder if anyone would insist that ecj supports checking a lambda for null, or performing instanceof checks. In both cases the lambda will be left to gc before it can ever be invoked.

@srikanth-sankaran
Copy link
Contributor Author

if (((Supplier<String>) () -> a) instanceof Supplier)
	    System.out.print("yes");
  	if (((Supplier<String>) () -> a) != null)
	    System.out.print("yes");

I have raised #1755

@srikanth-sankaran
Copy link
Contributor Author

Thanks for the various experiments and the suggestions Stephan, I will study these. I am missing my Dragon book today, will have it in my hands tomorrow.

@stephan-herrmann
Copy link
Contributor

Genie reminded me that we have some unfinished work still in gerrit:

Seeing mention of fallBackToSpringForward() and friends, I guess we want to wait for the grammar simplification, before polishing this code area. Perhaps some of the magic will no longer be needed at some point :)

@srikanth-sankaran
Copy link
Contributor Author

In an attempt to clarify what can/cannot be done I simplified the Java grammar to a minimum that still shows a remaining conflict:

mini2.txt Rename this file to mini.g and pass it to jikespg, it will show the dreaded

*** Shift/reduce conflict on "MULTIPLY" with rule 2

This grammar is not LALR(1).

In all the rest of the Java grammar, productions are chained in a way that anything on the RHS always has a reduced set of possibilities, ensuring that the right-most non-terminal will never branch back to the current LHS.

In the reduced version Cast ::= LParent Identifier RParent Lambda violates that rule, because Lambda may end with a generic Expr.

As a result this expression is ambiguous:

(Type) () -> a * b

The parser cannot decide which way to interpret (denoted by resulting AST):

  • Mul ( Cast ( Lambda (Identifier)), Identifier )
  • Cast ( Lambda ( Mul (Identifier, Identifier) ) )

IOW, this mini grammar (and transitively the Java grammar) cannot decide if the trailing *b binds to the inner expression a or to the entire expression so far (Type) () -> a.

I couldn't yet find documentation of all of those options of jikespg, if something there allows to define precedence to the "inner" rule?

I don't know either - but I am not averse to, (at some point) - if such an options are not available, building them on an experimental basis to see where that would lead us. IIRC, YACC allows you to say precedence between two rules for shift/reduce (and where no precedence is specified always prefers shift) and for reduce/reduce it always opt for the earlier listed rule.

Semantically, multiplying a cast with some factor normally doesn't make sense, but when casting to a numerical boxing type this may still do something.

Casting a lambda, however, will never produce a numerical type, as none of those are functional interfaces, not sure if this can be turned into something useful at the syntax level ...

@srikanth-sankaran
Copy link
Contributor Author

Two big blockers for certain types of experiments in the area of parser/scanner are:

  1. The DiagnoseParser with its too much voodoo in the works. The fact that there are multiple machines that are run in parallel with too much of the automaton's innards and tables being manipulated and more importantly, the fact that tokens are read ahead of time and buffered into a stream mean that we can never have a satisfactory implementation of org.eclipse.jdt.internal.compiler.parser.diagnose.DiagnoseParser.automatonWillShift(int) (that can be useful for the scanner)

This method is a fantastic side-effect free mechanism to query the parser of its current state - It is axiomatic that an LALR parser will never shift on invalid input - so this is a great tool to steer the scan and the parse.

Presently this method is used to disambiguate when when is a contextual keyword vs a plain old identifier. Ask the parser if it will shift the keyword when and if it answers yes, then the when in the input stream is a contextual keyword and otherwise otherwise.

It is possible to build this disambiguation into DiagnoseParser's internal token stream but that is some work.

  1. CompletionParser which works with incomplete evolving, broken code.

In general building smarts falls in some buckets.

(a) Feedback from the parser to the scanner: this allows consumed input to steer classification of unconsumed input. This is looking back in the rear view mirror. (b) Looking ahead to detect structure of what is to come to classify current token (perhaps inject some synthetics to steer the parser in a particular path) - this is Vanguard parser's forte.
and
(c) The now abandoned String Templates work had a brand new way of dealing conflicts in the grammar - by making some non-terminals symbols (with their own grammatical structure) appear as (pseudo) terminal symbols. Unfortunately before we could gain good hands on experience with that model, we had to rip it out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants