
PEG syntax considerations #17

Merged (4 commits, Oct 19, 2023)

Conversation

@jontxu (Member) commented Oct 18, 2023

(I added @gmantele and @msdemlei as reviewers for the time being, feel free to add whomever you feel adequate.)

Hi all,

as PEG adoption for the ADQL grammar definition draws near(er), I have been testing different parser generators recently. Namely:

I discovered by trial and error that their definitions of PEG and its syntax differ from the one given by Bryan Ford in the original paper (cf. Figure 1 and Table 1). In other words, the original syntax reads like a formal definition rather than a standard one -- there doesn't actually seem to be any standard. Other, much more idiomatic tools exist, i.e. tools fitting their respective programming languages.

These deviations range from aesthetic changes (separators and delimiters) to extensions and, in some cases, syntactic sugar -- you can see them in the links above. To give a couple of examples:

  • Mouse only allows rules to be defined in camelCase, because the underscore character _ is an extra operator.
  • Mouse has issues with character classes containing multiple groups; that is, the formal rule <- [a-zA-Z] becomes rule = [a-z] / [A-Z]
  • Negation: peg(1) allows character classes to be negated with [^s], the Mouse counterpart being ^[s], in both cases instead of the more formal ![s]

In addition, both also define the first rule as the root of the parse tree... unlike arpeggio, which is used by lyonetia.

I think that, for completeness, following the formal definition would be the sensible decision, possibly adding some clarifications for deployers regarding the disparities between different generators, and optionally offering some tools for conversion.


In this case, I have updated the adql2.1.peg file to reflect the syntax as defined by Bryan Ford, and made the required changes to the testadql.py file so that the conversion to the arpeggio-compliant PEG syntax does not fail. Sadly, the tests fail, which we will need to investigate.

Finally, other errors encountered by the parsers are:

  • The rule numeric_primary was defined twice (fixed: see 486116d)
  • The rules ANY_CHAR, unsigned_hexadecimal, numeric_expression, string_expression, string_function, bitwise_expression, and geometry_function are defined but not used

@Zarquan (Member) commented Oct 18, 2023

For reference, Bryan Ford's paper has a DOI: https://doi.org/10.1145/982962.964011.

@Zarquan (Member) commented Oct 18, 2023

I agree: in the absence of a standard, it is probably better to stay close to the original paper than to an implementation-specific variant.

Given that we chose PEG because it is machine readable, would it be difficult to write a conversion tool that takes a Bryan Ford-compatible grammar and converts it into a form that the other tools can read?

@jontxu (Member, Author) commented Oct 18, 2023

> Given that we chose PEG because it is machine readable, would it be difficult to write a conversion tool that takes a Bryan Ford-compatible grammar and converts it into a form that the other tools can read?

Yes... but it can get tricky, as it depends on the tool. Bear in mind that we are talking about a small subset of the tools mentioned on Bryan Ford's page. The more idiomatic ones rely on the language itself and have functions like or(...), any(...), not(...) and such, which I assume are then consumed by the packrat parsers.
What we should aim for, I think, is to encourage people to avoid (whenever possible) using such idiomatic tools -- so that an update to the grammar does not involve a major change in their respective source codes. Or, additionally, to have an appendix with a list of converters: some IVOA-endorsed, others provided by the deployers themselves.

As for the character groups, as said above, Mouse seems to favour the ordered-choice operator when multiple sets are involved (when the opposite is actually recommended). For arpeggio, the same character sets are defined as Python regexes. Luckily, the grammar included in this pull request works out of the box with peg(1)... but that tool has been unmaintained since 2017 -- or at least no new releases have happened since.

In short, given a rule hex_digit <- [0-9a-fA-F]:

  • peg(1) / formal PEG: hex_digit <- [0-9a-fA-F]
  • arpeggio: hex_digit <- r'[0-9a-fA-F]'; (I am not entirely sure whether the r is necessary in this case, to be fair)
  • Mouse: hexDigit = [0-9] / [a-f] / [A-F];
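These three spellings can be produced mechanically. Below is a minimal sketch in Python for a single-character-class rule like the hex_digit example; the helper names to_arpeggio and to_mouse are hypothetical, not part of the repository:

```python
import re

def to_arpeggio(rule: str) -> str:
    """Wrap each character class in an arpeggio regex literal: [..] -> r'[..]'."""
    return re.sub(r"(\[[^\]]*\])", r"r'\1'", rule) + ";"

def to_mouse(rule: str) -> str:
    """Mouse form: camelCase rule name, '=' instead of '<-', and the
    multi-range class split into one single-range class per alternative."""
    name, _, body = (part.strip() for part in rule.partition("<-"))
    head, *rest = name.split("_")
    camel_name = head + "".join(word.capitalize() for word in rest)
    ranges = re.findall(r".\-.", re.search(r"\[([^\]]*)\]", body).group(1))
    return f"{camel_name} = " + " / ".join(f"[{r}]" for r in ranges) + ";"

print(to_arpeggio("hex_digit <- [0-9a-fA-F]"))  # hex_digit <- r'[0-9a-fA-F]';
print(to_mouse("hex_digit <- [0-9a-fA-F]"))     # hexDigit = [0-9] / [a-f] / [A-F];
```

The Mouse converter assumes the class body is a plain sequence of ranges like 0-9; a real converter would also need to handle single characters and escapes.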

By the way, I actually forgot to mention one of the dealbreakers (so to speak) for Mouse: it uses quotation marks and apostrophes the same way Java does: 's' for a character, "String" for a string.

@msdemlei (Contributor) left a comment

Thanks for tackling these. Some minor nits, and perhaps more seriously: how can this grammar work without numeric_primary?

src/peg/adql2.1.peg (outdated)
@@ -482,10 +477,6 @@ user_defined_function <-
(_ value_expression (_ ',' _ value_expression)* _)?
')'

numeric_primary <-
value_expression_primary
/ numeric_value_function
Contributor:

Huh? How can this disappear without breaking the grammar?

@jontxu (Member, Author) commented Oct 18, 2023

It was duplicated for some unknown reason: line 485 and line 558.

peg_rules = re.sub("\\[", "r'[", peg_rules)
peg_rules = re.sub("\\!r'\\[", "r'[^", peg_rules)
peg_rules = re.sub("\\]", "]'", peg_rules)
peg_rules = re.sub("EOF <-[^;]*;", "", peg_rules)
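For illustration, here is a trace of these four substitutions on two tiny sample rules (a sketch: in testadql.py the chain runs over the whole grammar text, and the hack_to_arpeggio wrapper name is hypothetical):

```python
import re

def hack_to_arpeggio(peg_rules: str) -> str:
    # Same substitution chain as quoted above, wrapped in a hypothetical
    # helper for demonstration purposes.
    peg_rules = re.sub("\\[", "r'[", peg_rules)        # [ -> r'[   (open regex literal)
    peg_rules = re.sub("\\!r'\\[", "r'[^", peg_rules)  # ![...] -> r'[^...  (negation)
    peg_rules = re.sub("\\]", "]'", peg_rules)         # ] -> ]'    (close regex literal)
    peg_rules = re.sub("EOF <-[^;]*;", "", peg_rules)  # drop the explicit EOF rule
    return peg_rules

print(hack_to_arpeggio("hex_digit <- [0-9a-fA-F]"))  # hex_digit <- r'[0-9a-fA-F]'
print(hack_to_arpeggio("non_digit <- ![0-9]"))       # non_digit <- r'[^0-9]'
```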
Contributor:

I'm not sure I'm too enthused about hacking the grammar syntax with regular expressions. For now, that's probably all right, but we should have a plan to parse our grammar into a tree and then serialise it in the various syntaxes, I guess. I just wonder whether someone else might already have written something like that...

Member:

That's indeed a very interesting idea, and much more stable/reliable than regular expressions; though apparently the current approach is enough and allowed us to quickly test this grammar with different parsers.

Member (Author):

The hacking is there to fit arpeggio, which asks for regular expressions (see its documentation on grammars):

> Regex matches are given as strings with prefix r (e.g. r'\d*\.\d*|\d+').

We'd need to think about an alternate toolset which doesn't rely on them to test the ADQL examples.

@gmantele (Member) commented:
As discussed privately, there is another parser generator called Canopy that can generate a parser in four languages: Java, JavaScript, Python, and Ruby.

@jontxu (Member, Author) commented Oct 18, 2023

> As discussed privately, there is another parser generator called Canopy that can generate a parser in four languages: Java, JavaScript, Python, and Ruby.

To add to this, the major differences with the formal PEG grammar are:

  1. A declaration on the first line, with the format grammar <Name>.
  2. Underscore characters need whitespace before non-identifier characters: _) becomes _ ), _' becomes _ ', and so on.

With these changes, I was able to successfully generate parsers in all four languages.
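As a sketch, change (2) can be automated with a single substitution; the helper name to_canopy and the sample rule are hypothetical, not repository code:

```python
import re

def to_canopy(name: str, peg_rules: str) -> str:
    # Hypothetical helper illustrating the two Canopy tweaks described above:
    # insert a space after '_' whenever a non-identifier character follows,
    # and prepend the mandatory 'grammar <Name>' declaration.
    peg_rules = re.sub(r"_(?=[^\w\s])", "_ ", peg_rules)
    return f"grammar {name}\n\n{peg_rules}"

print(to_canopy("ADQL", "subquery <- '(' _ query _')'"))
# grammar ADQL
#
# subquery <- '(' _ query _ ')'
```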


EDIT: after the generated Python parser complained about reserved keywords and I fixed the PEG grammar accordingly (locally), it now raises IndentationError: too many levels of indentation.

@msdemlei (Contributor) left a comment

Let's merge

@jontxu jontxu merged commit ecfeb21 into ivoa:master Oct 19, 2023
1 of 2 checks passed
@jontxu (Member, Author) commented Oct 19, 2023

Merge done.

Although I think this should be discussed further elsewhere: a GitHub issue, the mailing list...

4 participants