iXML
Invisible XML (iXML) proves to be a capable way to parse a text-based format in an XSLT environment, producing a structure that can be processed further.
Grammar for a reduced LMNL "sawtooth" syntax, in iXML:
LMNL: (tag, text?)*, tag.
text: char+.
-char: ~["[";"{";"\"]; "\["; "\{"; "\\". { \ as escape character is also escaped so we can represent '\[' }
-tag: (start | end | empty).
start: -"[", gi?, ws?, annotation*, ws?, -"}".
end: -"{", gi?, ws?, annotation*, ws?, -"]".
empty: -"[", gi?, ws?, annotation*, ws?, -"]".
@gi: name, ("#", cc+)?.
@name: ic, cc*.
ic: [L].
cc: ic; ["0"-"9"]; "."; "_"; "-"; ":".
annotation: -"[", name?, -"}", -text?, ae.
-ae: -"{]".
-ws: (" "|#9|#d|#a)+. { SPACE TAB CR LF }
See https://johnlumley.github.io/jwiXML.xhtml for an iXML workspace.
To do: stress test for top-level ambiguities, etc.
TBD - structured annotations, character references, PIs and comments ...
The text to be parsed must start and end with tags (start, end or empty) or a parse error is returned.
A number of issues must be caught at the next level by examining the result tree (see below): this grammar produces only the rough input for deriving a LMNL model from the marked-up text.
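To experiment with the token-level behavior of the grammar outside an iXML processor, it can be approximated with regular expressions. The following is a minimal sketch in Python, not the iXML implementation: the function name `tokenize` and the `(kind, gi)` token shape are assumptions made for illustration, ASCII letters stand in for the grammar's `[L]` class, and annotations are recognized but discarded.

```python
import re

# Regular-expression approximations of the grammar rules (a sketch, not iXML).
CC = r"[A-Za-z0-9._:-]"             # cc, with ASCII letters standing in for [L]
NAME = r"[A-Za-z]" + CC + r"*"      # name: ic, cc*
GI = NAME + r"(?:#" + CC + r"+)?"   # gi: name, ("#", cc+)?
TEXT = r"(?:[^\[{\\]|\\[\[{\\])+"   # text: char+ (with \[ \{ \\ escapes)
WS = r"[ \t\r\n]+"                  # ws: SPACE TAB CR LF
ANNOTATION = r"\[(?:" + NAME + r")?\}(?:" + TEXT + r")?\{\]"
BODY = (r"(?P<gi>" + GI + r")?(?:" + WS + r")?(?:" + ANNOTATION + r")*(?:" + WS + r")?")

TAG_PATTERNS = [
    ("start", re.compile(r"\[" + BODY + r"\}")),
    ("end", re.compile(r"\{" + BODY + r"\]")),
    ("empty", re.compile(r"\[" + BODY + r"\]")),
]
TEXT_PATTERN = re.compile(TEXT)


def tokenize(source):
    """Split sawtooth syntax into (kind, gi) tokens; text tokens carry the text."""
    tokens = []
    pos = 0
    while pos < len(source):
        for kind, pattern in TAG_PATTERNS:
            match = pattern.match(source, pos)
            if match:
                tokens.append((kind, match.group("gi") or ""))
                break
        else:
            match = TEXT_PATTERN.match(source, pos)
            if not match:
                raise ValueError(f"unexpected character at offset {pos}")
            tokens.append(("text", match.group(0)))
        pos = match.end()
    # Top-level rule: LMNL: (tag, text?)*, tag.
    if not tokens or tokens[0][0] == "text" or tokens[-1][0] == "text":
        raise ValueError("input must start and end with a tag")
    return tokens
```

For example, `tokenize("[a}one [b}two{a] three{b]")` yields a start tag for `a`, text, a start tag for `b`, text, an end tag for `a`, text, and an end tag for `b`. Input that does not begin and end with a tag raises `ValueError`, mirroring the parse error described above.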
The grammar emits a format that can be cast into a range model, but it does not capture all of LMNL. In particular:
- no support for structured annotations, only flat 'values' as annotations
- only abbreviated annotation syntax is supported
- names must start with a letter; subsequent characters are limited to letters, digits, ".", "_", "-" and ":"
- item objects, processing instructions, comments, LMNL declaration and namespaces are not supported
- ambiguities related to tag ordering are prevented by forbidding tag-only overlap (when range A's end tag appears directly after, not before, B's start tag)
To be supported (in the LMNL model):
- overlapping ranges
- arbitrary range (type) names including declarative names
- empty ranges
- anonymous ranges and annotations
- 'self' overlap ("sibling rivalry")
LMNL is well-formed when it follows the grammar and delivers a parse.
LMNL is properly tagged when
- it is well-formed
- all tagging lines up, with no mismatches or missing tags
- no end tags appear before the first text content, and no start tags after the last text content
- adjoining tags are given in the order end, empty, start
- this ordering prevents tag-only overlap from intruding on a simple processing model (where ranges may be ordered but not tags)
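The checks in this list can be sketched as a pass over a flat token sequence. The following Python is an illustration only: the `(kind, name)` token shape and the function name `check_properly_tagged` are hypothetical, and ranges are matched by counting open start tags per name, which is one possible reading of "lines up" that tolerates overlap.

```python
from collections import Counter

# Required order within a run of adjoining tags: end, then empty, then start.
RANK = {"end": 0, "empty": 1, "start": 2}


def check_properly_tagged(tokens):
    """Return a list of problems; tokens are (kind, name) pairs (assumed shape)."""
    problems = []
    open_counts = Counter()   # open start tags per range name
    seen_text = False
    last_text_index = max(
        (i for i, (kind, _) in enumerate(tokens) if kind == "text"), default=-1
    )
    previous_rank = 0
    for i, (kind, name) in enumerate(tokens):
        if kind == "text":
            seen_text = True
            previous_rank = 0          # a new run of adjoining tags begins
            continue
        if RANK[kind] < previous_rank:
            problems.append(f"{kind} tag '{name}' breaks the order end, empty, start")
        previous_rank = RANK[kind]
        if kind == "start":
            if i > last_text_index:
                problems.append(f"start tag '{name}' after the last text content")
            open_counts[name] += 1
        elif kind == "end":
            if not seen_text:
                problems.append(f"end tag '{name}' before the first text content")
            if open_counts[name] == 0:
                problems.append(f"end tag '{name}' has no matching start tag")
            else:
                open_counts[name] -= 1
    for name, count in open_counts.items():
        if count:
            problems.append(f"missing end tag for '{name}'")
    return problems
```

Overlapping ranges such as start `a`, text, start `b`, text, end `a`, text, end `b` pass without problems, while a start tag for `b` directly followed by an end tag for `a` is reported as an ordering problem (tag-only overlap).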
Note that LMNL syntax that is well-formed but not properly tagged can still be rendered for display (to show errors).
A LMNL syntax transpiler could similarly produce a well-formed LMNL AST, marked up with error notices as well as information produced by scanning, that is also suitable (when error-free) as input to the full LMNL range model.
LMNL is valid when it conforms to a schema - rules such as
- which tags (range names or range type names) are permitted
- which annotations are recognized for which tags
- cardinality constraints for ranges and their annotations
- nesting/overlap constraints - what is and is not permitted to overlap
- datatype restrictions (lexical and semantic) over annotations
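As a rough illustration of the first three kinds of rule, a schema could be modeled as a mapping from range names to their recognized annotations and a maximum count. Everything in this Python sketch is hypothetical (the schema shape, the `(name, annotations)` range shape, the names `SCHEMA` and `validate`, and the sample range names); LMNL does not prescribe this format.

```python
from collections import Counter

# Hypothetical schema: permitted range names, their recognized annotations,
# and an optional maximum number of occurrences (None means unbounded).
SCHEMA = {
    "poem": {"annotations": {"title"}, "max": 1},
    "line": {"annotations": {"n"}, "max": None},
}


def validate(ranges, schema):
    """Check (name, annotations) pairs against the schema; return problems."""
    problems = []
    counts = Counter(name for name, _ in ranges)
    for name, annotations in ranges:
        rule = schema.get(name)
        if rule is None:
            problems.append(f"range '{name}' is not permitted")
            continue
        for annotation in annotations:
            if annotation not in rule["annotations"]:
                problems.append(
                    f"annotation '{annotation}' is not recognized on '{name}'"
                )
    for name, count in counts.items():
        rule = schema.get(name)
        if rule and rule["max"] is not None and count > rule["max"]:
            problems.append(f"range '{name}' occurs {count} times, limit {rule['max']}")
    return problems
```

Overlap constraints and datatype restrictions would need the range model (start and end offsets, annotation values) and are left out of this sketch.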