New parser: event-based policy #414

Merged
biojppm merged 11 commits into master from newparser
May 8, 2024

@biojppm biojppm commented Mar 26, 2024

Also:

Parser refactor

The parser was completely refactored (#PR414). This was a large and hard job carried out over several months, and the result is:

  • A new event-based parser engine is now in place, enabling the improvements described below. This engine uses a templated event handler, where each event is a function call, which avoids branching on the event type in the handler. The parsing code was fully rewritten, and is now much simpler (albeit longer), and much easier to work with and fix.
  • YAML standard-conformance was improved significantly. Along with many smaller fixes and additions (too many to list here), the main changes are the following:
    • The parser engine can now successfully parse container keys, emitting all the events in the correct order, but as before, the ryml tree cannot accommodate these (this constraint is no longer enforced by the parser, but instead by EventHandlerTree). For an example of a handler which can accommodate container keys, see the one used for the test suite at test/test_suite/test_suite_event_handler.hpp
    • Anchor keys can now be terminated with a colon (e.g., &anchor: key: val), as dictated by the standard.
  • The parser engine can now be used to create native trees in other programming languages, or in cases where the user must have container keys.
  • Parsing performance improved (benchmark results incoming) from reduced parser branching.
  • Emitting performance improved (benchmark results incoming), as the emitting code no longer has to read the full scalars to decide on an appropriate emit style.
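
To make the "templated event handler" idea concrete, here is a minimal self-contained sketch. It is not the ryml engine: `ToyEngine`, `EventPrinter` and `ScalarCounter` are hypothetical names, and the toy grammar only covers a single flat flow sequence. The point it tries to show is that the engine calls one handler function per event, so the handler never has to switch on an event type, and a different handler can be plugged in at compile time.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical handler: records each event as a line of text.
struct EventPrinter
{
    std::string out;
    void begin_seq() { out += "+SEQ\n"; }
    void end_seq() { out += "-SEQ\n"; }
    void scalar(std::string_view s) { out += "=VAL :"; out += s; out += '\n'; }
};

// Hypothetical handler: only counts scalars, ignores structure.
struct ScalarCounter
{
    std::size_t count = 0;
    void begin_seq() {}
    void end_seq() {}
    void scalar(std::string_view) { ++count; }
};

// Toy engine (an assumption for illustration): parses only a single
// flat flow sequence such as "[a, b, c]" and forwards events to the handler.
template<class Handler>
class ToyEngine
{
public:
    explicit ToyEngine(Handler *handler) : m_handler(handler) {}

    bool parse(std::string_view src)
    {
        if(src.size() < 2 || src.front() != '[' || src.back() != ']')
            return false;
        m_handler->begin_seq();              // one function call per event
        std::string_view body = src.substr(1, src.size() - 2);
        std::size_t pos = 0;
        while(pos <= body.size())
        {
            std::size_t comma = body.find(',', pos);
            if(comma == std::string_view::npos)
                comma = body.size();
            std::string_view item = trim(body.substr(pos, comma - pos));
            if(!item.empty())
                m_handler->scalar(item);
            pos = comma + 1;
        }
        m_handler->end_seq();
        return true;
    }

private:
    static std::string_view trim(std::string_view s)
    {
        while(!s.empty() && s.front() == ' ') s.remove_prefix(1);
        while(!s.empty() && s.back() == ' ') s.remove_suffix(1);
        return s;
    }
    Handler *m_handler;
};
```

Swapping `EventPrinter` for `ScalarCounter` changes what is built from the same parse without touching the engine, which is the same reason a handler other than the tree-building one can accept container keys or build native objects in another language.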

Strict JSON parser

  • A strict JSON parser was added. Use the parse_json_...() family of functions to parse JSON in a stricter (and faster) mode than flow-style YAML.

YAML style preserved while parsing

  • The YAML style information is now fully preserved through parsing/emitting round trips. This was made possible because the event model of the new parsing engine now incorporates style varieties. So, for example:
    • a scalar parsed from a plain/single-quoted/double-quoted/block-literal/block-folded scalar will always be emitted using its original style in the YAML source
    • a container parsed in block-style will always be emitted in block-style
    • a container parsed in flow-style will always be emitted in flow-style
      Because of this, the style of YAML emitted by ryml changes from previous releases.
  • Scalar filtering was improved and is now done directly in the source being parsed (which may be in place or in the arena), except in the cases where the scalar expands and does not fit its initial range, in which case the scalar is filtered out of place to the tree's arena.
    • Filtering can now be disabled while parsing, to ensure a fully-readonly parse (but this feature is still experimental and somewhat untested, given the scope of the rewrite work).
    • The parser now offers methods to filter scalars in place or out of place.
  • Style flags were added to NodeType_e:
      FLOW_SL     ///< mark container with single-line flow style (seqs as '[val1,val2]', maps as '{key: val,key2: val2}')
      FLOW_ML     ///< mark container with multi-line flow style (seqs as '[\n  val1,\n  val2\n]', maps as '{\n  key: val,\n  key2: val2\n}')
      BLOCK       ///< mark container with block style (seqs as '- val\n', maps as 'key: val')
      KEY_LITERAL ///< mark key scalar as multiline, block literal |
      VAL_LITERAL ///< mark val scalar as multiline, block literal |
      KEY_FOLDED  ///< mark key scalar as multiline, block folded >
      VAL_FOLDED  ///< mark val scalar as multiline, block folded >
      KEY_SQUO    ///< mark key scalar as single quoted '
      VAL_SQUO    ///< mark val scalar as single quoted '
      KEY_DQUO    ///< mark key scalar as double quoted "
      VAL_DQUO    ///< mark val scalar as double quoted "
      KEY_PLAIN   ///< mark key scalar as plain scalar (unquoted, even when multiline)
      VAL_PLAIN   ///< mark val scalar as plain scalar (unquoted, even when multiline)
    
  • Style predicates were added to NodeType, Tree, ConstNodeRef and NodeRef:
      bool is_container_styled() const;
      bool is_block() const;
      bool is_flow_sl() const;
      bool is_flow_ml() const;
      bool is_flow() const;
    
      bool is_key_styled() const;
      bool is_val_styled() const;
      bool is_key_literal() const;
      bool is_val_literal() const;
      bool is_key_folded() const;
      bool is_val_folded() const;
      bool is_key_squo() const;
      bool is_val_squo() const;
      bool is_key_dquo() const;
      bool is_val_dquo() const;
      bool is_key_plain() const;
      bool is_val_plain() const;
    
  • Style modifiers were also added:
      void set_container_style(NodeType_e style);
      void set_key_style(NodeType_e style);
      void set_val_style(NodeType_e style);
    
  • Emit helper predicates were added, and are used when an emitted node was built programmatically without style flags:
    /** choose a YAML emitting style based on the scalar's contents */
    NodeType_e scalar_style_choose(csubstr scalar) noexcept;
    /** query whether a scalar can be encoded using single quotes.
     * It may not be possible, notably when there is leading
     * whitespace after a newline. */
    bool scalar_style_query_squo(csubstr s) noexcept;
    /** query whether a scalar can be encoded using plain style (no
     * quotes, not a literal/folded block scalar). */
    bool scalar_style_query_plain(csubstr s) noexcept;
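
As a rough illustration of how the style flags, predicates and modifiers listed in this section fit together, here is a self-contained sketch. It is not the ryml implementation: `ToyStyle`, `ToyNodeType`, the two `TOY_..._MASK` values and all the bit values are assumptions, and only the flag and predicate names are taken from the lists above. It also assumes that a setter replaces any previous style of the same kind, which is not stated in this PR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bit values; only the flag names come from the list above.
enum ToyStyle : std::uint32_t
{
    FLOW_SL     = 1u << 0,
    FLOW_ML     = 1u << 1,
    BLOCK       = 1u << 2,
    VAL_LITERAL = 1u << 3,
    VAL_FOLDED  = 1u << 4,
    VAL_SQUO    = 1u << 5,
    VAL_DQUO    = 1u << 6,
    VAL_PLAIN   = 1u << 7,
    // hypothetical masks grouping the flags of each kind
    TOY_CONTAINER_MASK = FLOW_SL | FLOW_ML | BLOCK,
    TOY_VAL_MASK = VAL_LITERAL | VAL_FOLDED | VAL_SQUO | VAL_DQUO | VAL_PLAIN
};

// Hypothetical stand-in for a node type carrying style bits.
struct ToyNodeType
{
    std::uint32_t bits = 0;

    // predicates: each one is a single mask test
    bool is_container_styled() const { return (bits & TOY_CONTAINER_MASK) != 0; }
    bool is_block() const { return (bits & BLOCK) != 0; }
    bool is_flow_sl() const { return (bits & FLOW_SL) != 0; }
    bool is_flow_ml() const { return (bits & FLOW_ML) != 0; }
    bool is_flow() const { return (bits & (FLOW_SL | FLOW_ML)) != 0; }
    bool is_val_styled() const { return (bits & TOY_VAL_MASK) != 0; }
    bool is_val_dquo() const { return (bits & VAL_DQUO) != 0; }
    bool is_val_plain() const { return (bits & VAL_PLAIN) != 0; }

    // modifiers: clear the old style of that kind, then set the new one
    // (assumed behaviour, chosen so that styles never accumulate)
    void set_container_style(ToyStyle style)
    {
        bits = (bits & ~std::uint32_t(TOY_CONTAINER_MASK)) | (style & TOY_CONTAINER_MASK);
    }
    void set_val_style(ToyStyle style)
    {
        bits = (bits & ~std::uint32_t(TOY_VAL_MASK)) | (style & TOY_VAL_MASK);
    }
};
```

Keeping container and scalar styles in disjoint bit ranges lets a node carry both at once, for example a block map whose value is double quoted.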
    

Breaking changes

As a result of the refactor, there are some limited changes that impact client code. Even though this was a large refactor, effort was directed at keeping maximal backwards compatibility, and the changes are not wide. But they still exist:

  • The existing parse_...() methods in the Parser class were all removed. Use the corresponding parse_...(Parser*, ...) function from the header c4/yml/parse.hpp (link valid after this branch is merged).
  • When instantiated by the user, the parser now needs to receive an EventHandlerTree object, which is responsible for building the tree. Although fully functional and tested, the structure of this class is still somewhat experimental and is still likely to change. There is an alternative event handler implementation responsible for producing the events for the YAML test suite in test/test_suite/test_suite_event_handler.hpp.
  • The declaration and definition of NodeType was moved to a separate header file c4/yml/node_type.hpp (previously it was in c4/yml/tree.hpp).
  • Some of the node type flags were removed, and several flags (and combination flags) were added.
    • Most of the existing flags are kept, as well as their meaning.
    • KEYQUO and VALQUO are now masks of the several style flags for quoted scalars. In general, however, client code using these flags and .is_val_quoted() or .is_key_quoted() is not likely to require any changes.

New type for node IDs

A type id_type was added to signify the integer type for the node id, defaulting to the backwards-compatible size_t which was previously used in the tree. In the future, this type is likely to change, probably to a signed type, so client code is encouraged to always use id_type instead of size_t, and specifically not to rely on the signedness of this type.
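
To show what "not relying on the signedness" can look like in client code, here is a small self-contained sketch. It does not use ryml: `ids_in_reverse` is a hypothetical helper, templated on the id type so the same loop can be checked with both an unsigned and a signed type. The reverse loop uses the `i-- > 0` idiom, which terminates in both cases, whereas the common `i >= 0` condition never terminates for an unsigned type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns the ids num_nodes-1, ..., 1, 0.
// The loop condition tests the value before the decrement, so it
// stops at zero regardless of whether IdType is signed or unsigned.
template<class IdType>
std::vector<IdType> ids_in_reverse(IdType num_nodes)
{
    std::vector<IdType> ids;
    for(IdType i = num_nodes; i-- > 0; )
        ids.push_back(i);
    return ids;
}
```

In real client code the template parameter would simply be id_type, so that a later change of the underlying integer type does not require touching the loop.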

Reference resolver is now exposed

The reference (i.e., alias) resolver object is now exposed in c4/yml/reference_resolver.hpp (link valid after this PR is merged). Previously this object was temporarily instantiated in Tree::resolve(). Exposing it now enables the user to reuse this object through different calls, saving a potential allocation on every call.
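
The allocation saving from reusing the resolver can be sketched with a toy object that keeps its working buffer between calls. This is not the ryml resolver: `ToyResolver`, its `scratch` member and the "count the `*` characters" logic are assumptions made only to show the reuse pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical resolver: owns a scratch buffer that survives between calls.
struct ToyResolver
{
    std::vector<std::size_t> scratch;

    // Records the position of every '*' (used here as a stand-in for an
    // alias marker) and returns how many were found.
    std::size_t resolve(const std::string &src)
    {
        scratch.clear(); // drops the elements but keeps the allocated capacity
        for(std::size_t i = 0; i < src.size(); ++i)
            if(src[i] == '*')
                scratch.push_back(i);
        return scratch.size();
    }
};
```

A resolver created once and passed to several calls only allocates when a later input needs more room than any earlier one, while a temporary created inside each call starts from an empty buffer every time.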

@biojppm biojppm force-pushed the newparser branch 3 times, most recently from 7fe1a7f to 6ebef32 Compare March 26, 2024 23:24
@biojppm biojppm force-pushed the newparser branch 7 times, most recently from 3d6b928 to 9e75661 Compare March 27, 2024 19:20

codecov bot commented Mar 27, 2024

Codecov Report

Attention: Patch coverage is 98.25480%, with 40 lines in your changes missing coverage. Please review.

Project coverage is 97.26%. Comparing base (620615f) to head (eaf9a24).
Report is 1 commit behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| src/c4/yml/tree.cpp | 95.95% | 8 Missing ⚠️ |
| src/c4/yml/parse_engine.hpp | 83.33% | 7 Missing ⚠️ |
| src/c4/yml/emit.def.hpp | 97.87% | 6 Missing ⚠️ |
| src/c4/yml/tree.hpp | 97.26% | 4 Missing ⚠️ |
| src/c4/yml/event_handler_tree.hpp | 98.90% | 3 Missing ⚠️ |
| src/c4/yml/reference_resolver.cpp | 98.10% | 3 Missing ⚠️ |
| src/c4/yml/detail/parser_dbg.hpp | 93.54% | 2 Missing ⚠️ |
| src/c4/yml/filter_processor.hpp | 99.09% | 2 Missing ⚠️ |
| src/c4/yml/tag.cpp | 99.16% | 2 Missing ⚠️ |
| src/c4/yml/common.cpp | 87.50% | 1 Missing ⚠️ |

... and 2 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #414      +/-   ##
==========================================
+ Coverage   96.75%   97.26%   +0.50%     
==========================================
  Files          22       33      +11     
  Lines        8449    10924    +2475     
==========================================
+ Hits         8175    10625    +2450     
- Misses        274      299      +25     


@biojppm biojppm mentioned this pull request Mar 27, 2024
@biojppm biojppm force-pushed the newparser branch 2 times, most recently from f5b0363 to f511768 Compare March 28, 2024 01:16
@biojppm biojppm force-pushed the newparser branch 3 times, most recently from e0ea012 to 667fafd Compare March 30, 2024 21:53
rewrite parser based on events, rewrite filtering
separating the filter code to a different class
wip
wip
wip
filter single quoted is working
refactor to filter processor wip
double quoted wip
double quoted wip
double quoted wip
double quoted wip
double quoted wip
double quoted wip
double quoted wip
double quoted seems to be working
double quoted wip
double quoted wip
double quoted wip
double quoted wip
double quoted wip
double quoted wip
double quoted working!
filter plain scalar wip
wip
filter plain scalar wip
wip
test filter processors
fix write in inplace::translate_esc
block literal wip
block literal wip
block literal wip
block literal wip
block literal wip
block literal wip
block literal working!
filter block folded wip
filter block folded wip
cleanup filter
filter locations are needed only for double quoted scalars
add FilterResult to encapsulate validity
prepare filter for using in parser
in-parser filtering wip
filter empty block literals
filter block folded ok
all filters working
moving filters to parse wip
fix block_folded
fixing block folded WIP
new filter: all tests passing!
fix sanitizer issues
refactor: harmonize parser filtering function names
wip ci fixes
coverage wip
filter arena no longer needed
double quoted filter wip
fix wip
fix wip
fix wip
wip: inplace mid-extending vs end-extending
all tests ok
wip
wip
wip2
wip
wip
wip doc
wip doc
wip anchor
fix newlines in emit of docs
wip ref
wip new parser
wip new parser
wip new parser
fix
wip new parser
wip new parser
wip new parser
wip new parser
wip new parser: tag directives
wip new parser: tag resolving
wip new parser: more sink edge cases
wip new parser: key containers working in the sink
prepare event sink stack
tree parse wip
cleanup event sink
tree parse wip
tree parse wip
tree parse wip
tree parse wip: now parsing simple flow seqs!
new parser wip: flow seqs: added anchor/ref parsing
new parser wip: seq flow goes on while there is a seq flow
new parser wip: seqimap events
new parser wip: seqimap parsing
new parser wip: now parsing flow maps!
wip
wip
new parser wip: block seqs wip
new parser wip: block maps wip
wip
wip
wip
map anchors ok
tags wip
anchors and tags now working
add tests for container keys
structure wip
key containers: working in events from yaml!
wip
wip
docs wip
qmrk wip
qmrk seq blck
qmrk wip
fix seqimap again
qmrk with tags
doc wip
doc wip
doc wip
doc wip
doc wip
doc wip
remove old parsing functions
fix
wip buffered events for container keys
ditto
ditto
ditto
ditto
container keys seem to be working
report error for container keys
flow key containers inside qmrk
remove unused functions
remove more unused functions
comments
wip
comments wip
wip
wip
wip
wip
most tests working
fix more tests
wip: refactor parser to not depend on tree
ditto
remove include dependencies
parser: do not use tree directly
fixes
fix annotations when starting child maps
more fixes
more fixes
more fixes
more fixes
block scalars
block scalars
fixes to scalars
wip
wip
wip
wip
add error location checks
wip
wip
sudden docs
sudden docs wip
sudden docs in block map/seq
first test cases for simple seq are working!
fixing test cases WIP
mark doc only on explicit docs or stream children
more progress
wip
wip
fixing indentless seqs wip
simple seqs are working!
nested_seqx2 working!
disable all un-refactored tests
fix empty_seq
fix empty map/file
empty scalar wip
fix empty scalars
fix test number
fix null vals and empty scalars
fix nested seq
map wip
map wip
fix maps!
fix nested maps!
fix map of seq
fix seq of map
fix sets
explicit key WIP
explicit key WIP
explicit key WIP
explicit key WIP
explicit keys working!
fix regressions
fix generic map seq tests
docs WIP
docs + indentation wip
remove unused functions
fix regressions
rename test_new_parser to test_parser_engine
docs working!
fix json
fix scalar names
anchors wip
anchors wip
anchors wip
anchors mostly working
anchors WIP
anchors/refs working!
move test lib files to a separate folder
tags wip
simple seq
simple seq
tag wip
tags working!
rename TestCase->TestCaseNode, into separate files
remove empty var
fix indentation
fix github_issues
fix github issues
single quoted wip
single quoted wip
single quoted is working!
double quoted wip
double quoted wip
fix plain scalar emit
literal scalar wip
literal scalar wip
literal scalar wip
literal scalar wip
literal scalar wip
move tags to separate source files
minor cleanup
block literal wip
block literal wip
add json parser
update benchmarks
improve json
fix compilation in clang
fix bm_emit
block literal wip
block literal wip
block literal wip
reference resolver
block literal wip
block literal working!
fix regressions
block folded wip
block folded wip
block folded wip
block folded wip
block folded wip
block folded wip
block folded wip
block folded wip: indented blocks
block folded wip
block folded wip
block folded wip
block folded working!
plain scalar wip
plain scalar wip
plain scalar working!
style wip
style wip
style wip
style wip
style WIP
scalar style wip
scalar style ok
fix regression of scalar plain
fix regression of double quoted wip
block literal wip (old)
double quoted wip
fix regression in double quoted
fix merge
add tests for merge
fix merge wip
fix vs compilation wip
parse overloads wip
parse overloads wip
parse overloads
fix merge for styles
fixes to quickstart wip
enable serialize test
improve test merge
fix test serialize
test tree wip
fix locations
test tree wip
test parser wip
fix test for yaml events (from tree)
refactor yaml event tests to use parameterized tests
event tests: use the scalar style information from the tree
event tests: use the container style information from the tree
event tests: working both from parser and tree
improve tag errors
fix tags wip
fix tags
fix bm
fix bm
fix test parser
fix tree wip
fix quickstart wip
fix test tree wip
fix some valgrind warnings
fix quickstart wip
fix tree & quickstart wip
fix docmaps with keyref as the first child
fix parsing into existing nodes
fix quickstart!
more fixes (~regressions from quickstart)
fix tool tests
fix test suite wip
fix test suite wip @215/1633
fix test suite wip @152/1633 91%
disable tests with container keys: 96/1633  94%
test suite wip
test suite parse: update missing errors
fix parsing of scalars starting with ?
fix skipping of whitespace in flow mode 47/1633 97%
fix missing anchor 45/1633 97%
fix neutral tag resolve 43/1633 97%
fix parse of yaml events 39/1633 98%
fix tags normalization 50/1633 97%
fix tags normalization 38/1633 98%
fix scalar with trailing colon : 36/1633 98%
exempt more missing errors. 32/1633 98%
30/1633 98%
22/1633 99%
18/1633 99%
backspace in dquo. 16/1633 99%
8/1633 99%
7/1633 99%
6/1633 99%
3/1633 99%
100% pass!
adding events parser to test suite and events tool
sneaky block container keys WIP
cleanup yaml-events
fix warning
wip
fix block key containers
test suite: fix event emitting WIP
100% tests pass!
fix missing doc UKK6
test suite: add tests comparing reference events and emitted events WIP
test suite: fix comparison of emitted events
100% test pass
enable tests for key containers. 100% pass!
enable error tests for event emitter. 100% pass!
update test suite exclusions
[refac] split event handlers
[fix] compilation in windows
windows exports
fix wip
wip
wip
wip
tab tokens working!
fix NodeType::operator== ambiguity in C++20
clean up test names
cover json as much as possible in the tests
fix the difficult failure in vs-x86-release builds
ensure json is tested in the test groups
fix some problems with the declaration/definition of test groups
minor cleanup in json emit
parser cleanup wip
cleanup and improve coverage
cleanup and improve coverage
cleanup and improve coverage
cleanup and improve coverage
wip cleanup and coverage
wip cleanup and coverage
style is no longer tagged WIP
tidy style API
ensure tree assertions go through the tree's callbacks
style API
bm wip
bm wip
changelog
tidy type+style predicates
add id_type to take place as the new type for node ids
update benchmarks
WIP fix warnings when the id_type is signed 32 bit
wip
wip [ci skip]
woops
wip [ci skip]
add test to ensure #422
fix rebase problem
fix noderef tests which were optimized
github workflows: update checkout version
add some more plain scalar tests
add yamlscript like test
quickstart: call sample_tags/directives on the proper place
add test for 379
update docs post rebase
fix rebase problems and update docs
test parse engine: fix gcc4.8 not accepting C++11 raw strings as macro args
investigating gcc x86 release failures
fix gcc x86 release failures (?)
gcc x86 release failures: cleanup print
update c4core
update swig interface
fix benchmark workflow
improve coverage
improve error logging functions
annotate unreachable to prevent error in visual studio
improve coverage
split event stack wip
split event stack wip
split event stack wip
tidy up some defines, and improve the dump function
emit: disable uncovered statements
@biojppm biojppm changed the title Newparser WIP New parser: event-based policy May 5, 2024
@biojppm biojppm force-pushed the newparser branch 3 times, most recently from 8ce0671 to 91ecfd6 Compare May 6, 2024 00:24
@biojppm biojppm merged commit 6e396b2 into master May 8, 2024
246 checks passed
@biojppm biojppm deleted the newparser branch May 8, 2024 08:21
biojppm added a commit that referenced this pull request May 18, 2024
biojppm added a commit that referenced this pull request May 19, 2024