-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Properly manage lists #9
Comments
@promethe42: it would be great if you can study it and we discuss Also, I had to manage this same issue for alineas in |
… instead of one 'container' child with multiple children Previously when some rule was found multiple times (e.g. quote), ToSemanticTreeVisitor created an untyped container with multiple children, which was likely to create errors because first it was untyped (which is unusual), second because no DuraLex visitor expect to find such container with multiple children but expect to find directly multiple children, and third it is ugly. On the other side, ToSemanticTreeVisitor needs a DuraLex container node when a Parsimonious node contains multiple DuraLex nodes during half-read operations. Now with this commit, DuraLex container nodes are collected as soon as an upper DuraLex node is able to contain them, and in this case the children of this container node are directly put as children of the DuraLex parent node. In the case the parent Parsimonious node is still not a DuraLex node all container nodes and non-container nodes are flatten in a container node node. A (welcomed) side-effect of this new behaviour is that there is no more danger to see container-of-containers DuraLex nodes given they are always flatten. At the upper side, when a local DuraLex tree is attached to the global DuraLex tree, container nodes are again flatten. With this mechanism, the global DuraLex tree should never have any container nodes, they should only remain in half-read trees in ToSemanticTreeVisitor. To be still able to continue parsing even if it remains container nodes, they now have the type 'parsimonious-list-container'. Obviously if such nodes are found, ToSemanticTreeVisitor will need to be debugged. As a result of this operation, the DuraLex trees created by this visitor are a bit different than previous one in the case of multiple values. For instance the following amendment: À l’article L100-3 du code de l’énergie est ajouté deux phrases ainsi rédigées : "Phrase 1." "Phrase 2." was previously the following DuraLex tree: type: edit; editType: replace ╠═ type: code-reference; id: code de l'énergie ║ ╚═ type: article-reference; id: L100-3 ╠═ type: sentence-reference ║ ╚═ type: quote; words: Phrase 1. ╚═ type: sentence-definition ╚═ type: quote; words: Phrase 2. and it is now: type: edit; editType: replace ╠═ type: code-reference; id: code de l'énergie ║ ╚═ type: article-reference; id: L100-3 ╚═ type: sentence-reference; count: 2 ╠═ type: quote; words: Phrase 1. ╚═ type: quote; words: Phrase 2. I find it is even more meaningful given the number of sentences matches the number of quotes. (I know that in some amendments both sentences are grouped in the same quote, I almost find they could be splitted by a further visitor, to be discussed and evaluated.) From a reusing point of view (in SedLex, both could be easily read (currently SedLex only read the last one in both cases, I will push shortly a light change so that SedLex will always show all quotes). Issue: #9
About the recent change with lists. We are now with 13 failing tests instead of 23. Also, added a property 'count' with value 1 when it is known that there is one alinea or one sentence. Also converted some u'' to '' in tests given we are Python3; and reordered some of the properties in tests, I find a canonical order 'type', then other properties then 'children' is easier to read. Issue: #9
The conversion to PEGs becomes difficult because we have to choose a (good) design to manage lists and combine it (properly) with hierarchical items. I mean expressions like "Au deuxième alinéa, à la troisième phrase du quatrième alinéa". A good real crash test is http://www.assemblee-nationale.fr/15/textes/0911.asp#D_Article_11.
Legacy DuraLex creates this tree:
The first version of ToSemanticTreeVisitor (now in its own file) creates this tree:
With b2f13ab I try a new design for ToSemanticTreeVisitor, it works at small scale given my experiments, it creates:
There is currently a small (easy to solve) issue because it creates an untyped container node (it shouldn’t for a sigle child). I didn’t try at a larger scale.
I have no precise idea if this design is a good design. The difficulty is to arbitrate between creating a flat list and/or a hierarchical tree. Another possible design would be to take into account the canonical hierarchy (word < sentence < alinea < article) during merge operation of child (Parsimonious) nodes.
I think a somehow good hierarchy is needed during parsing, before any DuraLex visitor since they cannot re-create some missing information. But probably some visitors will need to be adapted to take into account both flat lists and hierarchical items.
The text was updated successfully, but these errors were encountered: