Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Properly manage lists #9

Open
Seb35 opened this issue Jul 29, 2018 · 1 comment
Open

Properly manage lists #9

Seb35 opened this issue Jul 29, 2018 · 1 comment

Comments

@Seb35
Copy link
Member

Seb35 commented Jul 29, 2018

The conversion to PEGs becomes difficult because we have to choose a (good) design to manage lists and combine it (properly) with hierarchical items. I mean expressions like "Au deuxième alinéa, à la troisième phrase du quatrième alinéa". A good real crash test is http://www.assemblee-nationale.fr/15/textes/0911.asp#D_Article_11.

Legacy DuraLex creates this tree:

{"type": "alinea-reference", "order": 2, children: [
  {"type": "sentence-reference", "order": 3, children: [
    {"type": alinea-reference", "order": 4}]}]}

The first version of ToSemanticTreeVisitor (now in its own file) creates this tree:

{"type": "alinea-reference", "order": 2},
{"children": [
  {"type": "sentence-reference", "order": 3},
  {"type": "alinea-reference", "order": 4}]}

With b2f13ab I try a new design for ToSemanticTreeVisitor, it works at small scale given my experiments, it creates:

{"type": "alinea-reference", "order": 2},
{"type": "alinea-reference", "order": 4, "children": [
  {"children": [{"type": "sentence-reference", "order": 3}]}]}

There is currently a small (easy to solve) issue because it creates an untyped container node (it shouldn’t for a sigle child). I didn’t try at a larger scale.

I have no precise idea if this design is a good design. The difficulty is to arbitrate between creating a flat list and/or a hierarchical tree. Another possible design would be to take into account the canonical hierarchy (word < sentence < alinea < article) during merge operation of child (Parsimonious) nodes.

I think a somehow good hierarchy is needed during parsing, before any DuraLex visitor since they cannot re-create some missing information. But probably some visitors will need to be adapted to take into account both flat lists and hierarchical items.

@Seb35
Copy link
Member Author

Seb35 commented Jul 29, 2018

@promethe42: it would be great if you can study it and we discuss
@mdamien: FYI

Also, I had to manage this same issue for alineas in metslesliens. It was a bit easier because the space is smaller than in DuraLex. I solved it with a mechanism accumulate-collect in each scale, and the depth of this structure is 3. The rules are defined here (with their behaviour related to other rules) and the Parsimonious visitor is here.

Seb35 added a commit that referenced this issue Dec 31, 2018
… instead of one 'container' child with multiple children

Previously when some rule was found multiple times (e.g. quote),
ToSemanticTreeVisitor created an untyped container with multiple children,
which was likely to create errors because first it was untyped (which is
unusual), second because no DuraLex visitor expect to find such container
with multiple children but expect to find directly multiple children, and
third it is ugly.

On the other side, ToSemanticTreeVisitor needs a DuraLex container node when
a Parsimonious node contains multiple DuraLex nodes during half-read
operations.

Now with this commit, DuraLex container nodes are collected as soon as an
upper DuraLex node is able to contain them, and in this case the children
of this container node are directly put as children of the DuraLex parent
node. In the case the parent Parsimonious node is still not a DuraLex node
all container nodes and non-container nodes are flatten in a container node
node. A (welcomed) side-effect of this new behaviour is that there is no
more danger to see container-of-containers DuraLex nodes given they are
always flatten. At the upper side, when a local DuraLex tree is attached
to the global DuraLex tree, container nodes are again flatten. With this
mechanism, the global DuraLex tree should never have any container nodes,
they should only remain in half-read trees in ToSemanticTreeVisitor.

To be still able to continue parsing even if it remains container nodes,
they now have the type 'parsimonious-list-container'. Obviously if such
nodes are found, ToSemanticTreeVisitor will need to be debugged.

As a result of this operation, the DuraLex trees created by this visitor
are a bit different than previous one in the case of multiple values. For
instance the following amendment:
  À l’article L100-3 du code de l’énergie est ajouté deux phrases ainsi rédigées :
  "Phrase 1."
  "Phrase 2."
was previously the following DuraLex tree:
  type: edit; editType: replace
  ╠═ type: code-reference; id: code de l'énergie
  ║  ╚═ type: article-reference; id: L100-3
  ╠═ type: sentence-reference
  ║  ╚═ type: quote; words: Phrase 1.
  ╚═ type: sentence-definition
     ╚═ type: quote; words: Phrase 2.
and it is now:
  type: edit; editType: replace
  ╠═ type: code-reference; id: code de l'énergie
  ║  ╚═ type: article-reference; id: L100-3
  ╚═ type: sentence-reference; count: 2
     ╠═ type: quote; words: Phrase 1.
     ╚═ type: quote; words: Phrase 2.
I find it is even more meaningful given the number of sentences matches the
number of quotes. (I know that in some amendments both sentences are grouped
in the same quote, I almost find they could be splitted by a further
visitor, to be discussed and evaluated.) From a reusing point of view (in
SedLex, both could be easily read (currently SedLex only read the last one
in both cases, I will push shortly a light change so that SedLex will always
show all quotes).

Issue: #9
Seb35 added a commit that referenced this issue Dec 31, 2018
About the recent change with lists. We are now with 13 failing tests
instead of 23.

Also, added a property 'count' with value 1 when it is known that there
is one alinea or one sentence.

Also converted some u'' to '' in tests given we are Python3; and
reordered some of the properties in tests, I find a canonical order
'type', then other properties then 'children' is easier to read.

Issue: #9
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant