Skip to content

Commit 0e82de1

Browse files
committed
Added swish notebook of minimal minCon work
1 parent 6e92de3 commit 0e82de1

File tree

1 file changed

+351
-0
lines changed

1 file changed

+351
-0
lines changed

limited_example_parsing.swinb

Lines changed: 351 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,351 @@
1+
<div class="notebook">
2+
3+
<div class="nb-cell markdown" name="md2">
4+
# Minimal Contract Language | minCon
5+
6+
The following describes first developments of minCon, as separate parts. More work would likely be needed to integrate them, but in any case before it works on a full document or has better debugging output, for debugging purposes and for the interested reader to verify and test, separate parts are much more practical. This is because if for example the full document is tested, but there is an error in sentence 10, even with thoughtful use of breakpoints and the trace functionality, reaching a problematic section can take quite some time.
7+
The work on minCon was inspired by Lexon, Logical English and other similar projects, but is more restricted and not compatible with any of them.
8+
9+
## Certain Restrictions
10+
11+
&gt; This list is probably incomplete
12+
13+
### Temporary
14+
* For now, separate parts are put separately that would normal be combined, as to save time on debugging integration errors.
15+
* A structure for putting together a solidity file from a parse tree exists, but it might be incomplete and it is untested as this would need a complete parse tree or manual work.
16+
17+
### Language or Example Restrictions
18+
* Most sentences have to follow a simple `subject, verb, object` pattern. This could easily be extended to allow for more complicated constructions, but for now this is a hard limit to limit the scope for this work.
19+
* No relative clauses / commas – adds verbosity and feels less natural but simplifies parsing a lot.
20+
* For now, variables are not allowed to have spaces. (will change later)
21+
* Once Solidity code is generated, no security guarantees are made.
22+
23+
### Important
24+
* It is important to know whether you are working with character codes or strings. As this is used in many places and is probably more accurate, current prolog developers seem to recommend this. But, especially during the tokenizing phase, this can be quite confusing, as evident below where `X` is often the text content, and this is displayed as character codes.
25+
</div>
26+
27+
<div class="nb-cell markdown" name="md3">
28+
### Tokenizer
29+
30+
The code below is code for a specialized tokenizer. It is currently not used in the further steps, as due to time constraints for now the existing tokenizer module was used as it suits most requirements. In the future, it could make later parsing more straightforward, by for example already tagging use of variables during tokenizing.
31+
</div>
32+
33+
<div class="nb-cell program" data-background="true" name="p1">
34+
% Custom tokenizer. Inspired by tokenize module by Anne Ogborn and Shon Feder for learning and the basics.
35+
:- set_prolog_flag(back_quotes, codes).
36+
:- use_module(library(dcg/basics)).
37+
38+
% test with : tokenizeT(Text, Tokens). (loads testtext from below)
39+
40+
tokenizeT(Text, Tokens) :-
41+
testfile(Text), phrase(tokens(Tokens), Text).
42+
43+
tokgram(T) --&gt; tokens(T).
44+
tokens(T) --&gt; token(T), eos, !.
45+
tokens([Th|Tt]) --&gt; token(Th), tokens(Tt).
46+
token(W) --&gt; word(W), eos, !.
47+
token(W), P --&gt; word(W), fst(P). %
48+
token(W), ` ` --&gt; word(W), ` `.
49+
token(W), C --&gt; word(W), (punct(C) ; cntrl(C) ; nasciis(C)).
50+
token(T) --&gt; space(T) ; hsep(T) ; comma(T), end(T) ; variable(T) ; number(T).
51+
token(P) --&gt; punct(P) ; cntrl(P).
52+
53+
hsep(T) --&gt; ":", {T = hsep(":")}.
54+
fst(T) --&gt; ".", {T = fst(".")}.
55+
comma([T]) --&gt; ",", {T = comma((,))}.
56+
space(T) --&gt; " ", {T = space}.
57+
end(T) --&gt; "\n", {T = end(end)}.
58+
variable(V) --&gt; string(`"`, `"`, V).
59+
60+
% nasciis, space/punct and escape were mostly copied as they are used by other parts.
61+
% should be possible to remove at some point though
62+
% non ascii's
63+
nasciis([C]) --&gt; nascii(C), eos, !.
64+
nasciis([C]),[D] --&gt; nascii(C), [D], {D &lt; 127}.
65+
nasciis([C|Cs]) --&gt; nascii(C), nasciis(Cs).
66+
nascii(C) --&gt; [C], {C &gt; 127}.
67+
%space --&gt; [S], {code_type(S, white)}.
68+
punct([P]) --&gt; [P], {code_type(P, punct)}.
69+
cntrl([C]) --&gt; [C], {code_type(C, cntrl)}.
70+
' ' --&gt; space.
71+
' ' --&gt; space, ' '.
72+
73+
% This is similar to csyms used in tokenize but was adapted
74+
word(W) --&gt; chars(Cl), {string_codes(W, Cl)}.
75+
chars([C]) --&gt; char(C).
76+
chars([Wh|Wt]) --&gt; char(Wh), chars(Wt).
77+
char(C) --&gt; [C], {code_type(C, csym)}.
78+
79+
% the strings part was copied from tokenize module as it does exactly what was required.
80+
% maybe it could be reduced to be less similar
81+
string(OpenBracket, CloseBracket, S) --&gt; string_start(OpenBracket, CloseBracket, S).
82+
83+
% A string starts when we encounter an OpenBracket
84+
string_start(OpenBracket, CloseBracket, Cs) --&gt;
85+
OpenBracket, string_content(OpenBracket, CloseBracket, Cs).
86+
87+
% String content is everything up until we hit a CloseBracket
88+
string_content(_OpenBracket, CloseBracket, []) --&gt; CloseBracket, !.
89+
% Part about string content unitl close bracket before escape removed
90+
% String content includes any character that isn't a CloseBracket or an escape.
91+
string_content(OpenBracket, CloseBracket, [C|Cs]) --&gt;
92+
[C],
93+
{[C] \= CloseBracket},
94+
string_content(OpenBracket, CloseBracket, Cs).
95+
96+
</div>
97+
98+
<div class="nb-cell markdown" name="md4">
99+
### Medium Length Example
100+
101+
This is a medium length example, limited to header, terms, contracts and one clause. It is used in some examples below where it is referenced by `testfile`. For the more detailed examples it was not used, as. debugging got then very difficult.
102+
</div>
103+
104+
<div class="nb-cell program" data-background="true" name="p3">
105+
testfile(`NAME: Evaluation License System.
106+
VERSION: 1
107+
AUTHOR: FLORIAN IDELBERGER
108+
PREAMBLE: This is a licensing contract for a software evaluation.
109+
TERMS:
110+
"Licensor" is a person.
111+
"Arbiter" is a person.
112+
"License" is this contract.
113+
"Licensing Fee" is an amount.
114+
"Breach Fee" is an amount.
115+
The Licensor appoints the Arbiter.
116+
The Licensor fixes the Licensing Fee.
117+
The Licensor fixes the Breach Fee.
118+
119+
CONTRACTS per Licensee:
120+
"Description of Goods" is a text.
121+
"Licensee" is a person.
122+
"Paid" is a binary.
123+
The Licensor fixes the Description of Goods.
124+
CLAUSE: Pay
125+
The Licensee pays the Licensing Fee to the Licensor,
126+
and pays the Breach Fee into escrow.
127+
The License is therefore Paid.`).
128+
</div>
129+
130+
<div class="nb-cell markdown" name="md5">
131+
The query below just tokenizes the above text. It is a lot of output, but it can show what kind of tokens are generated and how/if the text from above ends up in the corresponding variable.
132+
</div>
133+
134+
<div class="nb-cell query" name="q1">
135+
testfile(Text), tokenize(Text, Tokens).
136+
</div>
137+
138+
<div class="nb-cell markdown" name="md6">
139+
### Section Parsing
140+
141+
During the testing, research and development of different approaches to tokenizing, parsing and usage of DCGs in general, it became apparent that at least for my Prolog skills at the time, a top down approch with trying to define everything from the top down at once proved very difficult to me. This would be more easily possible, if it was iterated step by step, starting at the top without details, then breaking things down further bit by bit. To this end, this next code at first just parses sections, and then their contents. The contents are parsed with a generic predicate matching any amount of list items until the next rule. Especially on more complicate grammars, this breaks down quickly, and is unspecific and not pretty, but it works well for relatively simple parts without needing a separate rule.
142+
143+
&gt; TODO-NEXT: re-combine tokens of f.e. name to string
144+
145+
</div>
146+
147+
<div class="nb-cell program" data-background="true" name="p2">
148+
:- use_module(library(tokenize)).
149+
:- set_prolog_flag(back_quotes, codes).
150+
:- include(gvtree).
151+
152+
main(D) :-
153+
testfile(X), parseOne(X, D).
154+
155+
lexing(Source, Tokens) :-
156+
tokenize(Source, Tokens, [cased(false), spaces(false)]).
157+
158+
parseOne(X, D) :-
159+
lexing(X, T), phrase(doc(D), T).
160+
161+
doc(D) --&gt; sections(S), {D = lexdoc(S)}.
162+
sections([Sh|St]) --&gt; section(Sh), sections(St).
163+
sections(_S) --&gt; [].
164+
section(St) --&gt; [word(name), punct((:))], list(R), {St = docname(R)}.
165+
section(St) --&gt; [word(version), punct((:))], list(R), {St = version(R)}.
166+
section(St) --&gt; [word(author), punct((:))], list(R), {St = auth(R)}.
167+
section(St) --&gt; [word(preamble), punct((:))], list(R), {St = pre(R)}.
168+
section(St) --&gt; [word(terms), punct((:))], list(R), {St = terms(R)}.
169+
section(St) --&gt; [word(contracts)], list(N), [punct((:))], list(R), {St = contracts(N, R)}.
170+
section(St) --&gt; [word(clause), punct((:))], list(R), {St = clause(R)}.
171+
172+
list([]) --&gt; [].
173+
list([L|Ls]) --&gt; [L], list(Ls).
174+
175+
</div>
176+
177+
<div class="nb-cell markdown" name="md8">
178+
In the section above, there is a lexing predicate and a parsing predicate which are called by main/1. The sections are searched recursively, until none are left, defined by a keyword each and a `:`. The result of each section is then constructed into a compound term, as this makes it easier to construct and draw a nice tree. Alternatively dcg4pt by Falco Nogatz et al. could be used.
179+
This can then be queried like below, with the main/1 predicate creating the section tree, which is then displayed by graphiz. For use without the tree drawing predicate, it can be commented out with a `%` in front of it.
180+
In the tree below, sections, variables and all tokens from the first step are represented specifically, while other words are just referenced by `word(&lt;Wordname&gt;)`.
181+
</div>
182+
183+
<div class="nb-cell query" name="q2">
184+
main(D), gvtree(D, T).
185+
</div>
186+
187+
<div class="nb-cell markdown" name="md9">
188+
### Alternative Section Tree
189+
190+
The code below takes the previous tree and puts it into new predicates. This was mainly done as an exercise to test the parsing of the tree structure, and to ease testing further parsing on just one section, without manually copying and supplying a tokenlist.
191+
</div>
192+
193+
<div class="nb-cell program" data-background="true" name="p4">
194+
hh(L, V, A, P, T, CN, C, CC, PP) --&gt; lll(L), vvv(V), aaa(A), ppp(P), ttt(T), ccc(CN, C), clauses(CC), {PP = hh(L, V, A, P, T, CN, C, CC)}.
195+
lll(L) --&gt; [docname(L)].
196+
vvv(V) --&gt; [version(V)].
197+
aaa(A) --&gt; [auth(A)].
198+
ppp(P) --&gt; [pre(P)].
199+
ttt(T) --&gt; [terms(T)].
200+
ccc(CN, C) --&gt; [contracts(CN, C)].
201+
202+
clauses([]) --&gt; []. % this is wrong but w/o it loops/exceeds stack.
203+
clauses(CH) --&gt; cclause(X), clauses(CC), {CH = [X|CC]}.
204+
cclause(X) --&gt; [clause(X)].
205+
206+
parseTwo(Token2, Result) :-
207+
arg(1, Token2, E), !, phrase(hh(L, V, A, P, T, CN, C, CC, PP), E).
208+
parsemain(L, V, A, P, T, CN, C, CC, PP) :-
209+
testfile(X), parseOne(X, D), arg(1, D, E), !, phrase(hh(L, V, A, P, T, CN, C, CC, PP), E), !, gvtree(PP, TT).
210+
</div>
211+
212+
<div class="nb-cell markdown" name="md10">
213+
The benefit descrived above can be seen below - all section are given as separate variables with corresponding tokenlists. For a complete application this might not be necessary - but it makes it much easier to evaluate at a glance if one part looks okay, as compared to the packed tree structure above.
214+
</div>
215+
216+
<div class="nb-cell query" name="q3">
217+
% remove outer element
218+
%testfile(X), parseOne(X, D), arg(1, D, E), unpack(E). %, gvtree(hh(L, V, A, P, T, CN, C, CC), T).
219+
testfile(X), parseOne(X, D), arg(1, D, E), !, phrase(hh(L, V, A, P, T, CN, C, CC, PP), E), !, gvtree(PP, TT).
220+
</div>
221+
222+
<div class="nb-cell markdown" name="md11">
223+
### Intermediate Attempt
224+
225+
For an intermediary result, the test text was reduced further, keeping just the header and two terms. The rules that this was tested with, were removed here. This is just kept for documentation or for potential future development, as for the most explicit parsing below, this was reduced further.
226+
</div>
227+
228+
<div class="nb-cell program" name="p5">
229+
testmin(`NAME: Evaluation License System.
230+
VERSION: 1
231+
AUTHOR: FLORIAN IDELBERGER
232+
PREAMBLE: This is a licensing contract for a software evaluation.
233+
TERMS:
234+
"Licensor" is a person.
235+
The Licensor fixes the Breach Fee.`).
236+
testvar(`"Licensor" is a person.`).
237+
parsemin(L, V, A, P, T, CN, C, CC, PP) :-
238+
testmin(X), parseOne(X, D), arg(1, D, E), !, phrase(hh(L, V, A, P, T, CN, C, CC, PP), E), !, gvtree(PP, TT).
239+
</div>
240+
241+
<div class="nb-cell query" name="q5">
242+
parsemin(L, V, A, P, T, CN, C, CC, PP), selectchk(cntrl('\n'), T, T2), trace, phrase(termsparse(TP), T2), write(TP).
243+
</div>
244+
245+
<div class="nb-cell markdown" name="md1">
246+
### Sentence parsing
247+
248+
After many trials and errors, to make progress on the actual parsing of sentences, the test text was reduced further to two sentences, where the first defines a variable and a type, and the second specifies an action by the first. These were selected as they exemplify two common actions, the variable definition and the assignment of said variable. This simplified debugging with standard prolog tools immensely and quickly showed results.
249+
It should in principle also work for more than the two example terms/sentences, as sentences are parsed until none are left.
250+
251+
A major feature also is that the variables that are defined by `"X"` are properly kept track of and recognized when they are used elsewhere. This required a separate variable to pass state (in this case the state of the variables) around between rules. This is a key requirement for being a usable programming language.
252+
253+
As a further temporary measure, breachfee was defined as a separate rule in a fixed way, as its definition was not included in the test sentences, but all sentences are dependent on each other and any extra sentences would have made debugging harder. In a more complete implementation, this crutch would be removed.
254+
</div>
255+
256+
<div class="nb-cell program" name="p6">
257+
:- set_prolog_flag(back_quotes, codes).
258+
:- use_module(library(tokenize)).
259+
:- include(gvtree).
260+
%:- use_module(library(dcg/basics)).
261+
262+
lexing(Source, Tokens) :-
263+
tokenize(Source, Tokens, [cased(false), spaces(false)]).
264+
265+
testvar(`"Licensor" is a person.
266+
The Licensor fixes the breachfee.
267+
`).
268+
sents([], Vars).
269+
sents([SentsH|SentsT], Vars) --&gt; sent(SentsH, Vars), sents(SentsT, Vars).
270+
sents([SentsH|SentsT], Vars) --&gt; sent(SentsH, Vars).
271+
sent(Sent, Vars) --&gt; subject(S, Vars), verb, object(VU, Vars), end, {Sent = sent(subject(S), verb, object(VU))}.
272+
article --&gt; [word(a)] ; [word(the)].
273+
subject(S, Vars) --&gt; vardefinition(S, Vars).
274+
subject(S, Vars) --&gt; article, variable_use(S, Vars).
275+
vardefinition(S, Vars) --&gt; {var(Vars)}, [string(S)], {Vars = [S]}.
276+
vardefinition(S, Vars) --&gt; {write(Vars), nonvar(Vars), duplicate_term(Vars, VarsN)}, [string(S)], {Vars = [S|VarsN]}.
277+
type(T, Vars) --&gt; (([word(binary)], {T = type(binary)}); ([word(person)], {T = type(person)})).
278+
279+
% does not support spaces yet
280+
variable_use(VU, Vars) --&gt; [word(X)], {member(X, Vars), VU = varu(X)}.
281+
verb --&gt; ([word(is)] ; [word(fixes)]).
282+
end --&gt; [punct(('.')), cntrl('\n')].
283+
end --&gt; [punct(('.'))].
284+
object(T, Vars) --&gt; article, type(T, Vars).
285+
object(VU, Vars) --&gt; article, variable_use(VU, Vars).
286+
object(S, Vars) --&gt; article, fee(S, Vars).
287+
% this is only a crutch as breach fee is undefined in this example.
288+
% (also breached should be lower case, but somehow tokenize did not uncase it
289+
fee(F, Vars) --&gt; [word(breachfee)], {F = fee(breachfee, Vars)}.
290+
</div>
291+
292+
<div class="nb-cell markdown" name="md12">
293+
### Sentence Tree
294+
Anyway, the query below will tokenize the above two sentences, and parse them with the grammar above.
295+
This defines sentences made up of one or more sentences, which are then made up of subject, verb and object. In this case there are only two verbs, parsed to the same action (as it is close enough to infer). The subject can either be a defintion of a variable, or the usage of a variable. In the former case, the object then defines its type as necessary for some possible target languages such as Solidity. In the latter case, it is defined how the value of a variable is set. In future versions it could also be set directly.
296+
297+
In any case, this limited example also gives a nice, small parse tree of the sentences, where it is much more easily visible what is going on than if this was taken of the whole document.
298+
299+
Please note: In the tree, `sent` is short for sentence, and `varu` is variable use.
300+
</div>
301+
302+
<div class="nb-cell query" name="q6">
303+
testvar(X), lexing(X, T), phrase(sents(S, V), T), gvtree(S, Tree).
304+
</div>
305+
306+
<div class="nb-cell markdown" name="md13">
307+
Further integration and extension is easily possible at this point. Ideally, the rules and length of the document should be extended slowly or always accompanied by a smaller testing document, as to keep debugging time short. Alternatively, it is possible to include debugging options directly into a DCG grammar, but this is not simple or at least not explained in most books about Prolog (not even in 'Art of Prolog' which goes into quite some depth) and would necessitate further study.
308+
</div>
309+
310+
<div class="nb-cell markdown" name="md7">
311+
# GraphizTree
312+
313+
This part below is necessary for drawing trees based on parse trees. It should be pasted in a separate tab inside swish or in a separate file and is then included in the places where it is used via `:- include(gvtree).` - this code was helpfully provided by a prolog community member. &lt;check who&gt;
314+
</div>
315+
316+
<div class="nb-cell program" name="p7">
317+
:- use_rendering(graphviz).
318+
319+
tree(Compound, Root, Options0, Options) --&gt;
320+
{ compound(Compound), !,
321+
atom_concat(n, Options0.id, Root),
322+
compound_name_arguments(Compound, Name, Arguments),
323+
format(string(Label), '~q', [Name]),
324+
ID1 is Options0.id+1
325+
},
326+
[node(Root, [label=Label])],
327+
children(Arguments, Root, Options0.put(id, ID1), Options).
328+
tree(Any, Leaf, Options0, Options) --&gt;
329+
{ atom_concat(n, Options0.id, Leaf),
330+
ID1 is Options0.id+1,
331+
any_label(Any, Label, Color),
332+
Options = Options0.put(id, ID1)
333+
},
334+
[ node(Leaf, [label=Label, shape=none, fontcolor=Color]) ].
335+
336+
any_label(Any, Label, red4) :-
337+
var(Any), !, Label = Any.
338+
any_label(Any, Label, blue) :-
339+
format(string(Label), '~p', [Any]).
340+
341+
children([], _, Options, Options) --&gt; [].
342+
children([H|T], Parent, Options0, Options) --&gt;
343+
[ Child -&gt; Parent ],
344+
tree(H, Child, Options0, Options1),
345+
children(T, Parent, Options1, Options).
346+
347+
gvtree(Term, digraph([rankdir='BT',size=5|Statements])) :-
348+
phrase(tree(Term, _, _{id:1}, _), Statements).
349+
</div>
350+
351+
</div>

0 commit comments

Comments
 (0)