Commit 88599b9

DiMLex 1.0 -- major extension (added ~100 connectives) and validation
1 parent befcf0f commit 88599b9

3 files changed

+9390
-4686
lines changed

DimLex-documentation.md

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
1+
# DiMLex - Structure
2+
3+
The lexicon is constructed of a number of lexicon entries, numbered by an 'id'.
4+
5+
Schematically, each entry consists of the following data fields:
6+
7+
**`<orth/>`** List of orthographic variants for this entry. One variant is marked as the 'canonical' spelling.
8+
9+
A connective can be a 'phrasal' or a 'single' item; phrasal connectives can furthermore be 'cont'inuous or 'discont'inuous.
10+
11+
**`<disambi/>`** Information on whether this form is ambiguous between a connective and a non-connective reading (`<conn_d/>`) and whether it has different semantic readings (`<sem_d/>`).
12+
13+
**`<conn_disambi/>`** Disambiguation rules that distinguish between connective/non-connective usages.
14+
15+
This information is not included in the current release of DiMLex.
16+
17+
**`<focuspart/>`** Whether or not this connective allows for associated focus particles.
18+
19+
**`<correlate/>`** Some connectives frequently appear with "correlates" to mark a discourse relation.
20+
21+
It is noted whether the connective is a correlate (`<is_correlate/>`) of another one or whether it can have correlates (`<has_correlate/>`).
22+
23+
**`<non_conn_reading/>`** Examples and possible POS tags of a usage of this item in its non-connective reading.
24+
25+
**`<stts/>`** Common POS tags, with examples and corpus frequencies.
26+
27+
**`<syn/>`** Syntactic and semantic information on this connective.
28+
29+
For ambiguous connectives that are also syntactically ambiguous, several `<syn/>` blocks are allowed.
30+
31+
The syntax block is further divided into the following components:
32+
33+
* **`<cat/>`** Syntactic category.
34+
35+
For German, one of:
36+
37+
- *konnadv* (adverb),
38+
- *padv* (adverb with prepositional part),
39+
- *konj* (coordinating conjunction), 'und'
40+
- *subj* (subordinating conjunction), 'weil', 'obwohl' (requires verb-final complement, can be moved in matrix clause)
41+
- *v2emb* (v2-embedder), 'vorausgesetzt' (V2 complement, but the embedded clause can be moved in the matrix clause)
42+
- *postp* (postponer), 'weshalb' (verb-final complement, cannot be moved in matrix clause)
43+
- *appr* (preposition), 'anstatt'
44+
- *appo* (postposition), 'wegen'
45+
- *apci* (circumposition), 'um ... willen'
46+
- *einzel* (isolated), 'dass'
47+
48+
* **`<integr/>`** For adverbs, the syntactic positions in which they can occur (traditional German 'Felder' syntax).
49+
50+
51+
This information can be provided upon request!
52+
53+
* **`<ordering/>`** Options for the linear order of arguments arg1 and arg2: *ante*, *post*, *insert*, and/or *desintegr*.
54+
55+
* **`<sem/>`** Information on the coherence relation(s) expressed by the connective.
56+
57+
This information is not included in the current release of DiMLex.
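
As a rough illustration of how the fields described above might be read programmatically, the following Python sketch parses one made-up entry with the standard library and collects its orthographic variants, syntactic category, and ordering options. The element and attribute names come from the documentation and from `DimLex.dtd`. Everything else is an assumption made for the example: the `summarize` helper, the ids, the `"1"`/`"0"` flag values, the `type` labels, the POS tag, the frequency, and the example sentence are not taken from the lexicon.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry: tag and attribute names follow DimLex.dtd, but every
# value (ids, "1"/"0" flags, type labels, POS tag, frequency, sentence)
# is invented for illustration.
SAMPLE = """
<dimlex>
  <entry id="k1" word="weil">
    <orths>
      <orth canonical="1" onr="k1o1" type="cont">
        <part type="single">weil</part>
      </orth>
      <orth canonical="0" onr="k1o2" type="cont">
        <part type="single">Weil</part>
      </orth>
    </orths>
    <disambi><conn_d>0</conn_d><sem_d>0</sem_d></disambi>
    <focuspart>1</focuspart>
    <correlate/>
    <non_conn_reading/>
    <stts><example type="KOUS" tfreq="10">Er blieb, weil es regnete.</example></stts>
    <syn>
      <cat>subj</cat>
      <ordering><ante>1</ante><post>1</post></ordering>
    </syn>
  </entry>
</dimlex>
"""


def summarize(entry):
    """Collect spelling variants and syntactic readings of one <entry>."""
    orths = entry.findall("./orths/orth")
    # A variant is the space-joined text of its <part> children.
    variants = [" ".join(p.text or "" for p in o.findall("part")) for o in orths]
    # Assumption: the canonical spelling is flagged with canonical="1".
    canonical = [v for o, v in zip(orths, variants) if o.get("canonical") == "1"]
    readings = []
    for syn in entry.findall("syn"):  # several <syn/> blocks are allowed
        ordering = syn.find("ordering")
        readings.append({
            "cat": syn.findtext("cat"),
            "ordering": [c.tag for c in ordering] if ordering is not None else [],
        })
    return {
        "id": entry.get("id"),
        "canonical": canonical,
        "variants": variants,
        "syn": readings,
    }


root = ET.fromstring(SAMPLE)
summaries = [summarize(e) for e in root.findall("entry")]
print(summaries)
```

The sketch only lists which ordering elements are present; how their text content should be interpreted depends on the values actually used in the lexicon, which this description does not specify.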

DimLex.dtd

Lines changed: 104 additions & 189 deletions
@@ -1,210 +1,125 @@
1-
<?xml version='1.0' encoding='UTF-8'?>
1+
<?xml encoding="UTF-8"?>
22

3-
<!--
4-
TODO define vocabulary indentification
5-
PUBLIC ID: -//vendor//vocabulary//EN
6-
SYSTEM ID: http://server/path/DimLex.dtd
3+
<!ELEMENT dimlex (entry)+>
4+
<!ATTLIST dimlex
5+
xmlns CDATA #FIXED ''>
76

8-
--><!--
9-
An example how to use this DTD from your XML document:
7+
<!ELEMENT entry (orths,disambi,focuspart,correlate,non_conn_reading,
8+
stts,syn+)>
9+
<!ATTLIST entry
10+
xmlns CDATA #FIXED ''
11+
edit NMTOKEN #IMPLIED
12+
id NMTOKEN #REQUIRED
13+
word CDATA #REQUIRED>
1014

11-
<?xml version="1.0"?>
15+
<!ELEMENT orths (orth)+>
16+
<!ATTLIST orths
17+
xmlns CDATA #FIXED ''>
1218

13-
<!DOCTYPE dimlex SYSTEM "DimLex.dtd">
19+
<!ELEMENT disambi (conn_d,sem_d)>
20+
<!ATTLIST disambi
21+
xmlns CDATA #FIXED ''>
1422

15-
<dimlex>
16-
...
17-
</dimlex>
18-
-->
23+
<!ELEMENT focuspart (#PCDATA)>
24+
<!ATTLIST focuspart
25+
xmlns CDATA #FIXED ''>
1926

20-
<!--- Put your DTDDoc comment here. -->
21-
<!ELEMENT referierend EMPTY>
27+
<!ELEMENT correlate ((is_correlate,has_correlate)?,correlatee?)>
28+
<!ATTLIST correlate
29+
xmlns CDATA #FIXED ''>
2230

23-
<!--- Put your DTDDoc comment here. -->
24-
<!ELEMENT kopulativ EMPTY>
31+
<!ELEMENT non_conn_reading (#PCDATA|example)*>
32+
<!ATTLIST non_conn_reading
33+
xmlns CDATA #FIXED ''>
2534

26-
<!--- Put your DTDDoc comment here. -->
27-
<!ELEMENT konsekutiv EMPTY>
35+
<!ELEMENT stts (example)*>
36+
<!ATTLIST stts
37+
xmlns CDATA #FIXED ''>
2838

29-
<!--- Put your DTDDoc comment here. -->
30-
<!ELEMENT konzessiv EMPTY>
39+
<!ELEMENT syn (cat,praep?,ordering?)>
40+
<!ATTLIST syn
41+
xmlns CDATA #FIXED ''>
3142

32-
<!--- Put your DTDDoc comment here. -->
33-
<!ELEMENT komitativ EMPTY>
34-
35-
<!--- Put your DTDDoc comment here. -->
36-
<!ELEMENT konditional EMPTY>
37-
38-
<!--- Put your DTDDoc comment here. -->
39-
<!ELEMENT komparativ (#PCDATA)>
40-
41-
<!--- Put your DTDDoc comment here. -->
42-
<!ELEMENT konj EMPTY>
43-
44-
<!--- Put your DTDDoc comment here. -->
45-
<!ELEMENT alternativ EMPTY>
46-
47-
<!--- Put your DTDDoc comment here. -->
48-
<!ELEMENT restriktiv-conditional EMPTY>
49-
50-
<!--- Put your DTDDoc comment here. -->
51-
<!ELEMENT restriktiv EMPTY>
52-
53-
<!--- Put your DTDDoc comment here. -->
54-
<!ELEMENT einzel EMPTY>
55-
56-
<!--- Put your DTDDoc comment here. -->
57-
<!ELEMENT padv (satzklammer|nachnachfeld|nullstelle|nachfeld|nacherst|mittelfeld|vorfeld)*>
58-
59-
<!--- Put your DTDDoc comment here. -->
60-
<!ELEMENT ersatz EMPTY>
61-
62-
<!--- Put your DTDDoc comment here. -->
63-
<!ELEMENT substitutiv EMPTY>
64-
65-
<!--- Put your DTDDoc comment here. -->
66-
<!ELEMENT kasus (#PCDATA)>
67-
68-
<!--- Put your DTDDoc comment here. -->
69-
<!ELEMENT praep (kasus|post|ante)*>
70-
71-
<!--- Put your DTDDoc comment here. -->
72-
<!ELEMENT kontrastiv EMPTY>
73-
74-
<!--- Put your DTDDoc comment here. -->
75-
<!ELEMENT postp EMPTY>
76-
77-
<!--- Put your DTDDoc comment here. -->
78-
<!ELEMENT spezifizierung EMPTY>
79-
80-
<!--- Put your DTDDoc comment here. -->
81-
<!ELEMENT modal (komparativ|restriktiv|spezifizierung)*>
82-
83-
<!--- Put your DTDDoc comment here. -->
84-
<!ELEMENT spezifizierend EMPTY>
85-
86-
<!--- Put your DTDDoc comment here. -->
87-
<!ELEMENT kausal EMPTY>
88-
89-
<!--- Put your DTDDoc comment here. -->
90-
<!ELEMENT vorzeitigkeit EMPTY>
91-
92-
<!--- Put your DTDDoc comment here. -->
93-
<!ELEMENT nachzeitigkeit EMPTY>
43+
<!ELEMENT orth (part)+>
44+
<!ATTLIST orth
45+
xmlns CDATA #FIXED ''
46+
canonical CDATA #REQUIRED
47+
onr NMTOKEN #REQUIRED
48+
type NMTOKEN #REQUIRED>
49+
50+
<!ELEMENT conn_d (#PCDATA)>
51+
<!ATTLIST conn_d
52+
xmlns CDATA #FIXED ''
53+
edit NMTOKEN #IMPLIED>
54+
55+
<!ELEMENT sem_d (#PCDATA)>
56+
<!ATTLIST sem_d
57+
xmlns CDATA #FIXED ''
58+
edit NMTOKEN #IMPLIED>
59+
60+
<!ELEMENT is_correlate (#PCDATA)>
61+
<!ATTLIST is_correlate
62+
xmlns CDATA #FIXED ''>
63+
64+
<!ELEMENT has_correlate (#PCDATA)>
65+
<!ATTLIST has_correlate
66+
xmlns CDATA #FIXED ''>
67+
68+
<!ELEMENT correlatee (corr)+>
69+
<!ATTLIST correlatee
70+
xmlns CDATA #FIXED ''>
71+
72+
<!ELEMENT cat (#PCDATA)>
73+
<!ATTLIST cat
74+
xmlns CDATA #FIXED ''>
75+
76+
<!ELEMENT praep (ante,post,circum,case+)>
77+
<!ATTLIST praep
78+
xmlns CDATA #FIXED ''>
79+
80+
<!ELEMENT ordering ((ante,post)?,insert?,desintegr?)>
81+
<!ATTLIST ordering
82+
xmlns CDATA #FIXED ''
83+
edit NMTOKEN #IMPLIED>
9484

95-
<!--- Put your DTDDoc comment here. -->
96-
<!ELEMENT gleichzeitigkeit EMPTY>
85+
<!ELEMENT part (#PCDATA)>
86+
<!ATTLIST part
87+
xmlns CDATA #FIXED ''
88+
type NMTOKEN #REQUIRED>
9789

98-
<!--- Put your DTDDoc comment here. -->
99-
<!ELEMENT aspect (#PCDATA)>
100-
<!ATTLIST aspect
101-
scope CDATA #IMPLIED
102-
>
90+
<!ELEMENT corr (#PCDATA)>
91+
<!ATTLIST corr
92+
xmlns CDATA #FIXED ''>
10393

104-
<!--- Put your DTDDoc comment here. -->
105-
<!ELEMENT aktionsart (#PCDATA)>
106-
<!ATTLIST aktionsart
107-
scope CDATA #IMPLIED
108-
>
94+
<!ELEMENT circum (#PCDATA)>
95+
<!ATTLIST circum
96+
xmlns CDATA #FIXED ''>
10997

110-
<!--- Put your DTDDoc comment here. -->
111-
<!ELEMENT interval (#PCDATA)>
98+
<!ELEMENT case (#PCDATA)>
99+
<!ATTLIST case
100+
xmlns CDATA #FIXED ''
101+
edit NMTOKEN #IMPLIED>
112102

113-
<!--- Put your DTDDoc comment here. -->
114-
<!ELEMENT temporal (vorzeitigkeit|nachzeitigkeit|gleichzeitigkeit|aspect|aktionsart|interval)*>
103+
<!ELEMENT insert (#PCDATA)>
104+
<!ATTLIST insert
105+
xmlns CDATA #FIXED ''>
115106

116-
<!--- Put your DTDDoc comment here. -->
117107
<!ELEMENT desintegr (#PCDATA)>
108+
<!ATTLIST desintegr
109+
xmlns CDATA #FIXED ''>
118110

119-
<!--- Put your DTDDoc comment here. -->
120-
<!ELEMENT subj EMPTY>
121-
122-
<!--- Put your DTDDoc comment here. -->
123-
<!ELEMENT nintegr (konj|postp|subj)*>
124-
125-
<!--- Put your DTDDoc comment here. -->
126-
<!ELEMENT beispiel (#PCDATA)>
127-
128-
<!--- Put your DTDDoc comment here. -->
129-
<!ELEMENT relation (#PCDATA)>
130-
131-
<!--- Put your DTDDoc comment here. -->
132-
<!ELEMENT hbfunktion (kopulativ|konzessiv|konsekutiv|konditional|kausal|alternativ|ersatz|modal|temporal|adversativ)*>
111+
<!ELEMENT example (#PCDATA)>
112+
<!ATTLIST example
113+
xmlns CDATA #FIXED ''
114+
tfreq CDATA #IMPLIED
115+
type NMTOKEN #IMPLIED>
133116

134-
<!--- Put your DTDDoc comment here. -->
135-
<!ELEMENT adversativ EMPTY>
136-
137-
<!--- Put your DTDDoc comment here. -->
138-
<!ELEMENT grfunktion (referierend|kopulativ|konsekutiv|adversativ|temporal|kausal|spezifizierend|kontrastiv|substitutiv|restriktiv|restriktiv-conditional|alternativ|komparativ|konditional|komitativ|konzessiv)*>
139-
140-
<!--- Put your DTDDoc comment here. -->
141-
<!ELEMENT sem (beispiel|relation|hbfunktion|grfunktion)*>
142-
143-
<!--- Put your DTDDoc comment here. -->
144-
<!ELEMENT insert (#PCDATA)>
145-
146-
<!--- Put your DTDDoc comment here. -->
147-
<!ELEMENT post (#PCDATA)>
148-
149-
<!--- Put your DTDDoc comment here. -->
150117
<!ELEMENT ante (#PCDATA)>
118+
<!ATTLIST ante
119+
xmlns CDATA #FIXED ''
120+
edit NMTOKEN #IMPLIED>
151121

152-
<!--- Put your DTDDoc comment here. -->
153-
<!ELEMENT abfolge (desintegr|insert|post|ante)*>
154-
155-
<!--- Put your DTDDoc comment here. -->
156-
<!ELEMENT satzklammer (#PCDATA)>
157-
158-
<!--- Put your DTDDoc comment here. -->
159-
<!ELEMENT nachnachfeld (#PCDATA)>
160-
161-
<!--- Put your DTDDoc comment here. -->
162-
<!ELEMENT nullstelle (#PCDATA)>
163-
164-
<!--- Put your DTDDoc comment here. -->
165-
<!ELEMENT nachfeld (#PCDATA)>
166-
167-
<!--- Put your DTDDoc comment here. -->
168-
<!ELEMENT nacherst (#PCDATA)>
169-
170-
<!--- Put your DTDDoc comment here. -->
171-
<!ELEMENT mittelfeld (#PCDATA)>
172-
173-
<!--- Put your DTDDoc comment here. -->
174-
<!ELEMENT vorfeld (#PCDATA)>
175-
176-
<!--- Put your DTDDoc comment here. -->
177-
<!ELEMENT konnadv (satzklammer|nachnachfeld|nullstelle|nachfeld|nacherst|mittelfeld|vorfeld)*>
178-
179-
<!--- Put your DTDDoc comment here. -->
180-
<!ELEMENT integr (padv|konnadv)*>
181-
182-
<!--- Put your DTDDoc comment here. -->
183-
<!ELEMENT konn (einzel|nintegr|integr)*>
184-
185-
<!--- Put your DTDDoc comment here. -->
186-
<!ELEMENT kat (#PCDATA)>
187-
188-
<!--- Put your DTDDoc comment here. -->
189-
<!ELEMENT syn (praep|sem|abfolge|konn|kat)*>
190-
191-
<!--- Put your DTDDoc comment here. -->
192-
<!ELEMENT part (#PCDATA)>
193-
<!ATTLIST part
194-
type CDATA #IMPLIED
195-
>
196-
197-
<!--- Put your DTDDoc comment here. -->
198-
<!ELEMENT orth (part)*>
199-
<!ATTLIST orth
200-
type CDATA #IMPLIED
201-
>
202-
203-
<!--- Put your DTDDoc comment here. -->
204-
<!ELEMENT eintrag (syn|orth)*>
205-
<!ATTLIST eintrag
206-
id CDATA #IMPLIED
207-
>
208-
209-
<!--- Put your DTDDoc comment here. -->
210-
<!ELEMENT dimlex (eintrag)*>
122+
<!ELEMENT post (#PCDATA)>
123+
<!ATTLIST post
124+
xmlns CDATA #FIXED ''
125+
edit NMTOKEN #IMPLIED>
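
The new content models are strict sequences rather than the free `(...)*` choices of the old DTD, so the order of child elements now matters. The Python sketch below is one way to spot-check that with the standard library alone. It is not a DTD validator: the patterns in `CONTENT_MODELS` and `REQUIRED_ATTRS` were transcribed by hand from the added `<!ELEMENT>` and `<!ATTLIST>` declarations, mixed and `#PCDATA` content is not checked, and the `check` helper and both sample documents are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Child-sequence patterns transcribed by hand from the added <!ELEMENT> lines.
# Each child tag is followed by one space so that names cannot run together.
CONTENT_MODELS = {
    "dimlex": r"(entry )+",
    "entry": r"orths disambi focuspart correlate non_conn_reading stts (syn )+",
    "orths": r"(orth )+",
    "orth": r"(part )+",
    "disambi": r"conn_d sem_d ",
    "correlate": r"(is_correlate has_correlate )?(correlatee )?",
    "correlatee": r"(corr )+",
    "stts": r"(example )*",
    "syn": r"cat (praep )?(ordering )?",
    "praep": r"ante post circum (case )+",
    "ordering": r"(ante post )?(insert )?(desintegr )?",
}

# Attributes declared #REQUIRED in the added <!ATTLIST> lines.
REQUIRED_ATTRS = {
    "entry": ("id", "word"),
    "orth": ("canonical", "onr", "type"),
    "part": ("type",),
}


def check(root):
    """Return a list of structural problems found below (and including) root."""
    problems = []
    for el in root.iter():
        seq = "".join(child.tag + " " for child in el)
        model = CONTENT_MODELS.get(el.tag)
        if model is not None and re.fullmatch(model, seq) is None:
            problems.append(f"<{el.tag}>: unexpected children [{seq.strip()}]")
        for attr in REQUIRED_ATTRS.get(el.tag, ()):
            if attr not in el.attrib:
                problems.append(f"<{el.tag}>: missing required attribute '{attr}'")
    return problems


# Hypothetical documents; all attribute values and text content are invented.
GOOD = (
    '<dimlex><entry id="k1" word="weil">'
    '<orths><orth canonical="1" onr="k1o1" type="cont">'
    '<part type="single">weil</part></orth></orths>'
    "<disambi><conn_d>0</conn_d><sem_d>0</sem_d></disambi>"
    "<focuspart>1</focuspart><correlate/><non_conn_reading/><stts/>"
    "<syn><cat>subj</cat></syn>"
    "</entry></dimlex>"
)
BAD = (
    '<dimlex><entry word="x">'
    '<orths><orth canonical="1" onr="o1" type="cont">'
    '<part type="single">x</part></orth></orths>'
    "</entry></dimlex>"
)

good_problems = check(ET.fromstring(GOOD))
bad_problems = check(ET.fromstring(BAD))
print(good_problems)
print(bad_problems)
```

For real validation against `DimLex.dtd`, a validating XML parser is the better tool; the regular-expression approach here only mirrors the sequence and occurrence operators of the content models.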
