;;; -*- mode: fundamental; coding: utf-8; indent-tabs-mode: t; -*-
;;;
;;; Copyright (c) 2009 -- 2010 Stephan Oepen (oe@ifi.uio.no);
;;; see `LICENSE' for conditions.
;;;
;;;
;;; another shot at a finite-state language for preprocessing, normalization,
;;; and tokenization in LKB grammars. requires LKB version of 1-feb-09 or
;;; newer. note that the syntax is rigid: everything starting in column 2
;;; (i.e. right after the rule type marker) is used as the match pattern until
;;; the first `\t' (tabulator sign); one or more tabulators are considered the
;;; separator between the matching pattern and the replacement, but other
;;; whitespace will be considered part of the patterns. empty lines or lines
;;; with a semicolon in column 1 (i.e. in place of the rule type marker, this
;;; is not Lisp) will be ignored.
;;;
;;; this is a fresh attempt (as of September 2008) at input tokenization. for
;;; increased compatibility with existing tools (specifically taggers trained
;;; on the PTB), we now assume a PTB-like tokenization in pre-processing. the
;;; grammar includes token mapping rules (using the new chart mapping machinery
;;; in PET) to eventually adjust (i.e. correct, in some cases) tokenization to
;;; its needs. specifically, many punctuation marks will be re-combined with
;;; preceding or following tokens, reflecting standard orthographic convention,
;;; and are then analyzed as pseudo-affixes.
;;;
;;; this file is inspired by the PTB `tokenizer.sed' script, and by and large
;;; should yield very similar results. with the addition of token mapping as
;;; a separate step inside the parser, we want to restrict RE-based processing
;;; to pure string-level phenomena. however, to actually tokenize (following
;;; some set of principles), we need to do more than just break at whitespace.
;;; some punctuation marks give rise to token boundaries, but not all. also,
;;; inputs (in the 21st century) may contain some amount of mark-up, where XML
;;; character references have become relatively common. full UniCode support
;;; now makes it possible to represent a much larger range of characters, e.g.
;;; various types of quotes and dashes. we aim to map mark-up to corresponding
;;; UniCode characters, and preserve those in parsing, as much as possible.
;;;
;;; the original `tokenizer.sed' script actually cannot always yield the exact
;;; tokenization found in the PTB. the script unconditionally separates a set
;;; of punctuation or other non-alphanumeric characters (e.g. |&| and |!|) that
;;; may be part of a single token (say in |AT&T| or URLs). we aim to do better
;;; than the original script, here, conditioning on adjacent whitespace.
;;;
;;
;; preprocessor rules versioning; auto-maintained upon CVS (or SVN) check-in.
;;
@$Date: 2014-10-07 00:00:51 +0200 (Tue, 07 Oct 2014) $
;;
;; tokenization pattern: after normalization, the string will be broken up at
;; each occurrence of this pattern; the pattern match itself is deleted.
;;
:[ \t]+
;;;
;;; string rewrite rules: all matches, over the entire string, are replaced by
;;; the right-hand side; grouping (using `(' and `)' in the pattern) and group
;;; references (`\1' for the first group, et al.) carry over part of the match.
;;;
;;
;; pad the full string with trailing and leading whitespace; makes matches for
;; word boundaries a little easier down the road; also, squash multiple spaces
;; and replace tabulators and ‘weird’ spaces with an ASCII space.
;;
;; _fix_me_
;; in the 21st century, there are of course various whitespace code points, so
;; maybe we should look for a UniCode property to use here. however, it used
;; to be the case, at least, that the LKB implementation of REPP did not have
;; the full UniCode CL-PPCRE extensions (and even if it did, one might first
;; have to read up on interactions of locales and UniCode properties, before
;; relying on such fancy whitespace detection). (29-nov-12; oe)
;;
!^(.+)$ \1
![\t ]
! +
;; a set of `mark-up modules', often replacing mark-up character entities with
;; actual UniCode characters (e.g. |&mdash;| or |---| as |—|), or just ditching
;; mark-up that has no bearing on parsing for now (e.g. most wiki mark-up).
;; these modules can be activated selectively by name in the REPP environment
;; or the top-level call into REPP.
>xml
>latex
>ascii
>html
>wiki
>lgt
>gml
;;
;; we hope to keep it to a minimum and only invoke it in select configurations,
;; but at times we need to heuristically `fix up' deficiencies in the input;
;; for example spurious duplicate punctuation, and such.
;;
>robustness
;;
;; our treatment of quote marks continues to evolve. as of february 2010, it
;; seems we need to support two distinct scenarios: (a) inputs using standard
;; typography (quotes either are directional already or can be disambiguated by
;; adjacency to preceding or following whitespace) vs. (b) inputs that already
;; have been pre-tokenized or otherwise damaged (as in the current CoNLL 2010
;; training data), where we stand no chance of disambiguating quote marks. in
;; downstream processing, we thus need to handle six types of quotes: genuine
;; directional quotes (|“|, |”|, |‘|, and |’|) and ambiguous, straight ones
;; (|"| and |'|). the former group can be restricted to opening and closing
;; functions, and in the case of closing quotes also to apostrophes and units
;; of measure (feet and inches, or seconds and minutes; for which in `proper'
;; typography the double prime and prime symbols could be used); the straight
;; quotes, on the other hand, need to be considered ambiguous between opening
;; and closing, apostrophe, or prime usages. (9-feb-10; oe)
;;
>quotes
;;
;; two special cases involving periods: map ASCII ellipsis (|...|) to a single
;; UniCode character (|…|), and convert |..| between numbers into an n-dash,
;; i.e. a numeric range (typically tokenized off, i.e. |42| |–| |43|). maybe
;; the latter can also occur between non-numbers? we could also just preserve
;; it, but always make it a token in its own right?
;;
;; _fix_me_
;; what about a sentence-final period following the ellipsis (as in cb/7060)?
;; (24-sep-08; oe)
;; _fix_me_
;; several years later, i believe the ellipsis (|…|) should just be considered
;; a suffix punctuation mark (possibly contributing a piece of semantics) that
;; contributes the same punctuation quality as token-final periods. however,
;; so far we have kept periods following ellipsis as separate tokens, and this
;; also appears to be the analysis in the venerable PTB (and hence we cannot
;; revise it while finalizing DeepBank 1.0). at the same time, the PTB has a
;; number of sentences ending in just ellipsis, and (for maximum compliance)
;; we want to normalize these to the UniCode glyph too. (10-feb-13; oe)
;;
!(?:(?:\. ){3,}|(?:\.){3,})(?![])}”’ ]*$) …
!(?:\. ){3}|(?:\.){3} …
!([0-9]) *\.\. *([0-9]) \1 – \2
;;
;; some UniCode characters force token boundaries: m-dash, ellipsis. we used
;; to include n-dashes in this list, but some authors (e.g. ESR) use n-dashes
;; much like hyphens, attempting to be clever about bracketing in expressions
;; like |non–source-aware| [users].
;;
!([—…]) \1
;;
;; remove space after initial |O'| and |L'|, i.e. irish and romance names, to
;; avoid stripping off their apostrophes.
;;
! ([OL])['’] \1’
;;
;; a new REPP facility: named groups and iterative group calls. there are a
;; number of characters that PTB tokenizes off (unconditionally, it seems, in
;; the original `tokenizer.sed'), though not when they are parts of names or
;; NE patterns, e.g. |AT&T| or |http://www.emmtee.net/?foo.php&bar=42|. thus,
;; we only want these as separate tokens when they are preceded or followed by
;; whitespace; this leaves a problem with, say, |http://www.emmtee.net/|, where
;; one would have to apply NE recognition (what used to be `ersatzing') _prior_
;; to tokenization.
;;
;; either way, because characters we want to tokenize off might be `clustered'
;; with each other, e.g. |(42%), |, the notion of adjacent whitespace needs to
;; apply transitively through such clusters. it seems an iterative group is
;; the most straightforward way of getting that effect. the rules from the
;; group will be applied repeatedly (in order) at the time the group is called
;; (by means of the `>' operator), until there are no further matches. we need
;; to be careful to avoid indefinite recursion within the group, i.e. not add
;; duplicate spaces. thus, once again, make sure the full string is bracketed
;; by single spaces, and ditch multiple spaces initially; then we can require a
;; non-whitespace preceding or following context, besides the actual spaces.
;;
;; at this point, we exclude a few punctuation characters from this policy, in
;; part because that is the PTB approach (|-| and |/|), in part because they
;; can be prefixes or suffixes of one-token named entities, i.e. |<| and |>| in
;; URLs and email addresses. to work around these, we may need a string-level
;; `ersatzing' facility, associating a sub-string (that can be unambiguously
;; identified by surface properties, e.g. a URL) with an identifier of a token
;; class.
;;
;; like in the original PTB script, periods are only tokenized off in sentence-
;; final position, maybe followed only by closing quote marks or parentheses.
;;
;; the story for parentheses and square brackets is somewhat involved, to avoid
;; breaking up inputs like |factor(s)| or |Ca[2+]|.
;;
;; _fix_me_
;; our current solution unconditionally strips off leading parentheses, which
;; would break an input like |(mis-)read|. talk to dan about how best to deal
;; with these (maybe see whether we find actual examples). (26-feb-10; oe)
;;
;; _fix_me_
;; there is an issue with some of the characters that are asserted (at least in
;; whitespace adjacency) to constitute separate tokens, specifically the dollar
;; sign. inputs like |HK$ 7.8| will end up with a bogus token boundary (which
;; is the case too in the original PTB sed(1) script). (15-jul-09; oe)
;;
!^(.+)$ \1
! +
#1
! ([^ (]+)(\)) ([^ ]|$) \1 \2 \3
! ([^ []+)(]) ([^ ]|$) \1 \2 \3
!([^ ])([]}?!,;:@#$€¢£¥%&”"’']) ([^ ]|$) \1 \2 \3
!([^ ])(\.) ([])}”"’'… ]*)$ \1 \2 \3
!(^|[^ ]) ([[({:@#$€¢£¥%&“‘])([^ ]) \1 \2 \3
#
>1
;;
;; _fix_me_
;; to address the above issue about over-tokenizing, for example, |HK$ 7.8|,
;; it might be convenient to introduce non-breaking spaces in the above set of
;; rules, so that we could now put back together |HK $| with peace of mind?
;; (2-dec-12; oe)
;;
!((?:AU?|CA?|HK|NZ|US)?\$)([0-9]+) \1 \2
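;;
;; for illustration (a worked example, assuming only the rule above applies):
;; a dollar sign, optionally prefixed by a currency code, is separated from
;; immediately following digits, e.g. |US$42| --> |US$ 42|.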
;;
;; any word-final apostrophe, by now, should be separated (e.g. |abrams'| -->
;; |abrams '|). which only leaves contracted forms, including the undesirable
;; PTB ones, e.g. |don't| --> |do n't|. but not |cannot| --> |can not| and the
;; more obscure ones: |gimme|, |lemme|, |'tis|, |wanna|, et al.
;;
;; _fix_me_
;; the |cannot| case, especially without characterization information, is a bit
;; challenging: it presumably is frequent enough so that for PTB compliance we
;; should pull it apart, but that would seem to introduce unwanted ambiguity.
;; i doubt that |she cannot participate on monday| has the reading of her being
;; able to `not participate' (stay out of the way) on monday. or does it?
;; (19-sep-08; oe)
;;
;; _fix_me_
;; starting in mid-2012, normalize towards UniCode apostrophes (where we used
;; to normalize towards ASCII straight quotes). as a consequence, we no longer
;; expect straight quotes in the `standard' setup, i.e. lexical entries need to
;; be available with UniCode apostrophes. for robustness to non-standard modes
;; (e.g. running without REPP), however, we should either keep duplicates of
;; all lexical entries containing apostrophes, or normalize in token mapping?
;; (26-aug-12; oe)
;;
!([^ ])[’']([dDmMsS]) \1 ’\2
!([^ ])[’'](ll|LL|re|RE|ve|VE) \1 ’\2
!([^ ])([nN])[’']([tT]) \1 \2’\3
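;;
;; for illustration (worked examples, assuming only the three rules above
;; apply): |I'm| --> |I ’m|, |she'll| --> |she ’ll|, and |don't| -->
;; |do n’t|; either apostrophe variant is normalized to |’| in the output.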
;;
;; _fix_me_
;; this is arguably a PTB idiosyncrasy, but (at least while finalizing DeepBank
;; 1.0) we want it in the ‘standard’ parsing configuration. some lines start
;; with lower-case, one-letter identifiers, possibly remnants of footnotes or
;; something similar. (10-feb-13; oe)
;;
!^( *[a-z])(-)([[({“‘0-9A-Z]) \1 \2 \3
>ptb
;;
;; to allow parsing (of inputs involving basic punctuation) in the LKB, there
;; is a REPP module to undo PTB-style separation of tokens. this module will
;; only be activated for use within the LKB, not by preprocess-for-pet().
;;
>lkb