erg/tokenizer.rpp

;;; -*- mode: fundamental; coding: utf-8; indent-tabs-mode: t; -*-

;;;
;;; Copyright (c) 2009 -- 2010 Stephan Oepen (oe@ifi.uio.no); 
;;; see `LICENSE' for conditions.
;;;


;;;
;;; another shot at a finite-state language for preprocessing, normalization,
;;; and tokenization in LKB grammars.  requires LKB version of 1-feb-09 or
;;; newer.  note that the syntax is rigid: everything starting in column 2
;;; (i.e. right after the rule type marker) is used as the match pattern until
;;; the first `\t' (tabulator sign); one or more tabulators are considered the
;;; separator between the matching pattern and the replacement, but other
;;; whitespace will be considered part of the patterns.  empty lines or lines
;;; with a semicolon in column 1 (i.e. in place of the rule type marker, this
;;; is not Lisp) will be ignored.
;;;
;;; this is a fresh attempt (as of September 2008) at input tokenization.  for
;;; increased compatibility with existing tools (specifically taggers trained
;;; on the PTB), we now assume a PTB-like tokenization in pre-processing.  the
;;; grammar includes token mapping rules (using the new chart mapping machinery
;;; in PET) to eventually adjust (i.e. correct, in some cases) tokenization to
;;; its needs.  specifically, many punctuation marks will be re-combined with 
;;; preceding or following tokens, reflecting standard orthographic convention,
;;; and are then analyzed as pseudo-affixes.  
;;;
;;; this file is inspired by the PTB `tokenizer.sed' script, and by and large
;;; should yield very similar results.  with the addition of token mapping as
;;; a separate step inside the parser, we want to restrict RE-based processing
;;; to pure string-level phenomena.  however, to actually tokenize (following
;;; some set of principles), we need to do more than just break at whitespace.
;;; some punctuation marks give rise to token boundaries, but not all.  also,
;;; inputs (in the 21st century) may contain some amount of mark-up, where XML
;;; character references have become relatively common.  full UniCode support
;;; now makes it possible to represent a much larger range of characters, e.g.
;;; various types of quotes and dashes.  we aim to map mark-up to corresponding
;;; UniCode characters, and preserve those in parsing, as much as possible.
;;;
;;; the original `tokenizer.sed' script actually cannot always yield the exact
;;; tokenization found in the PTB.  the script unconditionally separates a set
;;; of punctuation or other non-alphanumeric characters (e.g. |&| and |!|) that
;;; may be part of a single token (say in |AT&T| or URLs).  we aim to do better
;;; than the original script, here, conditioning on adjacent whitespace.
;;;


;;
;; preprocessor rules versioning; auto-maintained upon CVS (or SVN) check-in.
;; 
@$Date: 2014-10-07 00:00:51 +0200 (Tue, 07 Oct 2014) $

;;
;; tokenization pattern: after normalization, the string will be broken up at
;; each occurrence of this pattern; the pattern match itself is deleted.
;;
:[ \t]+

;;;
;;; string rewrite rules: all matches, over the entire string, are replaced by
;;; the right-hand side; grouping (using `(' and `)') in the pattern) and group
;;; references (`\1' for the first group, et al.) carry over part of the match.
;;;

;;
;; pad the full string with trailing and leading whitespace; makes matches for
;; word boundaries a little easier down the road; also, squash multiple spaces
;; and replace tabulators and ‘weird’ spaces with an ASCII space.
;;
;; _fix_me_
;; in the 21st century, there are of course various whitespace code points, so
;; maybe we should look for a UniCode property to use here.  however, it used
;; to be the case, at least, that the LKB implementation of REPP did not have
;; the full UniCode CL-PPCRE extensions (and even if it did, one might first
;; have to read up on interactions of locales and UniCode properties, before
;; relying on such fancy whitespace detection.                 (29-nove-12; oe)
;;
!^(.+)$								 \1 
![\t　]								 
!  +								 

;; a set of `mark-up modules', often replacing mark-up character entities with
;; actual UniCode characters (e.g. |&mdash;| or |---| as |—|), or just ditching
;; mark-up that has no bearing on parsing for now (e.g. most wiki mark-up).
;; these modules can be activated selectively by name in the REPP environment
;; or the top-level call into REPP.

>xml
>latex
>ascii
>html
>wiki
>lgt
>gml

;;
;; we hope to keep it to a minimum and only invoke it in select configurations,
;; but at times we need to heuristically `fix up' deficiencies in the input;
;; for example spurious duplicate punctuation, and such.
;;
>robustness

;;
;; our treatment of quote marks continues to evolve.  as of february 2010, it
;; seems we need to support two distinct scenarios: (a) inputs using standard
;; typography (quotes either are directional already or can be diambiguated by
;; adjacency to preceding or following whitespace) vs. (b) inputs that already
;; have been pre-tokenized or otherwise damaged (as in the current CoNLL 2010
;; training data), where we stand no chance of disambiguating quote marks.  in
;; downstream processing, we thus need to handle six types of quotes: genuine
;; directional quotes (|“|, |”|, |‘|, and |’|) and ambiguous, straight ones
;; (|"| and |'|).  the former group can be restricted to opening and closing
;; functions, and in the case of closing quotes also to apostrophes and units
;; of measure (feet and inches, or seconds and minutes; for which in `proper'
;; typography the double prime and prime symbols could be used); the straight
;; quotes, on the other hand, need to be considered ambiguous between opening
;; and closing, apostrophe, or prime usages.                    (9-feb-10; oe)
;;
>quotes

;;
;; two special cases involving periods: map ASCII ellipsis (|...|) to a single
;; UniCode character (|…|), and convert |..| between numbers into an n-dash,
;; i.e. a numeric range (typically tokenized off, i.e. |42| |–| |43|).  maybe
;; the latter can also occur between non-numbers?  we could also just preserve
;; it, but always make it a token in its own right?
;;
;; _fix_me_
;; what about a sentence-final period following the ellipsis (as in cb/7060)?
;;                                                              (24-sep-08; oe)
;; _fix_me_
;; several years later, i believe the ellipsis (|…|) should just be considered
;; a suffix punctuation mark (possibly contributing a piece of semantics) that
;; contributes the same punctuation quality as token-final periods.  however,
;; so far we have kept periods following ellipsis as separate tokens, and this
;; also appears to be the analysis in the venerable PTB (and hence we cannot
;; revise it while finalizing DeepBank 1.0).  at the same time, the PTB has a
;; number of sentences ending in just ellipsis, and (for maximum compliance) 
;; we want to normalize these to the UniCode glyph too.         (10-feb-13; oe)
;;
!(?:(?:\. ){3,}|(?:\.){3,})(?![])}”’ ]*$)			…
!(?:\. ){3}|(?:\.){3}						…
!([0-9]) *\.\. *([0-9])						\1 – \2


;;
;; some UniCode characters force token boundaries: m-dash, ellipsis.  we used
;; to include n-dashes in this list, but some authors (e.g. ESR) use n-dashes
;; much like hyphens, attempting to be clever about bracketing in expressions
;; like |non–source-aware| [users].
;;
!([—…])								 \1 

;;
;; remove space after initial |O'| and |L'|, i.e. irish and romance names, to
;; avoid stripping off their apostrophes.
;;
! ([OL])['’] 							 \1’

;;
;; a new REPP facility: named groups and iterative group calls.  there are a
;; number of characters that PTB tokenizes off (unconditionally, it seems, in
;; the original `tokenizer.sed'), though not when they are parts of names or
;; NE patterns, e.g. |AT&T| or |http://www.emmtee.net/?foo.php&bar=42|.  thus,
;; we only want these as separate tokens when they are preceded or followed by
;; whitespace; this leaves a problem with, say, |http://www.emmtee.net/|, where
;; one would have to apply NE recognition (what used to be `ersatzing') _prior_
;; to tokenization.
;;
;; either way, because characters we want to tokenize off might be `clustered'
;; with each other, e.g. |(42%), |, the notion of adjacent whitespace needs to
;; apply transitively through such clusters.  it seems an iterative group is
;; the most straightforward way of getting that effect.  the rules from the
;; group will be applied repeatedly (in order) at the time the group is called
;; (by means of the `>' operator), until there are no further matches.  we need
;; to be careful to avoid indefinite recursion within the group, i.e. not add
;; duplicate spaces.  thus, once again, make sure the full string is bracketed
;; by single spaces, and ditch multiple spaces initially; then we can require a
;; non-whitespace preceding or following context, besides the actual spaces.
;;
;; at this point, we exclude a few punctuation characters from this policy, in
;; part because that is the PTB approach (|-| and |/|), in part because they
;; can be prefixes or suffixes of one-token named entities, i.e. |<| and |>| in
;; URLs and email addresses.  to work around these, we may need a string-level
;; `ersatzing' facility, associating a sub-string (that can be unambiguously 
;; identified by surface properties, e.g. a URL) with an identifier of a token
;; class.
;;
;; like in the original PTB script, periods are only tokenized off in sentence-
;; final position, maybe followed only by closing quote marks or parentheses.
;;
;; the story for parentheses and square brackets is somehwat involved, to avoid
;; breaking up inputs like |factor(s)| or |Ca[2+]|.
;;
;; _fix_me_
;; our current solution unconditionally strips off leading parentheses, which 
;; would break an input like |(mis-)read|.  talk to dan about how best to deal
;; with these (maybe see whether we find actual examples).     (26-feb-10; oe)
;;
;; _fix_me_
;; there is an issue with some of the characters that are asserted (at least in
;; whitespace adjacency) to constitute separate tokens, specifically the dollar
;; sign.  inputs like |HK$ 7.8| will end up with a bogus token boundary (which
;; is the case too in the original PTB sed(1) script).         (15-jul-09; oe)
;;
!^(.+)$								 \1 
!  +								 
#1
! ([^ (]+)(\)) ([^ ]|$)						 \1 \2 \3
! ([^ []+)(]) ([^ ]|$)						 \1 \2 \3
!([^ ])([]}?!,;:@#$€¢£¥%&”"’']) ([^ ]|$)			\1 \2 \3
!([^ ])(\.) ([])}”"’'… ]*)$					\1 \2 \3
!(^|[^ ]) ([[({:@#$€¢£¥%&“‘])([^ ])				\1 \2 \3
#

>1

;;
;; _fix_me_
;; to address the above issue about over-tokenizing, for example, |HK$ 7.8|, 
;; it might be convenient to introduce non-breaking spaces in the above set of
;; rules, so that we could now put back together |HK $| with peace of mind?
;;                                                               (2-dec-12; oe)
;;
!((?:AU?|CA?|HK|NZ|US)?\$)([0-9]+)				\1 \2

;;
;; any word-final apostrophe, by now, should be separated (e.g. |abrams'| -->
;; |abrams '|).  which only leaves contracted forms, including the undesirable
;; PTB ones, e.g. |don't| --> |do n't|.  but not |cannot| --> |can not| and the
;; more obscure ones: |gimme|, |lemme|, |'tis|, |wanna|, et al.
;;
;; _fix_me_
;; the |cannot| case, especially without characterization information, is a bit
;; challenging: it presumably is frequent enough so that for PTB compliance we
;; should pull it apart, but that would seem to introduce unwanted ambiguity.
;; i doubt that |she cannot participate on monday| has the reading of her being
;; able to `not participate' (stay out of the way) on monday.  or does it?
;;                                                              (19-sep-08; oe)
;;
;; _fix_me_
;; starting in mid-2012, normalize towards UniCode apostrophes (where we used
;; to normalize towards ASCII straight quotes).  as a consequence, we no longer
;; expect straight quotes in the `standard' setup, i.e. lexical entries need to
;; be available with UniCode apostrophes.  for robustness to non-standard modes
;; (e.g. running without REPP), however, we should either keep duplicates of
;; all lexical entries containing apostrophes, or normalize in token mapping?
;;                                                              (26-aug-12; oe)
;;
!([^ ])[’']([dDmMsS]) 						\1 ’\2 
!([^ ])[’'](ll|LL|re|RE|ve|VE) 					\1 ’\2 
!([^ ])([nN])[’']([tT]) 					\1 \2’\3 

;;
;; _fix_me_
;; this is arguable a PTB idiosyncrasy, but (at least while finalizing DeepBank
;; 1.0) we want it in the ‘standard’ parsing configuration.  some lines start
;; with lower-case, one-letter identifiers, possibly remnants of footnotes or
;; something similar.                                           (10-feb-13; oe)
;;
!^( *[a-z])(-)([[({“‘0-9A-Z])					\1 \2 \3

>ptb

;;
;; to allow parsing (of inputs involving basic punctuation) in the LKB, there 
;; is a REPP module to undo PTB-style separation of tokens.  this module will
;; only be activated for use within the LKB, not by preprocess-for-pet().
;;
>lkb