PetInput
This page discusses some of the available input formats to the PET parser cheap, viz. 'pure' textual input and the so-called YY mode for lattice-based input. These two modes of giving input to the parser are the most traditional ones, but in more recent developments, additional XML-based input formats have been developed. Please see the PetInputChart and SmafTop pages for alternative, lattice-based XML input modes.
This page was predominantly authored by StephanOepen, who is its current maintainer. Please do not make substantial changes unless you (a) are quite certain of the technical correctness of your revisions and (b) believe strongly that your changes are compatible with the general design and recommended use patterns for PET input processing and this page.
By default, cheap expects plain text input, one sentence (or, more generally, utterance) per line. The parser applies a very simple-minded tokenizer, breaking the input string into tokens at all occurrences of whitespace. There are a few quirks and configuration options for this input mode, e.g. the ability to convert LaTeX-style accented characters into Unicode characters, or the historic, so-called LinGO tokenizer, which tries to handle contracted auxiliaries in (what in the 1990s seemed like) the proper manner.
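For example, assuming a file text.txt containing one sentence per line (the file name is purely illustrative), plain text parsing could be invoked along these lines:
cheap -packing -mrs english.grm < text.txt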
Punctuation characters, as specified in the settings file, are ignored by PET (removed from the input chart) for pure, textual input. Here is an example of the punctuation characters found in pet/japanese.set:
punctuation-characters := "\"!&'()*+,-−./;<=>?@[\]^_`{|}~。?…., ○●◎*".
Note that punctuation-characters are defined separately for the LKB (typically in lkb/globals.lsp) and that, in recent years, grammars are moving towards inclusion of punctuation marks in the syntactic analysis.
Punctuation characters are not removed from the other input modes (YY mode, PET Input Chart, or MAF). Rather, in these modes they should be removed (or treated otherwise, as appropriate) by the preprocessor that created the token lattice (in whatever syntax) provided to PET.
YY input mode (activated by the -yy option) facilitates parsing from a partial (lexical) chart, i.e. it assumes that tokenization (and other text-level pre-processing) has been performed outside of cheap. YY input mode supports token-level ambiguity, multi-word tokens, some control over what PET should do for morphological analysis, the use of PoS tags on input tokens to enable (better) unknown word handling, and generally feeding a word graph (as, for example, obtained from a speech recognizer) into the parser.
Following is a discussion of the YY [http://svn.delph-in.net/erg/trunk/pet/sample.yy input example] provided with the ERG (as of early 2009). In this example, the tokens are shown on separate lines for clarity. In the actual input given to PET, all YY tokens must appear on a single line (terminated by a newline), as each line of input is processed as a separate utterance.
(42, 0, 1, <0:11>, 1, "Tokenization", 0, "null", "NNP" 0.7677 "NN" 0.2323)
(43, 1, 2, <12:12>, 1, ",", 0, "null", "," 1.0000)
(44, 2, 3, <14:14>, 1, "a", 0, "null", "DT" 1.0000)
(45, 3, 4, <16:26>, 1, "non-trivial", 0, "null", "JJ" 1.0000)
(46, 4, 5, <28:35>, 1, "exercise", 0, "null", "NN" 0.9887 "VB" 0.0113)
(47, 5, 6, <36:36>, 1, ",", 0, "null", "," 1.0000)
(48, 6, 7, <38:43>, 1, "bazed", 0, "null", "VBD" 0.5975 "VBN" 0.4025)
(49, 7, 8, <45:57>, 1, "oe@ifi.uio.no", 0, "null", "NN" 0.7342 "JJ" 0.2096)
(50, 8, 9, <58:58>, 1, ".", 0, "null", "." 1.0000)
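To emphasize the point made above: in the actual input file, these tokens form one single line. That line would thus begin as follows, with the remaining tokens following on the same line, separated by whitespace:
(42, 0, 1, <0:11>, 1, "Tokenization", 0, "null", "NNP" 0.7677 "NN" 0.2323) (43, 1, 2, <12:12>, 1, ",", 0, "null", "," 1.0000) (44, 2, 3, <14:14>, 1, "a", 0, "null", "DT" 1.0000)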
An input in this form can be processed by PET as follows:
cheap -yy -packing -verbose=4 -mrs \
-chart-mapping -default-les=all english.grm < pet/sample.yy
where -yy (a shorthand for -tok=yy) turns on YY partial chart input mode, and we request ambiguity packing (which is always a good idea), some verbosity of tracing, and the output of MRSs. The additional options enable chart mapping (see [http://www.lrec-conf.org/proceedings/lrec2008/summaries/349.html Adolphs, et al. (2008)]) and turn the unknown word machinery into 2008 mode (see the section Unknown Word Handling below). Note that these options, as of early 2009, are only supported in the so-called chart mapping [https://pet.opendfki.de/repos/pet/branches/cm branch] of the PET code base (corresponding pre-compiled binaries are available in the LOGON tree; see the LogonTop page).
Each token in the above example has the following format:
(id, start, end, [link,] path+, form [surface], ipos, lrule+[, {pos p}+])
i.e. each token has a unique identifier and start and end vertex. Optionally, tokens can be annotated with a surface link, an indication of underlying string positions in the original document; currently (as of January 2009), link information is only supported as character positions, in the format <from:to> (but in principle, link could have other forms, with from and to being arbitrary strings, e.g. stand-off pointers in whatever underlying markup). We will ignore the path component (membership in one or more paths through a word lattice) for our purposes.
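Reading the first token of the example above against this template: the id is 42, the start and end vertices are 0 and 1, the link <0:11> points at characters 0 to 11 of the underlying document, the path value is 1, the form is "Tokenization" (with no separate surface), ipos is 0, the sole lrule value is "null", and two PoS tag and probability pairs conclude the token.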
The actual token string is provided by the form field, and this is what PET uses for morphological analysis and lexical look-up. In case the form does not correspond to the original string in the document, e.g. because there was some textual normalization prior to the creation of YY tokens, the optional surface field can be used to record the original string. Until early 2009, the ERG had inherited a mechanism called ersatzing, where a set of regular expressions was applied prior to parsing, associating for example a form value of EmailErsatz with a surface value of oe@yy.com. In the newer, chart mapping universe, the ERG no longer makes use of this facility and instead makes it a policy to never 'mess' with the actual token string (but to use other token properties instead).
YY mode can be used in two variants with respect to morphological analysis. Our example above leaves morphological analysis to PET, i.e. it uses the lexical rules and orthographemic annotation provided by the grammar. This built-in morphology mode is activated by an lrule value of "null", in which case the ipos field is ignored (but still has to be given, conventionally as 0). The other option is to provide information about morphological segmentation as part of the input tokens, in which case ipos specifies the position to which orthographemic rules apply, and one or more lrule values (as strings) name lexical rules provided by the grammar.
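For illustration only, a token supplying its own morphological analysis might look as follows; the lexical rule name v_pst_olr is a hypothetical stand-in for whatever past tense rule the grammar actually provides, and the ipos value of 1 is merely illustrative:
(48, 6, 7, <38:43>, 1, "bazed", 1, "v_pst_olr", "VBD" 0.5975)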
Finally, each token can be annotated with an optional sequence of tag plus probability pairs. The ERG, for example, includes a set of underspecified generic lexical entries which can be activated on the basis of PoS information, obtained for example from running a PoS tagger prior to parsing. We used to include the probabilities in (heuristic) parse ranking, but since sometime in 2002 (when MaxEnt parse selection became available in PET) they are just ignored.
YY input mode supports a genuine token lattice, i.e. it is legitimate to have multiple tokens for an input position, or tokens spanning multiple positions.
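For illustration, here is a hypothetical lattice fragment with a multi-word token: token 3 spans the same input positions as the sequence of tokens 1 and 2, and both paths through the lattice will be considered by the parser:
(1, 0, 1, <0:2>, 1, "New", 0, "null") (2, 1, 2, <4:7>, 1, "York", 0, "null") (3, 0, 2, <0:7>, 1, "New York", 0, "null")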
As of early 2009, there are two modes of detecting and handling unknown words, i.e. input tokens for which no native lexical entry is available. Common to both modes is their use of underspecified, so-called generic lexical entries. In a nutshell, these entries are instantiated for gaps in the lexical chart, i.e. input positions for which no native entries were found. The variation in different modes of unknown word handling relates to (a) how lexical gaps are detected and (b) the selection of which generic entries to instantiate. For historic reasons, we document the older unknown word handling mode first, pointing out its limitations along the way. The newer, cleaner, and more flexible approach to unknown word handling is summarized in section Unknown Word Handling and Chart Mapping below.
Unknown word handling is activated by the command-line option -default-les. For this option to take effect, the grammar has to provide one or more lexical entries marked as generic, by means of their TDL status value. For example, the ERG includes the following declaration (in pet/common.set):
generic-lexentry-status-values := generic-lex-entry.
Actual generic entries are defined in the ERG file [http://svn.delph-in.net/erg/trunk/gle.tdl], which is loaded (in the top-level grammar file english.tdl) as follows:
:begin :instance :status generic-lex-entry.
:include "gle".
:end :instance.
When -default-les is turned on without additional settings, all generic entries will be activated for each lexical gap; in other words, there is no control over which entries are used at each gap position, and it is left to the larger syntactic context to determine the category of the unknown token(s). With inputs exhibiting a non-trivial proportion of unknown words, this approach can lead to massive lexical and syntactic ambiguity and, in the worst case, may be computationally intractable.
Since around 2002, the ERG has had the ability to use an external PoS tagger to selectively activate generic entries; this mode of operation assumes that input tokens are decorated with one or more PoS tags (as in our example above), and that the grammar provides a mapping from PoS tags to the identifiers of generic lexical entries. This mapping can be provided by the posmapping declaration in one of the settings files, for example (from older versions of the ERG):
posmapping :=
JJ $generic_adj
JJR $generic_adj_compar
JJS $generic_adj_superl
CD $generic_number
NN $generic_mass_count_noun
NNS $generic_pl_noun
NNPS $generic_pl_noun
NNP $genericname
FW $generic_mass_noun
RB $generic_adverb
VB $generic_trans_verb_bse
VBD $generic_trans_verb_past
VBG $generic_trans_verb_prp
VBN $generic_trans_verb_psp
VBP $generic_trans_verb_presn3sg
VBZ $generic_trans_verb_pres3sg
.
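To connect this back to the example input above: the unknown token bazed, tagged VBD and VBN, would activate the generic entries $generic_trans_verb_past and $generic_trans_verb_psp under this mapping (assuming a lexical gap at that position).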
To further constrain the postulation of generic lexical entries, cheap provides two optional filtering mechanisms (both somewhat ad-hoc). The first of these can be used to impose suffix constraints on the actual token string giving rise to a generic lexical entry. For example (again from older ERG revisions):
generic-le-suffixes :=
$generic_trans_verb_pres3sg "S"
$generic_trans_verb_past "ED"
$generic_trans_verb_psp "ED"
$generic_trans_verb_prp "ING"
$generic_pl_noun "S"
.
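With these settings in place, for example, $generic_pl_noun will only be postulated for token strings that actually end in s, and the past tense and past participle generics require forms ending in ed.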
But this approach interoperates poorly with the ERG treatment of punctuation (as pseudo-affixes), which was introduced sometime around 2005.
Another configuration mechanism can be used to let PoS tags augment native lexical entries, i.e. an attempt to address incomplete lexical coverage, say a use of the word bus as a verb when the native lexicon only provides a nominal reading. However, seeing that recent developments have made this configuration obsolete too (and it was never really used in production anyway), it shall suffice to 'document' it by means of the comments from the file pet/common.set in earlier ERG revisions:
;;;
;;; the setting `pos-completion' enables an additional mechanism to do with
;;; processing of generic lexical entries: whenever we receive POS information
;;; as part of the input, we check to see whether the built-in lexical entries
;;; suffice to satisfy the POS annotations: each lexical entry retrieved for an
;;; input token
;;;
;;; <string, pos_1, pos_2, pos_3>
;;;
;;; is mapped to an application-specific POS tag, using the `type-to-pos' map,
;;; and checking the type of each lexical entry for subsumption against the
;;; left-hand side of each `type-to-pos' rule. some or all POS annotations
;;; from the input may be `satisfied' under this mapping by built-in lexical
;;; entries, e.g. for the example above, there may be lexical entries whose
;;; type maps to `pos_1' and `pos_3'; unless all POS annotations are satisfied
;;; after all built-in lexical entries have been processed, the remaining POS
;;; categories are processed by the regular `posmapping' look-up. note that,
;;; as a side effect, an empty `type-to-pos' map will always result in having
;;; all generic lexical entries activated (modulo the filter described above),
;;; even for input tokens that were found in the native lexicon.
;;;
#|
pos-completion.
type-to-pos :=
basic_noun_word NN
basic_noun_word NNS
basic_noun_word NNP
basic_pronoun_word NN
basic_pronoun_word NNS
basic_pronoun_word NNP
.
|#
The approach to unknown word handling sketched above has several undesirable properties (that we discovered through the years).
First, the method of detecting lexical gaps right after lexical look-up, and of activating generic entries only in chart positions with apparent gaps, is failure-prone in two ways. On the one hand, lexical look-up is performed on the basis of stems that were hypothesized after the first phase of orthographemic processing (i.e. morphological analysis). In this phase, an input token UPS will be analyzed as the candidate stem up, combined with either the nominal plural or the verbal present tense, third person singular inflectional rule. If we assume that the grammar contains only the prepositional lexical entry for up, then both morphological analyses should actually be considered lexical gaps: in lexical parsing, the inflectional rules will ultimately fail to apply to the preposition. If we assume, on the other hand, that the grammar contains a verbal stem up, then there is no lexical gap in a technical sense. However, there may still be incomplete lexical coverage, where instead of the present tense verb form we would rather require a (generic) proper name to parse successfully. The original approach to unknown words in PET has no satisfactory way of addressing either problem.
For these reasons, the order and nature of lexical processing have been re-worked in the so-called chart mapping parsing mode for cheap. [http://www.lrec-conf.org/proceedings/lrec2008/summaries/349.html Adolphs, et al. (2008)] discuss this approach more generally, but in a nutshell this mode defers gap detection and the decision on which generics to use until after lexical processing is complete, i.e. the application of lexical rules has reached a fix-point. This universe is activated by the command-line option -default-les=all and should typically be combined with -chart-mapping. In this mode, the processing of native lexical entries is unchanged, but generics are treated differently: for each input token, all generics are always activated.
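Note that in this set-up the selection among generic entries is no longer driven by the posmapping and suffix settings described above; instead, it is under the control of rules provided by the grammar itself (the token mapping and lexical filtering rules of the chart mapping machinery), which can take arbitrary token properties, including PoS tags, into account.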
YY input mode was first developed in 2000 and has undergone three revisions since. YY input mode revision 0.0 was a purely internal version that is no longer supported. Since 2001, YY 1.0 has been in active use and is still fully supported. The format described above, and the example given from the ERG, use YY 2.0, a conservative, backwards-compatible extension made in January 2009. Compared to YY 1.0, only the optional link field was added, i.e. the ability to provide information about external surface positions. It appears, however, that the PET-internal treatment of YY input tokens was changed in an (in principle) non-backwards-compatible way sometime around 2003 or 2004, where the start and end fields (in YY 1.0 format) were re-interpreted as external surface links, viz. character positions, much like the new from and to values in the YY 2.0 extension. No real damage was observed from this change (because interpreting chart vertices as character positions, and later re-computing chart vertices from the resulting lattice topology, should usually arrive at an identical lattice), but as of early 2009 it is recommended to adapt external providers of YY input to PET to the richer YY 2.0 format.
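For developers adapting an external provider to YY 2.0, the following minimal Python sketch (our own illustration, not part of PET or the ERG; all names are made up) shows one way of emitting token lines in the format described above:
def yy_token(tid, start, end, cfrom, cto, form, tags=()):
    # one YY 2.0 token: a single path (1), built-in morphology
    # ("null" lrule, ipos 0), and optional PoS tag/probability pairs
    pos = "".join(' "%s" %.4f' % (tag, p) for tag, p in tags)
    if pos:
        pos = "," + pos
    return '(%d, %d, %d, <%d:%d>, 1, "%s", 0, "null"%s)' \
           % (tid, start, end, cfrom, cto, form, pos)

# all tokens of one utterance go on a single line, separated by whitespace
line = " ".join([yy_token(42, 0, 1, 0, 11, "Tokenization",
                          [("NNP", 0.7677), ("NN", 0.2323)]),
                 yy_token(43, 1, 2, 12, 12, ",", [(",", 1.0)])])
print(line)
Run as written, this reproduces the first two tokens of the ERG sample input above.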
Alternate, lattice-based input modes are available using XML markup to encode the parser input. See the PetInputChart and SmafTop pages for the so-called PIC and SMAF mode, respectively.