
PetInput

StephanOepen edited this page Jan 25, 2009 · 53 revisions

Overview

This page discusses some of the available input formats to the PET parser cheap, viz. 'pure' textual input and the so-called YY mode for lattice-based input. These two modes of giving input to the parser are the most traditional ones, but in more recent developments, additional XML-based input formats have been developed. Please see the PetInputChart and SmafTop pages for alternative, lattice-based XML input modes.

Textual, Line-Oriented Input

By default, cheap expects plain text input, one sentence (or, more generally, utterance) per line. The parser applies a very simple-minded tokenizer, breaking the input string into tokens at all occurrences of whitespace. There are a few quirks and configuration options for this input mode, e.g. the ability to convert LaTeX-style accented characters into Unicode characters, or the historic, so-called LinGO tokenizer, which tries to handle contracted auxiliaries in (what in the 1990s seemed like) the proper manner.
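As a rough approximation (this is an illustrative sketch, not PET's actual implementation), the default tokenization amounts to:

```python
# Minimal sketch of cheap's default textual tokenization: break the
# input string into tokens at all occurrences of whitespace.
def tokenize(utterance: str) -> list[str]:
    return utterance.split()

print(tokenize("Tokenization , a non-trivial exercise ."))
```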

Punctuation characters, as specified in the settings file, are ignored by PET (removed from the input chart) for pure, textual input.

Here is an example of the punctuation characters found in pet/japanese.set:

  punctuation-characters := "\"!&'()*+,-−./;<=>?@[\]^_`{|}~。?…., ○●◎*".

Note that punctuation-characters are defined separately for the LKB (typically in lkb/globals.lsp) and that, in recent years, grammars are moving towards inclusion of punctuation marks in the syntactic analysis.

Punctuation characters are not removed from the other input modes (YY mode, PET Input Chart, or MAF). Rather, in these modes they should be removed (or treated otherwise, as appropriate) by the preprocessor that created the token lattice (in whatever syntax) provided to PET.
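The effect of the punctuation-characters setting in textual mode can be sketched as follows (an approximation of the behaviour described above, not PET's actual code; the character set here is an abbreviated ASCII subset of the japanese.set example):

```python
# Sketch of punctuation removal in pure textual input mode: strip all
# configured punctuation characters from each token, and drop tokens
# that become empty.  The set below abbreviates the japanese.set example.
PUNCTUATION = set("\"!&'()*+,-./;<=>?@[]^_`{|}~")

def strip_punctuation(tokens):
    out = []
    for token in tokens:
        stripped = "".join(c for c in token if c not in PUNCTUATION)
        if stripped:
            out.append(stripped)
    return out

print(strip_punctuation(["Tokenization", ",", "a", "non-trivial", "exercise", "."]))
```

Note how a character set that includes the hyphen also rewrites "non-trivial"; this kind of interaction is one reason grammars have been moving punctuation into the syntactic analysis proper.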

YY Input Mode

YY input mode (activated by the -yy option) facilitates parsing from a partial (lexical) chart, i.e. it assumes that tokenization (and other text-level pre-processing) has been performed outside of cheap. YY input mode supports token-level ambiguity, multi-word tokens, some control over what PET should do for morphological analysis, the use of PoS tags on input tokens to enable (better) unknown word handling, and generally feeding a word graph (as, for example, obtained from a speech recognizer) into the parser.

Following is a discussion of the YY [http://svn.delph-in.net/erg/trunk/pet/sample.yy input example] provided with the ERG (as of early 2009). In this example, the words are shown on separate lines for clarity. In the actual input given to PET, all YY tokens must appear as a single line (terminated by newline), as each line of input is processed as a separate utterance.

  (42, 0, 1, <0:11>, 1, "Tokenization", 0, "null", "NNP" 0.7677 "NN" 0.2323)
  (43, 1, 2, <12:12>, 1, ",", 0, "null", "," 1.0000)
  (44, 2, 3, <14:14>, 1, "a", 0, "null", "DT" 1.0000)
  (45, 3, 4, <16:26>, 1, "non-trivial", 0, "null", "JJ" 1.0000)
  (46, 4, 5, <28:35>, 1, "exercise", 0, "null", "NN" 0.9887 "VB" 0.0113)
  (47, 5, 6, <36:36>, 1, ",", 0, "null", "," 1.0000)
  (48, 6, 7, <38:43>, 1, "bazed", 0, "null", "VBD" 0.5975 "VBN" 0.4025)
  (49, 7, 8, <45:57>, 1, "oe@ifi.uio.no", 0, "null", "NN" 0.7342 "JJ" 0.2096)
  (50, 8, 9, <58:58>, 1, ".", 0, "null", "." 1.0000)

An input in this form can be processed by PET as follows:

  cheap -yy -packing -verbose=4 -mrs \
    -chart-mapping -default-les=all english.grm < pet/sample.yy

where -yy (a shorthand for -tok=yy) turns on YY partial chart input mode and we request ambiguity packing (which is always a good idea), some verbosity of tracing, and the output of MRSs. The additional options enable chart mapping (see [http://www.lrec-conf.org/proceedings/lrec2008/summaries/349.html Adolphs, et al. (2008)]) and turn the unknown word machinery into 2008 mode (see the section Unknown Word Handling below). Note that these options, as of early 2009, are only supported in the so-called chart mapping [https://pet.opendfki.de/repos/pet/branches/cm branch] of the PET code base (corresponding pre-compiled binaries are available in the LOGON tree; see the LogonTop page).

Each token in the above example has the following format:

  (id, start, end, [link,] path+, form [surface], ipos, lrule+[, {pos p}+])

i.e. each token has a unique identifier and start and end vertex. Optionally, tokens can be annotated with a surface link, an indication of underlying string positions in the original document; currently (as of January 2009), link information is only supported as character positions, in the format <from:to> (but in principle, link could have other forms, with from and to being arbitrary strings, e.g. stand-off pointers in whatever underlying markup). We will ignore the path component (membership in one or more paths through a word lattice) for our purposes.

The actual token string is provided by the form field, and this is what PET uses for morphological analysis and lexical look-up. In case the form does not correspond to the original string in the document, e.g. because there was some textual normalization prior to creation of YY tokens already, the optional surface field can be used to record the original string. Until early 2009, the ERG had inherited a mechanism called ersatzing where a set of regular expressions were applied prior to parsing, associating for example a form value of EmailErsatz with a surface value of oe@yy.com. In the newer, chart mapping universe, the ERG no longer makes use of this facility and instead makes it a policy to never 'mess' with the actual token string (but use other token properties instead).

YY mode can be used in two variants regarding morphological analysis. Our example above leaves morphological analysis to PET, i.e. using the lexical rules and orthographemic annotation provided by the grammar. This built-in morphology mode is activated by an lrule value of "null", and the ipos field is ignored (but still has to be given, conventionally as 0). Another option is to provide information about morphological segmentation as part of the input tokens, in which case ipos specifies the position to which orthographemic rules apply, and one or more lrule values (as strings) name lexical rules provided by the grammar.
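As a purely hypothetical illustration of the second variant (the rule name v_pst_olr and the ipos value here are invented, not taken from any actual grammar), a token with externally supplied morphological segmentation might look like:

  (48, 6, 7, <38:43>, 1, "bazed", 1, "v_pst_olr", "VBD" 1.0000)

i.e. instead of "null", the lrule field names a past-tense lexical rule that PET should apply, and ipos indicates where orthographemic rules attach.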

Finally, each token can be annotated with an optional sequence of tag plus probability pairs. The ERG, for example, includes a set of underspecified generic lexical entries which can be activated on the basis of PoS information, obtained for example from running a PoS tagger prior to parsing. We used to include the probabilities in (heuristic) parse ranking, but since sometime in 2002 (when MaxEnt parse selection became available in PET) they are just ignored.

YY input mode supports a genuine token lattice, i.e. it is legitimate to have multiple tokens for an input position, or tokens spanning multiple positions.
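For instance, a lattice fragment with two alternative tokens over the same span (identifiers and the second reading invented for illustration) could look like:

  (60, 3, 4, <16:26>, 1, "non-trivial", 0, "null", "JJ" 1.0000)
  (61, 3, 4, <16:26>, 1, "nontrivial", 0, "null", "JJ" 1.0000)

Both tokens share start vertex 3 and end vertex 4, so the parser will consider both readings for that chart span.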

Unknown Word Handling

As of early 2009, there are two modes of detecting and handling unknown words, i.e. input tokens for which no native lexical entry is available. Common to both modes is their use of underspecified, so-called generic lexical entries. In a nutshell, these entries are instantiated for gaps in the lexical chart, i.e. input positions for which no native entries were found. The variation in different modes of unknown word handling relates to (a) how lexical gaps are detected and (b) the selection of which generic entries to instantiate.

Unknown word handling is activated by the command-line option -default-les. For this option to take effect, the grammar has to provide one or more lexical entries marked as generic, by means of their TDL status value. For example, the ERG includes the following declaration (in pet/common.set):

  generic-lexentry-status-values := generic-lex-entry.

Actual generic entries are defined in the ERG file [http://svn.delph-in.net/erg/trunk/gle.tdl], which is loaded (in the top-level grammar file english.tdl) as follows:

  :begin :instance :status generic-lex-entry.
  :include "gle".
  :end :instance.

When -default-les is turned on without additional settings, all generic entries will be activated for each lexical gap; in other words, there is no control over which entries are used at each gap position, and it is left to the larger syntactic context to determine the category of the unknown token(s). With inputs exhibiting a non-trivial proportion of unknown words, this approach can lead to massive lexical and syntactic ambiguity and, in the worst case, may be computationally intractable.

Between around 2002 and 2008, the ERG had the ability to use an external PoS tagger to selectively activate generic entries; this mode of operation assumes that input tokens are decorated with one or more PoS tags (as in our example above), and that the grammar provides a mapping from PoS tags to the identifiers of generic lexical entries. This mapping can be provided by the posmapping declaration in one of the settings files, for example (from older versions of the ERG):

  posmapping := 
    JJ $generic_adj
    JJR $generic_adj_compar
    JJS $generic_adj_superl
    CD $generic_number
    NN $generic_mass_count_noun
    NNS $generic_pl_noun
    NNPS $generic_pl_noun
    NNP $genericname
    FW $generic_mass_noun
    RB $generic_adverb
    VB $generic_trans_verb_bse
    VBD $generic_trans_verb_past
    VBG $generic_trans_verb_prp
    VBN $generic_trans_verb_psp
    VBP $generic_trans_verb_presn3sg
    VBZ $generic_trans_verb_pres3sg
  .
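The effect of such a mapping can be sketched as follows (a simplification, not PET's implementation; in PET the mapping is applied per lexical gap, using the tags carried by the input token):

```python
# Sketch of posmapping-style selection: for a token that constitutes a
# lexical gap, activate only the generic entries mapped from its PoS
# tags.  Entry names mirror (a subset of) the ERG mapping shown above.
POSMAPPING = {
    "JJ": "generic_adj",
    "NN": "generic_mass_count_noun",
    "NNP": "genericname",
    "VBD": "generic_trans_verb_past",
}

def generic_entries(tags):
    # tags: list of (tag, probability) pairs from the YY token
    return [POSMAPPING[tag] for tag, _ in tags if tag in POSMAPPING]

print(generic_entries([("VBD", 0.5975), ("VBN", 0.4025)]))
```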

To further constrain the postulation of generic lexical entries, cheap provides two optional filtering mechanisms (both somewhat ad-hoc). The first of these can be used to impose suffix constraints on the actual token string giving rise to a generic lexical entry. For example (again from older ERG revisions):

  generic-le-suffixes := 
    $generic_trans_verb_pres3sg "S" 
    $generic_trans_verb_past "ED" 
    $generic_trans_verb_psp "ED" 
    $generic_trans_verb_prp "ING" 
    $generic_pl_noun "S"
  .
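In effect, the suffix filter amounts to something like the following (a sketch under the assumption that matching is case-insensitive; PET's actual behaviour may differ in detail):

```python
# Sketch of the generic-le-suffixes filter: a generic entry is only
# instantiated for a lexical gap when the token form ends in the
# required suffix; entries without a constraint are always allowed.
SUFFIXES = {
    "generic_trans_verb_pres3sg": "s",
    "generic_trans_verb_past": "ed",
    "generic_trans_verb_psp": "ed",
    "generic_trans_verb_prp": "ing",
    "generic_pl_noun": "s",
}

def suffix_allows(entry, form):
    suffix = SUFFIXES.get(entry)
    return suffix is None or form.lower().endswith(suffix)

print(suffix_allows("generic_trans_verb_past", "bazed"))
```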

But this approach interoperates poorly with the ERG treatment of punctuation (as pseudo-affixes), which was introduced sometime around 2005.

Another configuration mechanism can be used to let PoS tags augment native lexical entries, i.e. attempting to address incomplete lexical coverage, say a use of the word bus as a verb where the native lexicon only provides a nominal reading. However, recent developments have made this configuration obsolete too (it was never really used in production anyway), so it shall suffice to 'document' it by means of the comments from the file pet/common.set in earlier ERG revisions:

  ;;;
  ;;; the setting `pos-completion' enables an additional mechanism to do with
  ;;; processing of generic lexical entrie: whenever we receive POS information
  ;;; as part of the input, we check to see whether the built-in lexical entries
  ;;; suffice to satisfy the POS annotations: each lexical entry retrieved for an
  ;;; input token 
  ;;;
  ;;;   <string, pos_1, pos_2, pos_3> 
  ;;;
  ;;; is mapped to an application-specific POS tag, using the `type-to-pos' map,
  ;;; and checking the type of each lexical entry for subsumption against the
  ;;; left-hand side of each `type-to-pos' rule.  some or all POS annotations
  ;;; from the input may be `satisfied' under this mapping by built-in lexical
  ;;; entries, e.g. for the example above, there may be lexical entries whose
  ;;; type maps to `pos_1' and `pos_3'; unless all POS annotations are satisfied
  ;;; after all built-in lexical entries have been processed, the remaining POS
  ;;; categories are processed by the regular `posmapping' look-up.  note that,
  ;;; as a side effect, an empty `type-to-pos' map will always result in having
  ;;; all generic lexical entries activated (modulo the filter described above),
  ;;; even for input tokens that were found in the native lexicon.
  ;;;
  #|
  pos-completion.
  type-to-pos :=
    basic_noun_word NN
    basic_noun_word NNS
    basic_noun_word NNP
    basic_pronoun_word NN
    basic_pronoun_word NNS
    basic_pronoun_word NNP
  .
  |#

History and Alternate Lattice-Based Input Modes

YY input mode was first developed in 2000 and has gone through three revisions. YY 0.0 was a purely internal version that is no longer supported. Since 2001, YY 1.0 has been in active use and remains fully supported. The format described above, and the example given from the ERG, use YY 2.0, a conservative, backwards-compatible extension made in January 2009. Compared to YY 1.0, only the optional link field was added, i.e. the ability to provide information about external surface positions. It appears, however, that the PET-internal treatment of YY input tokens was changed in a (in principle) non-backwards-compatible way sometime around 2003 or 2004, when the start and end fields (in the YY 1.0 format) were re-interpreted as external surface links, viz. character positions, much like the new from and to values in the YY 2.0 extension. No real damage was observed from this change (because interpreting chart vertices as character positions, and later re-computing chart vertices from the resulting lattice topology, should usually arrive at an identical lattice), but as of early 2009 it is recommended that external providers of YY input to PET adapt to the richer YY 2.0 format.

Alternate, lattice-based input modes are available using XML markup to encode the parser input. See the PetInputChart and SmafTop pages for the so-called PIC and SMAF mode, respectively.
