ErgTokenization

StephanOepen edited this page Feb 14, 2009 · 17 revisions

Overview

Aiming for a balance of linguistic precision and broad coverage, the [http://www.delph-in.net/erg English Resource Grammar] (ERG) includes detailed analyses of punctuation and a wide variety of 'text-level' phenomena (e.g. various formats for temporal and numeric expressions). The grammar makes specific assumptions about tokenization, and applying it successfully requires understanding and respecting these assumptions. In early 2009, the ERG approach to tokenization underwent a major revision; this page aims to spell out the basic assumptions, the specific decisions made, and the technology used in preparing input text for parsing with the ERG.

This page was predominantly authored by StephanOepen, who developed the current ERG approach to tokenization jointly with DanFlickinger. As of early 2009, Stephan is the maintainer of the ERG tokenizer and token mapping rules. Please do not make substantial changes to this page unless you (a) are reasonably sure of the technical correctness of your revisions and (b) believe strongly that your changes are compatible with the general design and recommended use patterns for the ERG, and with the goals of this page.

String-Level Pre-Processing and Initial Tokenization

Token Mapping

Unknown Word Handling
