refactoring of special tokens #4

GoogleCodeExporter · 2015-08-10T06:55:34Z

Special tokens (class-based emission probabilities) are an important feature
of hunpos and TnT.

For each of the following regular expressions, hunpos learns a separate tag
distribution from the training corpus, so that it can give more reliable
estimates for open-class items, such as numbers, that were unseen during training:

^[0-9]+$ 
^[0-9]+\.$      
^[0-9.,:-]+[0-9]+$
^[0-9]+[a-zA-Z]{1,3}$ 

Then, at tagging time, if a word is not found in the lexicon
(numerals are added to the lexicon like all other items), hunpos checks
whether the unseen word matches one of these regexps and, if so, uses the
distribution learned for that regexp to guess the tag.
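
The learn-then-back-off scheme described above could be sketched roughly as follows. This is a minimal illustration, not the code in special_tokens.ml: the type `token_class`, the functions `classify`, `observe` and `guess`, and the rule that the first matching pattern wins are all assumptions made for the example. For brevity `guess` returns only the most frequent tag of a class, whereas the description above speaks of the whole learned distribution.

```ocaml
(* Hypothetical names throughout; this is not the code in special_tokens.ml. *)
type token_class = Digits | DigitsDot | NumericMixed | DigitsAlpha

let is_digit c = c >= '0' && c <= '9'
let is_alpha c = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
let is_num_char c = is_digit c || c = '.' || c = ',' || c = ':' || c = '-'

(* True if p holds for every character of s at positions i up to j - 1. *)
let all_between p s i j =
  let rec go k = k >= j || (p s.[k] && go (k + 1)) in
  go i

(* Length of the longest prefix of s whose characters all satisfy p. *)
let prefix_len p s =
  let n = String.length s in
  let rec go k = if k < n && p s.[k] then go (k + 1) else k in
  go 0

(* Try the four patterns in the order they are listed; first match wins
   (an assumption of this sketch). *)
let classify w =
  let n = String.length w in
  if n = 0 then None
  else if all_between is_digit w 0 n then
    Some Digits                              (* ^[0-9]+$ *)
  else if n >= 2 && w.[n - 1] = '.' && all_between is_digit w 0 (n - 1) then
    Some DigitsDot                           (* ^[0-9]+\.$ *)
  else if n >= 2 && is_digit w.[n - 1] && all_between is_num_char w 0 n then
    Some NumericMixed                        (* ^[0-9.,:-]+[0-9]+$ *)
  else
    let d = prefix_len is_digit w in
    let rest = n - d in
    if d >= 1 && rest >= 1 && rest <= 3 && all_between is_alpha w d n then
      Some DigitsAlpha                       (* ^[0-9]+[a-zA-Z]{1,3}$ *)
    else None

(* Training: count how often each tag occurs with each token class.
   counts has type (token_class * string, int) Hashtbl.t. *)
let observe counts word tag =
  match classify word with
  | None -> ()
  | Some c ->
    let key = (c, tag) in
    let n = try Hashtbl.find counts key with Not_found -> 0 in
    Hashtbl.replace counts key (n + 1)

(* Tagging: for a word missing from the lexicon, return the most frequent
   tag seen for its class, or None if no pattern matches or the class was
   never observed. Ties are broken arbitrarily. Needs OCaml 4.08 or later
   for Option.map. *)
let guess counts word =
  match classify word with
  | None -> None
  | Some c ->
    Hashtbl.fold
      (fun (c', tag) n best ->
         if c' <> c then best
         else
           match best with
           | Some (_, m) when m >= n -> best
           | _ -> Some (tag, n))
      counts None
    |> Option.map fst
```

In use, `observe` would be called once per (word, tag) pair of the training corpus, and `guess` only after the lexicon lookup for a word has failed.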

Currently these regexps are hardcoded in special_tokens.ml. We need some very
fast regexp matching, or something like transducers.
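
One possible direction, given as a sketch only: the four patterns are small enough to be compiled by hand into a single deterministic automaton that reads each word once, with no regexp library involved. The state names and the `classify` function below are hypothetical, and the sketch assumes that a word matching several patterns should get the first one in the list (so a plain digit string counts as ^[0-9]+$ rather than ^[0-9.,:-]+[0-9]+$).

```ocaml
(* Hypothetical hand-written automaton; not the code in special_tokens.ml. *)
type token_class = Digits | DigitsDot | NumericMixed | DigitsAlpha

type state =
  | Start
  | InDigits        (* one or more digits so far *)
  | AfterDot        (* digits followed by a single dot *)
  | MixedDigit      (* digits and punctuation, last character a digit *)
  | MixedPunct      (* digits and punctuation, last character punctuation *)
  | Letters of int  (* digits followed by this many letters *)
  | Dead            (* no pattern can match any more *)

(* One transition of the automaton on character c. *)
let step st c =
  let digit = c >= '0' && c <= '9' in
  let punct = c = '.' || c = ',' || c = ':' || c = '-' in
  let alpha = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') in
  match st with
  | Start ->
    if digit then InDigits else if punct then MixedPunct else Dead
  | InDigits ->
    if digit then InDigits
    else if c = '.' then AfterDot
    else if punct then MixedPunct
    else if alpha then Letters 1
    else Dead
  | AfterDot | MixedDigit | MixedPunct ->
    if digit then MixedDigit else if punct then MixedPunct else Dead
  | Letters k -> if alpha && k < 3 then Letters (k + 1) else Dead
  | Dead -> Dead

(* Run the automaton over the word, stopping early once it is dead, and
   map the final state to a token class. *)
let classify w =
  let n = String.length w in
  let rec run st i =
    if st = Dead || i = n then st else run (step st w.[i]) (i + 1)
  in
  match run Start 0 with
  | InDigits -> Some Digits           (* ^[0-9]+$ *)
  | AfterDot -> Some DigitsDot        (* ^[0-9]+\.$ *)
  | MixedDigit -> Some NumericMixed   (* ^[0-9.,:-]+[0-9]+$ *)
  | Letters _ -> Some DigitsAlpha     (* ^[0-9]+[a-zA-Z]{1,3}$ *)
  | Start | MixedPunct | Dead -> None
```

The cost is that the patterns stay hardcoded, only in a different form; making them configurable would need a generator that builds such a table from a pattern file.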

Original issue reported on code.google.com by hala...@gmail.com on 30 Jun 2007 at 11:54


GoogleCodeExporter added Type-Enhancement auto-migrated Priority-High labels Aug 10, 2015
