refactoring of special tokens #4

Open
GoogleCodeExporter opened this issue Aug 10, 2015 · 0 comments

Special tokens (class-based emission probabilities) are an important feature
of hunpos and TnT.

For each of the following regular expressions, hunpos learns a separate tag
distribution from the training corpus. This gives more reliable estimates for
open-class items, such as numbers, that were not seen during training:

^[0-9]+$ 
^[0-9]+\.$      
^[0-9.,:-]+[0-9]+$
^[0-9]+[a-zA-Z]{1,3}$ 

At tagging time, if a word is not found in the lexicon (numerals are added
to the lexicon like all other items), hunpos checks whether the unseen word
matches one of the regexps and, if so, uses the tag distribution learned for
that regexp to guess the tag.
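
The procedure described above can be sketched roughly as follows. This is a Python illustration, not the code in special_tokens.ml. The class names (`digits`, `digits_dot`, ...), the function names, and the plain relative-frequency estimate are assumptions made for the example. The patterns overlap (for instance "123" matches both the first and the third), and the issue does not say how hunpos resolves that, so the sketch assumes the first matching pattern wins.

```python
import re
from collections import Counter, defaultdict

# The four patterns from the issue, verbatim. The class names are hypothetical.
SPECIAL_TOKEN_PATTERNS = [
    ("digits", re.compile(r"^[0-9]+$")),
    ("digits_dot", re.compile(r"^[0-9]+\.$")),
    ("digits_punct", re.compile(r"^[0-9.,:-]+[0-9]+$")),
    ("digits_suffix", re.compile(r"^[0-9]+[a-zA-Z]{1,3}$")),
]


def special_class(word):
    """Return the name of the first pattern that matches, or None."""
    for name, pattern in SPECIAL_TOKEN_PATTERNS:
        if pattern.match(word):
            return name
    return None


def train(tagged_corpus):
    """Collect per-word and per-class tag counts from (word, tag) pairs."""
    lexicon = defaultdict(Counter)
    class_tags = defaultdict(Counter)
    for word, tag in tagged_corpus:
        # Numerals enter the lexicon like any other word.
        lexicon[word][tag] += 1
        cls = special_class(word)
        if cls is not None:
            class_tags[cls][tag] += 1
    return lexicon, class_tags


def tag_distribution(word, lexicon, class_tags):
    """Relative tag frequencies, falling back to the special-token class."""
    if word in lexicon:
        counts = lexicon[word]
    else:
        # Unseen word: use the counts of its special-token class, if any.
        counts = class_tags.get(special_class(word), Counter())
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()} if total else {}
```

With a toy corpus such as `[("1999", "CD"), ("3.", "ORD")]`, an unseen "2007" would fall into the same class as "1999" and inherit its tag counts, while an unseen word that matches no pattern gets an empty distribution here and would have to be handled by some other guesser.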

Currently these regexps are hardcoded in the special_tokens.ml file. We need
very fast regexp matching, or something like transducers.
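
One possible direction, sketched in Python rather than OCaml: since all four patterns are built from three character classes, they can be checked with a hand-written scan instead of a regexp engine. The function name and the returned class names are hypothetical, the first-match ordering is an assumption, and whether this is actually faster than compiled regexps has not been measured.

```python
import string

DIGITS = frozenset(string.digits)
NUMERIC_PUNCT = frozenset(".,:-")
ASCII_LETTERS = frozenset(string.ascii_letters)


def special_class_scan(word):
    """Classify a token against the four patterns without using regexps.

    Returns a (hypothetical) class name, or None if no pattern matches.
    Patterns are tried in the order they are listed in the issue.
    """
    n = len(word)

    # Count the leading run of ASCII digits.
    lead = 0
    while lead < n and word[lead] in DIGITS:
        lead += 1

    # ^[0-9]+$ : the whole (non-empty) word is digits.
    if lead == n:
        return "digits" if n > 0 else None

    rest = word[lead:]  # non-empty, starts with a non-digit

    # ^[0-9]+\.$ : digits followed by a single final dot.
    if lead > 0 and rest == ".":
        return "digits_dot"

    # ^[0-9.,:-]+[0-9]+$ : at least two characters, all from the numeric
    # class, ending in a digit. The leading digits are already checked.
    if n >= 2 and word[-1] in DIGITS and all(
        c in DIGITS or c in NUMERIC_PUNCT for c in rest
    ):
        return "digits_punct"

    # ^[0-9]+[a-zA-Z]{1,3}$ : digits followed by one to three ASCII letters.
    if lead > 0 and len(rest) <= 3 and all(c in ASCII_LETTERS for c in rest):
        return "digits_suffix"

    return None
```

The same logic could be expressed as a small table-driven automaton, which would also make it easier to load the classes from a configuration file instead of hardcoding them.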

Original issue reported on code.google.com by hala...@gmail.com on 30 Jun 2007 at 11:54
