Statistical Learning
An alternative, or even complementary, method of handling sentences based on parsing templates is to use statistical methods from machine learning. See
- https://en.wikipedia.org/wiki/Machine_learning
- https://en.wikipedia.org/wiki/Natural_language_processing
- https://en.wikipedia.org/wiki/Bag-of-words_model
- https://en.wikipedia.org/wiki/Word2vec
- https://en.wikipedia.org/wiki/Deep_learning
The idea is that if a sentence fails to match any regex from any template, it may be better to handle it using statistical methods.
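This routing idea could be sketched as follows. The template patterns and handler names below are illustrative assumptions, not taken from the project; the point is only the control flow of trying templates first and falling back to a statistical model.

```python
import re

# Hypothetical templates: (pattern, handler name). The real project would
# have its own template set; these two are assumptions for illustration.
TEMPLATES = [
    (re.compile(r"^(\w+)s are (\w+)$", re.IGNORECASE), "is_handler"),
    (re.compile(r"^(\w+)s have (\w+)$", re.IGNORECASE), "has_handler"),
]

def route(sentence):
    """Return the handler named by the first matching template,
    or None to signal that the statistical fallback should run."""
    for pattern, handler in TEMPLATES:
        if pattern.match(sentence.strip()):
            return handler
    return None  # no template matched: hand the sentence to the model
```

For example, `route("Dogs are black")` returns `"is_handler"`, while a sentence no template covers returns `None` and would be passed to the statistical classifier.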
First, a model would need to be trained using supervised learning, where a label is associated with each training sentence. That is, we have a dataset of pairs of the form (sentence, handler). For example:
"Dogs are black", is_handler
"Dogs have tails", has_handler
"My dog is black", color_handler
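A minimal sketch of this supervised approach, using the tiny dataset above: a bag-of-words naive Bayes classifier written from scratch so the example stays dependency-free. In practice a library such as scikit-learn would do this with far less code; everything here is an assumed illustration, not the project's actual model.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training set of (sentence, handler) pairs, as in the text.
TRAINING = [
    ("Dogs are black", "is_handler"),
    ("Dogs have tails", "has_handler"),
    ("My dog is black", "color_handler"),
]

def tokenize(sentence):
    return sentence.lower().split()

def train(examples):
    word_counts = defaultdict(Counter)  # handler -> word frequencies
    doc_counts = Counter()              # handler -> number of sentences
    vocab = set()
    for sentence, handler in examples:
        tokens = tokenize(sentence)
        word_counts[handler].update(tokens)
        doc_counts[handler] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab

def classify(sentence, word_counts, doc_counts, vocab):
    total_docs = sum(doc_counts.values())
    best_handler, best_score = None, -math.inf
    for handler, docs in doc_counts.items():
        # log prior + log likelihood with Laplace (add-one) smoothing
        score = math.log(docs / total_docs)
        total_words = sum(word_counts[handler].values())
        for token in tokenize(sentence):
            count = word_counts[handler][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_handler, best_score = handler, score
    return best_handler
```

With this toy model, `classify("Dogs have fur", *train(TRAINING))` picks `"has_handler"`, since "dogs" and "have" both appear in that handler's training sentence.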
Note that in the above example, the last sentence is ambiguous and could be handled by both "is_handler" and "color_handler". This is by design: both handlers should be able to parse and handle this sentence correctly. Another way to see it is that "color_handler" is simply a more specialized form of "is_handler". In fact, "is_handler" could determine whether the attribute is a color and then delegate to "color_handler".
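The delegation idea could look like the sketch below: "is_handler" inspects the attribute and, when it is a color, hands off to the more specialized "color_handler". The handler signatures, return values, and the color list are assumptions for illustration only.

```python
# Assumed set of recognized color words; a real system would need a
# much richer lexicon or a classifier for this check.
KNOWN_COLORS = {"black", "white", "brown", "grey"}

def color_handler(subject, attribute):
    # Specialized handler: records a color attribute.
    return f"{subject} has color {attribute}"

def is_handler(subject, attribute):
    # General handler: delegates to the specialist when it recognizes
    # the attribute as a color, otherwise handles it generically.
    if attribute.lower() in KNOWN_COLORS:
        return color_handler(subject, attribute)
    return f"{subject} is {attribute}"
```

For example, `is_handler("dog", "black")` routes through `color_handler`, while `is_handler("dog", "friendly")` is handled generically.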
Having specialized sub-handlers, such as "color_handler", gives the statistical model more information to work with. On the other hand, if "is_handler" can already parse this type of sentence, then the extra label may be needlessly redundant, complicating both the model and the training datasets.
At this point, it is not clear which approach is better.