Skip to content

Statical learning: Datasets

Phil Paradis edited this page Sep 18, 2016 · 1 revision

The main drawback of statistical learning is that it requires huge quantities of data before it becomes useful and sufficiently smart for our purposes.

It will be nearly impossible to produce all the required data by hand, hence the need to automate as much of the work as possible.

Here, we need to identify sources of datasets that can be used without too much effort to train the statistical model.

Datasets

English Grammar Databases

Use English Grammar dictionaries. Look for the word "is" and copy all the usage examples with label "is_handling". Look for the word "has" and copy all the usage examples with label "has_handling" and so on and so forth.

One could go a degree further and look at all synonyms of "is", then for each close synonym found, look at that word's grammar usage examples.