0.1.0

@olzama released this 16 May 14:18
· 26 commits to main since this release

This release is for the version of the SRG-MAL grammar intended for the experiments associated with the GAUSS project. It contains a treebanked portion of the COWSL2H corpus of learner Spanish (Yamada et al. 2020). The treebanked portion consists of 442 sentences, a subset of those annotated by Yamada et al. (2020) for gender agreement errors. These 442 sentences were parsed with the modified version of the Spanish Resource Grammar, which is included here. The modified version contains special rules intended to recognize gender agreement constructions that learners use (e.g. agua frio). We consider 177 of the 442 sentences "grammatical", where "grammatical" allows for the learner usage of gender. The remaining 265 sentences contain other types of "errors" (usages not characteristic of Spanish L1 speakers).

If you run `python util/treebanking-scripts/report_stats.py` on the treebanked corpora, you will get some accuracy numbers. Treat them as approximate: the criteria for deciding which items are marked as grammatical and which as ungrammatical are not yet applied consistently.

We are currently working on the next release, in which there will be another portion of the corpus, hopefully with clearer numbers.

Current numbers:

Sentences up to length 9:

Total accuracy: 63 out of 77 (0.8182)
Total overgeneration: 7 out of 60 (0.1167)

Sentences of length 10-20:

Total accuracy: 62 out of 100 (0.6200)
Total overgeneration: 22 out of 205 (0.1073)
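
The ratios above can be re-derived from the raw counts. The following is a minimal sketch for checking the arithmetic only; it is not the `report_stats.py` script, and the names `REPORTED`, `ratio`, and `pooled` are hypothetical. The accuracy denominators sum to 177 and the overgeneration denominators to 265, which matches the "grammatical" and other-error counts given in the description; reading accuracy as computed over the former and overgeneration over the latter is an assumption, not something these notes state.

```python
# Counts copied from the "Current numbers" list in these release notes.
# Each entry is (hits, total) for one metric in one sentence-length bucket.
REPORTED = {
    "up to 9": {"accuracy": (63, 77), "overgeneration": (7, 60)},
    "10-20": {"accuracy": (62, 100), "overgeneration": (22, 205)},
}


def ratio(hits, total):
    """Return hits/total rounded to four decimal places, as in the notes."""
    return round(hits / total, 4)


def pooled(metric):
    """Sum hits and totals for one metric across both length buckets."""
    hits = sum(bucket[metric][0] for bucket in REPORTED.values())
    total = sum(bucket[metric][1] for bucket in REPORTED.values())
    return hits, total


# Per-bucket ratios, then the pooled ratio over all sentences.
for name, bucket in REPORTED.items():
    for metric, (hits, total) in bucket.items():
        print(f"{name:8} {metric:15} {hits}/{total} = {ratio(hits, total):.4f}")

for metric in ("accuracy", "overgeneration"):
    hits, total = pooled(metric)
    print(f"{'pooled':8} {metric:15} {hits}/{total} = {ratio(hits, total):.4f}")
```

Pooling the two buckets this way gives one figure per metric over all 442 sentences. Like the per-bucket numbers, it inherits the annotation inconsistencies mentioned above.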