Datasets in our Sample

The datasets are produced with the following commands:

$ python check-data.py aa cutout 180 write coverage
$ python check-data.py an cutout 175 write coverage
$ python check-data.py ie cutout 180 write coverage
$ python check-data.py pn cutout 160 write coverage
$ python check-data.py st cutout 100 write coverage

The script cleans each dataset by replacing problematic characters (Unicode lookalikes, etc.) and keeping only those languages that meet the required coverage. The coverage threshold is set by the "cutout" parameter: any language that has counterparts for fewer than the given number of concepts is discarded.
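The cutout rule can be illustrated with a minimal sketch. This is not the code in check-data.py: the function name `filter_by_coverage`, the dictionary layout (language name mapped to the set of concepts it has words for), and the toy data are all assumptions made for illustration.

```python
# Hypothetical sketch of the "cutout" rule; not taken from check-data.py.
# Assumed layout: language name -> set of concepts with at least one word.

def filter_by_coverage(wordlist, cutout):
    """Keep only languages attested for at least `cutout` concepts."""
    return {
        language: concepts
        for language, concepts in wordlist.items()
        if len(concepts) >= cutout
    }


# Toy data, invented for this example.
toy = {
    "LangA": {"hand", "foot", "eye", "water"},
    "LangB": {"hand", "foot"},
    "LangC": {"hand", "eye", "water"},
}

selected = filter_by_coverage(toy, cutout=3)
print(sorted(selected))
```

With a cutout of 3, LangB is dropped because it covers only two concepts, while a language covering exactly three concepts is kept.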

The following table shows the current statistics for the selected data:

| Dataset | Family | Taxa | Concepts | Concept Coverage | Minimal Mutual Coverage | Average Mutual Coverage |
|---|---|---|---|---|---|---|
| aa-58-200 | Austro-Asiatic | 58 | 200 | 0.95 | 163 | 180 |
| an-45-210 | Austronesian | 45 | 210 | 0.88 | 144 | 165 |
| ie-42-208 | Indo-European | 42 | 208 | 0.97 | 161 | 197 |
| pn-67-183 | Pama-Nyungan | 67 | 183 | 0.94 | 141 | 163 |
| st-64-110 | Sino-Tibetan | 64 | 110 | 0.96 | 90 | 101 |
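
The coverage columns are not defined in this file. The sketch below shows one plausible reading, offered as an assumption rather than as the script's actual computation: concept coverage as the average share of concepts attested per language, and mutual coverage as the number of concepts shared by a pair of languages, reported as the minimum and the mean over all pairs. The function names and toy data are hypothetical.

```python
# Hypothetical sketch; the definitions below are assumptions, not taken
# from check-data.py.
from itertools import combinations


def concept_coverage(wordlist, n_concepts):
    """Average share of the concept list attested per language."""
    attested = sum(len(concepts) for concepts in wordlist.values())
    return attested / (len(wordlist) * n_concepts)


def mutual_coverage_stats(wordlist):
    """Return (minimum, mean) number of concepts shared by language pairs."""
    shared = [len(a & b) for a, b in combinations(wordlist.values(), 2)]
    return min(shared), sum(shared) / len(shared)


# Toy data, invented for this example (concept list of size 4).
toy_languages = {
    "LangA": {"hand", "foot", "eye", "water"},
    "LangB": {"hand", "foot", "eye"},
    "LangC": {"hand", "eye", "water"},
}

coverage = concept_coverage(toy_languages, n_concepts=4)
minimal, average = mutual_coverage_stats(toy_languages)
print(round(coverage, 2), minimal, round(average, 2))
```

Under this reading, a high minimal mutual coverage means that even the least comparable pair of languages still shares many concepts.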