Release prep
chrzyki committed Nov 11, 2019
1 parent 56ca0a7 commit 7565c3d
Showing 11 changed files with 701 additions and 285 deletions.
27 changes: 27 additions & 0 deletions FORMS.md
@@ -0,0 +1,27 @@
## Specification of form manipulation


Specification of the value-to-form processing in Lexibank datasets:

The value-to-form processing is divided into two steps, implemented as methods:
- `FormSpec.split`: Splits a string into individual form chunks.
- `FormSpec.clean`: Normalizes a form chunk.

These methods use the attributes of a `FormSpec` instance to configure their behaviour.

- `brackets`: `{'(': ')'}`
Pairs of strings that should be recognized as brackets, specified as `dict` mapping opening string to closing string
- `separators`: `;/,`
Iterable of single-character tokens that should be recognized as word separators
- `missing_data`: `['*', '---', '-']`
Iterable of strings that are used to mark missing data
- `strip_inside_brackets`: `True`
Flag signaling whether to strip content in brackets (**and** strip leading and trailing whitespace)
- `replacements`: `[]`
List of pairs (`source`, `target`) used to replace occurrences of `source` in forms with `target` (before stripping content in brackets)
- `first_form_only`: `False`
Flag signaling whether at most one form should be returned from `split` - effectively ignoring any spelling variants, etc.
- `normalize_whitespace`: `True`
Flag signaling whether to normalize whitespace - stripping leading and trailing whitespace and collapsing multi-character whitespace to single spaces
- `normalize_unicode`: `None`
Unicode normalization form to apply to the input of `split` (`None`, `'NFD'` or `'NFC'`)
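
The two-step processing described above can be illustrated with a minimal sketch. This is not the Lexibank implementation: the class name `FormSpecSketch` is hypothetical, the attribute names and defaults are copied from the list above, and the handling of details the specification leaves open (for example, treating brackets as single characters while splitting) is an assumption.

```python
import re
import unicodedata
from dataclasses import dataclass, field


@dataclass
class FormSpecSketch:
    """Hypothetical stand-in mirroring the attributes listed in FORMS.md."""

    brackets: dict = field(default_factory=lambda: {"(": ")"})
    separators: str = ";/,"
    missing_data: tuple = ("*", "---", "-")
    strip_inside_brackets: bool = True
    replacements: list = field(default_factory=list)
    first_form_only: bool = False
    normalize_whitespace: bool = True
    normalize_unicode: str = None

    def clean(self, form):
        """Normalize a single form chunk."""
        # Replacements are applied before bracketed content is stripped.
        for source, target in self.replacements:
            form = form.replace(source, target)
        if self.strip_inside_brackets:
            for opening, closing in self.brackets.items():
                pattern = re.escape(opening) + ".*?" + re.escape(closing)
                form = re.sub(pattern, "", form)
            form = form.strip()
        if self.normalize_whitespace:
            # Strip the ends and collapse whitespace runs to single spaces.
            form = " ".join(form.split())
        return form

    def split(self, value):
        """Split a raw value into cleaned form chunks."""
        if self.normalize_unicode:
            value = unicodedata.normalize(self.normalize_unicode, value)
        if value.strip() in self.missing_data:
            return []
        # Assumption: brackets are single characters, and separators
        # inside brackets do not split the value.
        chunks, current, closers = [], [], []
        for char in value:
            if closers and char == closers[-1]:
                closers.pop()
            elif char in self.brackets:
                closers.append(self.brackets[char])
            if char in self.separators and not closers:
                chunks.append("".join(current))
                current = []
            else:
                current.append(char)
        chunks.append("".join(current))
        forms = [self.clean(chunk) for chunk in chunks]
        forms = [f for f in forms if f and f not in self.missing_data]
        return forms[:1] if self.first_form_only else forms


print(FormSpecSketch().split("kaŋ³³ (tree); tu⁵⁵ / -"))
# → ['kaŋ³³', 'tu⁵⁵']
```

With the defaults, the bracketed comment is dropped, the value is split on `;` and `/`, and the trailing `-` is discarded as missing data. Setting `first_form_only=True` would keep only the first of the two forms.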
6 changes: 3 additions & 3 deletions README.md
@@ -29,11 +29,11 @@ This dataset comprises 25 Hmong-Mien varieties, which were originally digitized

 - **Varieties:** 25
 - **Concepts:** 883
-- **Lexemes:** 21,573
+- **Lexemes:** 21,967
 - **Sources:** 1
-- **Synonymy:** 1.01
+- **Synonymy:** 1.02
 - **Invalid lexemes:** 0
-- **Tokens:** 114,328
+- **Tokens:** 115,869
 - **Segments:** 245 (0 BIPA errors, 0 CTLS sound class errors, 245 CLTS modified)
 - **Inventory size (avg):** 70.76

248 changes: 124 additions & 124 deletions TRANSCRIPTION.md
@@ -5,184 +5,184 @@

| Segment | Occurrence | BIPA | CLTS SoundClass |
|:----------|-------------:|:-------|:------------------|
| + | 9578 |||
| a | 7344 |||
| ŋ | 5164 |||
| ³³ | 3929 |||
| o | 3784 |||
| ³¹ | 3600 |||
| ⁵⁵ | 3579 |||
| u | 3450 |||
| n | 3275 |||
| t | 3275 |||
| ³⁵ | 3246 |||
| i | 3114 |||
| ⁴⁴ | 2937 |||
| e | 2726 |||
| j | 2518 |||
| k | 2504 |||
| ⁵³ | 2468 |||
| p | 2268 |||
| ¹³ | 2212 |||
| l | 1894 |||
| ʔ | 1649 |||
| m | 1648 |||
| ²² | 1569 |||
| ə | 1511 |||
| ²⁴ | 1456 |||
| ⁴² | 1406 |||
|| 1335 |||
| q | 1270 |||
| ɔ | 1257 |||
| ei | 1221 |||
| s | 1112 |||
| au | 1080 |||
| ¹¹ | 853 |||
| ts | 817 |||
| ⁰/² | 804 |||
| ai | 778 |||
| w | 701 |||
| ʑ | 697 |||
| eu | 680 |||
| h | 661 |||
| ɛ | 606 |||
| ²¹ | 578 |||
| ȵ | 545 |||
| ɑ | 545 |||
| ɬ | 544 |||
|| 512 |||
|| 494 |||
| ⁰/³ | 491 |||
| u/w | 472 |||
| ɕ | 467 |||
| d | 437 |||
| v | 385 |||
| ⁿt | 382 |||
| ⁿp | 362 |||
|| 357 |||
| ³¹³ | 352 |||
| ³² | 346 |||
| tsʰ | 343 |||
| ɯ | 312 |||
| b | 302 |||
| ʐ | 295 |||
| f | 278 |||
| ⁴³ | 265 |||
| əu | 264 |||
| tɕʰ | 263 |||
|| 233 |||
| z | 223 |||
| æ | 214 |||
| g | 207 |||
| | 207 |||
|| 198 |||
| ui | 198 |||
| ⁵⁴ | 189 |||
| ɣ | 188 |||
| + | 9624 |||
| a | 7436 |||
| ŋ | 5232 |||
| ³³ | 3976 |||
| o | 3846 |||
| ³¹ | 3638 |||
| ⁵⁵ | 3625 |||
| u | 3494 |||
| n | 3326 |||
| t | 3317 |||
| ³⁵ | 3296 |||
| i | 3153 |||
| ⁴⁴ | 2971 |||
| e | 2767 |||
| j | 2556 |||
| k | 2541 |||
| ⁵³ | 2510 |||
| p | 2292 |||
| ¹³ | 2244 |||
| l | 1945 |||
| ʔ | 1676 |||
| m | 1671 |||
| ²² | 1596 |||
| ə | 1534 |||
| ²⁴ | 1484 |||
| ⁴² | 1437 |||
|| 1348 |||
| q | 1280 |||
| ɔ | 1276 |||
| ei | 1238 |||
| s | 1125 |||
| au | 1089 |||
| ¹¹ | 864 |||
| ts | 826 |||
| ⁰/² | 807 |||
| ai | 791 |||
| ʑ | 710 |||
| w | 709 |||
| eu | 689 |||
| h | 671 |||
| ɛ | 621 |||
| ²¹ | 592 |||
| ȵ | 555 |||
| ɑ | 552 |||
| ɬ | 548 |||
|| 517 |||
|| 500 |||
| ⁰/³ | 496 |||
| u/w | 480 |||
| ɕ | 470 |||
| d | 443 |||
| v | 393 |||
| ⁿt | 388 |||
| ⁿp | 366 |||
|| 362 |||
| ³¹³ | 353 |||
| ³² | 352 |||
| tsʰ | 346 |||
| ɯ | 321 |||
| b | 309 |||
| ʐ | 300 |||
| f | 279 |||
| əu | 272 |||
| ⁴³ | 270 |||
| tɕʰ | 265 |||
|| 234 |||
| z | 229 |||
| æ | 219 |||
| | 211 |||
| g | 209 |||
|| 208 |||
| ui | 202 |||
| ⁵⁴ | 192 |||
| ɣ | 191 |||
| ɿ | 186 |||
| ⁿk | 180 |||
|| 175 |||
| ⁿk | 183 |||
|| 178 |||
| ²³¹ | 177 |||
| ʈ | 173 |||
|| 171 |||
| ʈ | 171 |||
| ²³¹ | 170 |||
| ⁰/⁵ | 167 |||
| i/j | 162 |||
| ⁵¹ | 162 |||
| ⁿtɕ | 151 |||
| ⁵¹ | 164 |||
| i/j | 163 |||
| ⁿtɕ | 153 |||
|| 146 |||
| x | 139 |||
|| 131 |||
| õ | 130 |||
| y | 129 |||
| x | 146 |||
|| 136 |||
| õ | 134 |||
| y | 131 |||
| əɯ | 127 |||
|| 125 |||
| ʂ | 123 |||
|| 126 |||
| ʂ | 126 |||
| ɒ | 122 |||
| ʃ | 120 |||
|| 113 |||
| ʈʂ | 101 |||
| | 100 |||
| ²⁴¹ | 98 |||
| ⁿts | 90 |||
| ã | 88 |||
|| 83 |||
| ⁿb | 81 |||
| ⁿq | 81 |||
|| 114 |||
| | 105 |||
| ʈʂ | 104 |||
| ²⁴¹ | 102 |||
| ⁿts | 92 |||
| ã | 89 |||
|| 87 |||
| ⁿb | 82 |||
| ⁿq | 82 |||
| ŋ̩ | 78 |||
| ²¹² | 74 |||
| ə̃ | 73 |||
| ⁿʈ | 73 |||
| dz | 71 |||
| ə̃ | 71 |||
| ⁰/⁴ | 70 |||
| ⁿʈ | 70 |||
| ⁰/⁴ | 71 |||
|| 67 |||
| θ | 66 |||
|| 65 |||
| ey | 64 |||
| ¹² | 64 |||
| ˀt | 63 |||
| ɕʰ | 57 |||
| ʁ | 57 |||
| ʁ | 59 |||
| ɕʰ | 58 |||
| ʈʂʰ | 58 |||
| æ̃ | 56 |||
| ˀl | 56 |||
| ˀp | 55 |||
| aːi | 54 |||
| ˀp | 54 |||
| ʈʂʰ | 53 |||
| ˀl | 52 |||
|| 51 |||
|| 52 |||
| ð | 51 |||
| oi | 48 |||
| ²³² | 47 |||
| ˀʑ | 47 |||
| χ | 47 |||
| ²³² | 44 |||
| ȵ̥̥ | 44 |||
| ⁿtsʰ | 44 |||
| ȵ̥̥ | 43 |||
| c | 42 |||
| c | 43 |||
| tʃʰ | 42 |||
| ⁿʈʂ | 41 |||
| ⁿpʰ | 40 |||
| ⁿʈʂ | 40 |||
| aːu | 38 |||
| ⁿtʃ | 38 |||
| aːu | 37 |||
| ŋ̥ | 37 |||
| ˀj | 37 |||
| oːi | 36 |||
| ŋ̥ | 35 |||
| ʈʰ | 34 |||
| ʈʰ | 35 |||
| ȵ̥ | 32 |||
| əi | 32 |||
| ⁿkʰ | 31 |||
| ⁿtʰ | 31 |||
| ˀŋ | 30 |||
| ⁿtʰ | 30 |||
| ɦ | 28 |||
| ɦ | 29 |||
| ɬʰ | 28 |||
|| 26 |||
| ŋ̍ | 26 |||
| ⁿɢ | 25 |||
|| 24 |||
|| 24 |||
| ⁿdʑ | 24 |||
|| 23 |||
|| 23 |||
| ɥ | 23 |||
| ɭ | 23 |||
|| 22 |||
| ɪ | 22 |||
| ˀn | 22 |||
| ⁿdʱ | 21 |||
| ⁿtɕʰ | 21 |||
| ȵʱ | 20 |||
| ɭ | 20 |||
| ⁿd | 20 |||
| ⁿtɕʰ | 20 |||
| ĩ | 19 |||
| ⁿdz | 19 |||
|| 18 |||
| ĩ | 18 |||
| ɢ | 18 |||
| ʍ | 18 |||
| tɬʰ | 17 |||
| ɔi | 17 |||
| ɬʲ | 17 |||
| ɳ | 17 |||
| øe | 16 |||
| ⁿg | 16 |||
| ⁿgʱ | 16 |||
|| 15 |||
|| 15 |||
|| 15 |||
|| 15 |||
| ɬʲ | 15 |||
|| 14 |||
| ʅ | 14 |||
| ˀȵ | 14 |||
| ⁿʈʂʰ | 14 |||
@@ -192,13 +192,13 @@
|| 11 |||
| ⁿqʰ | 11 |||
|| 10 |||
|| 10 |||
| y/ɥ | 10 |||
| ʑʱ | 10 |||
| ˀw | 10 |||
| ⁿʈʰ | 10 |||
| dzʱ | 9 |||
| uːi | 9 |||
|| 9 |||
| ȵ̩ | 9 |||
| ⁿtʃʰ | 9 |||
| ⁿɢʱ | 9 |||
@@ -218,8 +218,8 @@
| ɖʱ | 5 |||
| ⁿdzʱ | 5 |||
| ⁿsʰ | 5 |||
| dʐʱ | 4 |||
| ⁿbʱ | 4 |||
| dʐʱ | 3 |||
| u/ʐʷ | 3 |||
| ɒi | 3 |||
| ⁿbʰ | 3 |||
@@ -230,6 +230,7 @@
|| 2 |||
| ⁵² | 2 |||
| ⁿdʐ | 2 |||
| ⁿtθʰ | 2 |||
| aəi/ai | 1 |||
|| 1 |||
| ji/j | 1 |||
@@ -248,7 +249,6 @@
| ʂʰ | 1 |||
| ⁴¹ | 1 |||
| ⁿdʐʱ | 1 |||
| ⁿtθʰ | 1 |||
| ⁿɖʐʱ | 1 |||

(245 rows)