@@ -11,7 +11,7 @@ in the usual way, either by getting it from pip:

    pip3 install wordfreq

- or by getting the repository and installing it using [poetry][]:
+ or by getting the repository and installing it for development, using [poetry][]:

    poetry install

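Either way, you can check that the package imports and returns data (an illustrative snippet, not part of the original instructions):

    >>> import wordfreq
    >>> wordfreq.word_frequency('test', 'en') > 0
    True
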
@@ -23,8 +23,8 @@ steps that are necessary to get Chinese, Japanese, and Korean word frequencies.
## Usage

wordfreq provides access to estimates of the frequency with which a word is
- used, in 36 languages (see *Supported languages* below). It uses many different
- data sources, not just one corpus.
+ used, in over 40 languages (see *Supported languages* below). It uses many
+ different data sources, not just one corpus.

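For instance (an illustrative call, not an excerpt from this README; exact behavior depends on the wordlist version):

    >>> from wordfreq import word_frequency
    >>> word_frequency('palabra', 'es') > 0   # the same call works across the supported languages
    True
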
It provides both 'small' and 'large' wordlists:

@@ -144,8 +144,8 @@ as `##` or `####` or `#.#####`, with `#` standing in for digits. (For compatibility
with earlier versions of wordfreq, our stand-in character is actually `0`.) This
is the same form of aggregation that the word2vec vocabulary uses.

- Single-digit numbers are unaffected by this "binning" process; "0" through "9" have
- their own entries in each language's wordlist.
+ Single-digit numbers are unaffected by this process; "0" through "9" have their own
+ entries in each language's wordlist.
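
The binning can be sketched like this (a hypothetical helper for illustration, not wordfreq's internal code):

    import re

    def bin_digits(token: str) -> str:
        # The aggregation described above: every digit becomes the stand-in
        # character "0", so "2022", "1985", and "9999" all map to the shared
        # wordlist entry "0000". Single digits keep their own entries.
        if re.fullmatch(r"\d", token):
            return token
        return re.sub(r"\d", "0", token)

Here `bin_digits("1022")` produces "0000", the aggregated entry whose frequency is then adjusted as described next.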

When asked for the frequency of a token containing multiple digits, we multiply
the frequency of that aggregated entry by a distribution estimating the frequency
@@ -158,10 +158,10 @@ The first digits are assigned probabilities by Benford's law, and years are assigned
probabilities from a distribution that peaks at the "present". I explored this in
a Twitter thread at <https://twitter.com/r_speer/status/1493715982887571456>.

- The part of this distribution representing the "present" is not strictly a peak;
- it's a 20-year-long plateau from 2019 to 2039. (2019 is the last time Google Books
- Ngrams was updated, and 2039 is a time by which I will probably have figured out
- a new distribution.)
+ The part of this distribution representing the "present" is not strictly a peak and
+ doesn't move forward with time as the present does. Instead, it's a 20-year-long
+ plateau from 2019 to 2039. (2019 is the last time Google Books Ngrams was updated,
+ and 2039 is a time by which I will probably have figured out a new distribution.)
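
The first-digit assignment is the standard Benford's-law formula; as a minimal sketch (not wordfreq's actual implementation):

    import math

    def benford_first_digit(d: int) -> float:
        # Benford's law: P(first digit = d) = log10(1 + 1/d) for d in 1..9.
        return math.log10(1 + 1 / d)

    # benford_first_digit(1) is about 0.301; benford_first_digit(9) is about 0.046.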

Some examples:

@@ -172,7 +172,7 @@ Some examples:
    >>> word_frequency("1022", "en")
    1.28e-07

- Aside from years, the distribution does **not** care about the meaning of the numbers:
+ Aside from years, the distribution does not care about the meaning of the numbers:

>>> word_frequency("90210", "en")
178
178
3.34e-10
@@ -419,19 +419,16 @@ As much as we would like to give each language its own distinct code and its
own distinct word list with distinct source data, there aren't actually sharp
boundaries between languages.

- Sometimes, it's convenient to pretend that the boundaries between
- languages coincide with national borders, following the maxim that "a language
- is a dialect with an army and a navy" (Max Weinreich). This gets complicated
- when the linguistic situation and the political situation diverge.
- Moreover, some of our data sources rely on language detection, which of course
- has no idea which country the writer of the text belongs to.
+ Sometimes, it's convenient to pretend that the boundaries between languages
+ coincide with national borders, following the maxim that "a language is a
+ dialect with an army and a navy" (Max Weinreich). This gets complicated when the
+ linguistic situation and the political situation diverge. Moreover, some of our
+ data sources rely on language detection, which of course has no idea which
+ country the writer of the text belongs to.

So we've had to make some arbitrary decisions about how to represent the
fuzzier language boundaries, such as those within Chinese, Malay, and
- Croatian/Bosnian/Serbian. See [Language Log][] for some firsthand reports of
- the mutual intelligibility or unintelligibility of languages.
-
- [Language Log]: http://languagelog.ldc.upenn.edu/nll/?p=12633
+ Croatian/Bosnian/Serbian.

Smoothing over our arbitrary decisions is the fact that we use the `langcodes`
module to find the best match for a language code. If you ask for word
@@ -446,6 +443,9 @@ the 'cjk' feature:

    pip install wordfreq[cjk]

+ You can put `wordfreq[cjk]` in a list of dependencies, such as the
+ `[tool.poetry.dependencies]` list of your own project.
+
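
Such a declaration might look like this fragment of a `pyproject.toml` (an illustrative sketch; the version constraint is made up):

    [tool.poetry.dependencies]
    wordfreq = { version = "*", extras = ["cjk"] }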

Tokenizing Chinese depends on the `jieba` package, tokenizing Japanese depends
on `mecab-python3` and `ipadic`, and tokenizing Korean depends on `mecab-python3`
and `mecab-ko-dic`.
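
As a quick way to see that the CJK extras are active (an illustrative snippet; `tokenize` is wordfreq's tokenization function):

    from wordfreq import tokenize
    # With the 'cjk' extra installed, Japanese text is segmented by MeCab:
    print(tokenize('東京に住んでいます', 'ja'))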