Commit 0fc7756 — "packaging updates"
Parent: 3180972

7 files changed: +42 −93 lines

CHANGELOG.md (+10 −1)

@@ -13,7 +13,7 @@ estimated distribution that allows for Benford's law (lower numbers are more
 frequent) and a special frequency distribution for 4-digit numbers that look
 like years (2010 is more frequent than 1020).
 
-Relatedly:
+More changes related to digits:
 
 - Functions such as `iter_wordlist` and `top_n_list` no longer return
   multi-digit numbers (they used to return them in their "smashed" form, such
@@ -23,6 +23,15 @@
   instead in a place that's internal to the `word_frequency` function, so we can
   look at the values of the digits before they're replaced.
 
+Other changes:
+
+- wordfreq is now developed using `poetry` as its package manager, and with
+  `pyproject.toml` as the source of configuration instead of `setup.py`.
+
+- The minimum version of Python supported is 3.7.
+
+- Type information is exported using `py.typed`.
+
 ## Version 2.5.1 (2021-09-02)
 
 - Import ftfy and use its `uncurl_quotes` method to turn curly quotes into

Jenkinsfile (−4)

This file was deleted.

README.md (+20 −20)
@@ -11,7 +11,7 @@ in the usual way, either by getting it from pip:
 
     pip3 install wordfreq
 
-or by getting the repository and installing it using [poetry][]:
+or by getting the repository and installing it for development, using [poetry][]:
 
     poetry install
 
@@ -23,8 +23,8 @@ steps that are necessary to get Chinese, Japanese, and Korean word frequencies.
 ## Usage
 
 wordfreq provides access to estimates of the frequency with which a word is
-used, in 36 languages (see *Supported languages* below). It uses many different
-data sources, not just one corpus.
+used, in over 40 languages (see *Supported languages* below). It uses many
+different data sources, not just one corpus.
 
 It provides both 'small' and 'large' wordlists:
 
@@ -144,8 +144,8 @@ as `##` or `####` or `#.#####`, with `#` standing in for digits. (For compatibility
 with earlier versions of wordfreq, our stand-in character is actually `0`.) This
 is the same form of aggregation that the word2vec vocabulary does.
 
-Single-digit numbers are unaffected by this "binning" process; "0" through "9" have
-their own entries in each language's wordlist.
+Single-digit numbers are unaffected by this process; "0" through "9" have their own
+entries in each language's wordlist.
 
 When asked for the frequency of a token containing multiple digits, we multiply
 the frequency of that aggregated entry by a distribution estimating the frequency
@@ -158,10 +158,10 @@ The first digits are assigned probabilities by Benford's law, and years are assigned
 probabilities from a distribution that peaks at the "present". I explored this in
 a Twitter thread at <https://twitter.com/r_speer/status/1493715982887571456>.
 
-The part of this distribution representing the "present" is not strictly a peak;
-it's a 20-year-long plateau from 2019 to 2039. (2019 is the last time Google Books
-Ngrams was updated, and 2039 is a time by which I will probably have figured out
-a new distribution.)
+The part of this distribution representing the "present" is not strictly a peak and
+doesn't move forward with time as the present does. Instead, it's a 20-year-long
+plateau from 2019 to 2039. (2019 is the last time Google Books Ngrams was updated,
+and 2039 is a time by which I will probably have figured out a new distribution.)
 
 Some examples:
 
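The "present plateau" added in this hunk can be pictured with a toy sketch. The plateau bounds (2019–2039) come from the text above; the decay rate and the function name `year_weight` are illustrative assumptions, not wordfreq's actual distribution:

```python
# Toy illustration of a year distribution that plateaus at the "present".
# The 0.95 decay rate and the name `year_weight` are hypothetical; only the
# plateau bounds (2019-2039) come from the README text.

def year_weight(year: int, start: int = 2019, end: int = 2039) -> float:
    """Unnormalized weight: flat on [start, end], decaying into the past."""
    if start <= year <= end:
        return 1.0          # every year on the plateau is equally likely
    if year < start:
        return 0.95 ** (start - year)  # gentle decay for older years
    return 0.0              # no weight beyond the plateau

# The plateau means 2019 and 2039 get equal weight, and older years less:
assert year_weight(2019) == year_weight(2039) == 1.0
assert year_weight(1990) < year_weight(2010) < year_weight(2019)
```

The point of the plateau, as the text notes, is that the distribution does not have to be re-centered every year.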
@@ -172,7 +172,7 @@ Some examples:
     >>> word_frequency("1022", "en")
     1.28e-07
 
-Aside from years, the distribution does **not** care about the meaning of the numbers:
+Aside from years, the distribution does not care about the meaning of the numbers:
 
     >>> word_frequency("90210", "en")
     3.34e-10
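The "smashing" and Benford weighting described in this part of the README can be sketched in a few lines of Python. This is a simplified illustration: `smash` and `benford_first_digit` are hypothetical names, and wordfreq's real adjustment also includes the year distribution discussed above.

```python
import re
from math import log10

def smash(token: str) -> str:
    """Replace every digit with the stand-in character '0', so that e.g.
    '1022' and '2022' share the same aggregated wordlist entry '0000'."""
    return re.sub(r"\d", "0", token)

def benford_first_digit(d: int) -> float:
    """Benford's law: P(first digit = d) = log10(1 + 1/d)."""
    return log10(1 + 1 / d)

assert smash("1022") == "0000"
assert smash("90210") == "00000"
# Lower first digits are more frequent, and the probabilities sum to 1:
assert benford_first_digit(1) > benford_first_digit(9)
assert abs(sum(benford_first_digit(d) for d in range(1, 10)) - 1.0) < 1e-9
```

Single-digit tokens like "7" bypass this path entirely, since they have their own wordlist entries.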
@@ -419,19 +419,16 @@ As much as we would like to give each language its own distinct code and its
 own distinct word list with distinct source data, there aren't actually sharp
 boundaries between languages.
 
-Sometimes, it's convenient to pretend that the boundaries between
-languages coincide with national borders, following the maxim that "a language
-is a dialect with an army and a navy" (Max Weinreich). This gets complicated
-when the linguistic situation and the political situation diverge.
-Moreover, some of our data sources rely on language detection, which of course
-has no idea which country the writer of the text belongs to.
+Sometimes, it's convenient to pretend that the boundaries between languages
+coincide with national borders, following the maxim that "a language is a
+dialect with an army and a navy" (Max Weinreich). This gets complicated when the
+linguistic situation and the political situation diverge. Moreover, some of our
+data sources rely on language detection, which of course has no idea which
+country the writer of the text belongs to.
 
 So we've had to make some arbitrary decisions about how to represent the
 fuzzier language boundaries, such as those within Chinese, Malay, and
-Croatian/Bosnian/Serbian. See [Language Log][] for some firsthand reports of
-the mutual intelligibility or unintelligibility of languages.
-
-[Language Log]: http://languagelog.ldc.upenn.edu/nll/?p=12633
+Croatian/Bosnian/Serbian.
 
 Smoothing over our arbitrary decisions is the fact that we use the `langcodes`
 module to find the best match for a language code. If you ask for word
@@ -446,6 +443,9 @@ the 'cjk' feature:
 
     pip install wordfreq[cjk]
 
+You can put `wordfreq[cjk]` in a list of dependencies, such as the
+`[tool.poetry.dependencies]` list of your own project.
+
 Tokenizing Chinese depends on the `jieba` package, tokenizing Japanese depends
 on `mecab-python3` and `ipadic`, and tokenizing Korean depends on `mecab-python3`
 and `mecab-ko-dic`.
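Following the README's note about declaring the extra in a `[tool.poetry.dependencies]` table, a consuming project's `pyproject.toml` could look like this (the `"*"` version constraint is a placeholder, not a recommendation):

```toml
# Hypothetical consumer pyproject.toml fragment, for illustration only.
[tool.poetry.dependencies]
python = "^3.7"
wordfreq = { version = "*", extras = ["cjk"] }
```

With plain pip, the equivalent requirement string is `wordfreq[cjk]`, as shown in the install command above.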

poetry.lock (+6 −1)

(Generated file; diff not rendered.)

pyproject.toml (+6)

@@ -5,6 +5,7 @@ description = "Look up the frequencies of words in many languages, based on many
 authors = ["Robyn Speer <rspeer@arborelia.net>"]
 license = "MIT"
 readme = "README.md"
+homepage = "https://github.com/rspeer/wordfreq/"
 
 [tool.poetry.dependencies]
 python = "^3.7"
@@ -25,6 +26,11 @@ black = "^22.1.0"
 flake8 = "^4.0.1"
 types-setuptools = "^57.4.9"
 
+[tool.poetry.extras]
+cjk = ["mecab-python3", "ipadic", "mecab-ko-dic", "jieba >= 0.42"]
+mecab = ["mecab-python3", "ipadic", "mecab-ko-dic"]
+jieba = ["jieba >= 0.42"]
+
 [build-system]
 requires = ["poetry-core>=1.0.0"]
 build-backend = "poetry.core.masonry.api"

setup.cfg (−2)

This file was deleted.

setup.py (−65)

This file was deleted.
