
Commit bc12599

Author: Lance Nathan

Merge pull request #60 from LuminosoInsight/gender-neutral-at

Recognize "@" in gender-neutral word endings as part of the token

2 parents ca9cf7d + d9fc6ec · commit bc12599


56 files changed: +36676 −35956 lines. (This is a large commit; only a subset of the changed files is shown below.)

CHANGELOG.md (+21)

@@ -1,3 +1,24 @@
+## Version 2.2 (2018-07-24)
+
+Library change:
+
+- While the @ sign is usually considered a symbol and not part of a word, there
+  is a case where it acts like a letter. It's used in one way of writing
+  gender-neutral words in Spanish and Portuguese, such as "l@s niñ@s". The
+  tokenizer in wordfreq will now allow words to end with "@" or "@s", so it
+  can recognize these words.
+
+Data changes:
+
+- Updated the data from Exquisite Corpus to filter the ParaCrawl web crawl
+  better. ParaCrawl provides two metrics (Zipporah and Bicleaner) for the
+  goodness of its data, and we now filter it to only use texts that get
+  positive scores on both metrics.
+
+- The input data includes the change to tokenization described above, giving
+  us word frequencies for words such as "l@s".
+
+
 ## Version 2.1 (2018-06-18)

 Data changes:
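
A quick way to exercise the library change described in this entry (a minimal sketch; it assumes wordfreq >= 2.2 and its rebuilt data are installed):

    # Minimal sketch of the Version 2.2 behavior described above;
    # assumes wordfreq >= 2.2 with the rebuilt frequency data.
    from wordfreq import tokenize, word_frequency

    # The tokenizer now keeps "@" and "@s" word endings as part of the token:
    print(tokenize('l@s niñ@s', 'es'))       # ['l@s', 'niñ@s']

    # ...and the rebuilt data gives such words a nonzero frequency:
    print(word_frequency('l@s', 'es') > 0)   # True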

README.md (+17 −6)

@@ -48,13 +48,13 @@ frequency as a decimal between 0 and 1.
     1.07e-05

     >>> word_frequency('café', 'en')
-    5.89e-06
+    5.75e-06

     >>> word_frequency('cafe', 'fr')
     1.51e-06

     >>> word_frequency('café', 'fr')
-    5.25e-05
+    5.13e-05


 `zipf_frequency` is a variation on `word_frequency` that aims to return the
@@ -78,10 +78,10 @@ one occurrence per billion words.
     5.29

     >>> zipf_frequency('frequency', 'en')
-    4.42
+    4.43

     >>> zipf_frequency('zipf', 'en')
-    1.55
+    1.57

     >>> zipf_frequency('zipf', 'en', wordlist='small')
     0.0
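
For readers relating these Zipf values to `word_frequency`: as the README text around this hunk says, the Zipf scale expresses a word's frequency per billion words on a base-10 log scale, so the two functions are interconvertible. A small sketch of that relationship (the helper name `zipf_from_frequency` is ours, for illustration only):

    import math

    def zipf_from_frequency(freq):
        # Zipf value = log10 of the word's frequency per billion words
        return math.log10(freq * 1e9)

    # A frequency of 1e-6 (once per million words) corresponds to Zipf 3.0
    print(zipf_from_frequency(1e-6))   # 3.0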
@@ -276,7 +276,8 @@ produces tokens that follow the recommendations in [Unicode
 Annex #29, Text Segmentation][uax29], including the optional rule that
 splits words between apostrophes and vowels.

-There are language-specific exceptions:
+There are exceptions where we change the tokenization to work better
+with certain languages:

 - In Arabic and Hebrew, it additionally normalizes ligatures and removes
   combining marks.
@@ -288,19 +289,29 @@ There are language-specific exceptions:
 - In Chinese, it uses the external Python library `jieba`, another optional
   dependency.

+- While the @ sign is usually considered a symbol and not part of a word,
+  wordfreq will allow a word to end with "@" or "@s". This is one way of
+  writing gender-neutral words in Spanish and Portuguese.
+
 [uax29]: http://unicode.org/reports/tr29/

 When wordfreq's frequency lists are built in the first place, the words are
 tokenized according to this function.

+    >>> from wordfreq import tokenize
+    >>> tokenize('l@s niñ@s', 'es')
+    ['l@s', 'niñ@s']
+    >>> zipf_frequency('l@s', 'es')
+    2.8
+
 Because tokenization in the real world is far from consistent, wordfreq will
 also try to deal gracefully when you query it with texts that actually break
 into multiple tokens:

     >>> zipf_frequency('New York', 'en')
     5.28
     >>> zipf_frequency('北京地铁', 'zh')   # "Beijing Subway"
-    3.57
+    3.61

 The word frequencies are combined with the half-harmonic-mean function in order
 to provide an estimate of what their combined frequency would be. In Chinese,
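
The half-harmonic-mean combination mentioned in that closing paragraph can be sketched directly. This illustrates the formula the README names, not wordfreq's internal code, which may apply further adjustments on top of it (the Chinese test change below hints at an extra penalty for inferred token boundaries):

    def combined_frequency(f1, f2):
        # Half the harmonic mean of two frequencies: 1 / (1/f1 + 1/f2).
        # The result is dominated by the rarer of the two tokens.
        return 1.0 / (1.0 / f1 + 1.0 / f2)

    # Two tokens with frequencies 1e-3 and 1e-6 combine to just under 1e-6:
    print(combined_frequency(1e-3, 1e-6))   # ~9.99e-07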

setup.py (+1 −1)

@@ -35,7 +35,7 @@

 setup(
     name="wordfreq",
-    version='2.1.0',
+    version='2.2.0',
     maintainer='Luminoso Technologies, Inc.',
     maintainer_email='info@luminoso.com',
     url='http://github.com/LuminosoInsight/wordfreq/',

tests/test_at_sign.py (+109, new file)

@@ -0,0 +1,109 @@
+from wordfreq import tokenize, lossy_tokenize, word_frequency
+
+
+def test_gender_neutral_at():
+    # Recognize the gender-neutral @ in Spanish as part of the word
+    text = "La protección de los derechos de tod@s l@s trabajador@s migrantes"
+    assert tokenize(text, "es") == [
+        "la",
+        "protección",
+        "de",
+        "los",
+        "derechos",
+        "de",
+        "tod@s",
+        "l@s",
+        "trabajador@s",
+        "migrantes"
+    ]
+
+    text = "el distrito 22@ de Barcelona"
+    assert tokenize(text, 'es') == ["el", "distrito", "22@", "de", "barcelona"]
+    assert lossy_tokenize(text, 'es') == ["el", "distrito", "00@", "de", "barcelona"]
+
+    # It also appears in Portuguese
+    text = "direitos e deveres para @s membr@s da comunidade virtual"
+    assert tokenize(text, "pt") == [
+        "direitos",
+        "e",
+        "deveres",
+        "para",
+        "@s",
+        "membr@s",
+        "da",
+        "comunidade",
+        "virtual"
+    ]
+
+    # Because this is part of our tokenization, the language code doesn't
+    # actually matter, as long as it's a language with Unicode tokenization
+    text = "@s membr@s da comunidade virtual"
+    assert tokenize(text, "en") == ["@s", "membr@s", "da", "comunidade", "virtual"]
+
+
+def test_at_in_corpus():
+    # We have a word frequency for "l@s"
+    assert word_frequency('l@s', 'es') > 0
+
+    # It's not just treated as a word break
+    assert word_frequency('l@s', 'es') < word_frequency('l s', 'es')
+
+
+def test_punctuation_at():
+    # If the @ appears alone in a word, we consider it to be punctuation
+    text = "operadores de canal, que são aqueles que têm um @ ao lado do nick"
+    assert tokenize(text, "pt") == [
+        "operadores",
+        "de",
+        "canal",
+        "que",
+        "são",
+        "aqueles",
+        "que",
+        "têm",
+        "um",
+        "ao",
+        "lado",
+        "do",
+        "nick"
+    ]
+
+    assert tokenize(text, "pt", include_punctuation=True) == [
+        "operadores",
+        "de",
+        "canal",
+        ",",
+        "que",
+        "são",
+        "aqueles",
+        "que",
+        "têm",
+        "um",
+        "@",
+        "ao",
+        "lado",
+        "do",
+        "nick"
+    ]
+
+    # If the @ is not at the end of the word or part of the word ending '@s',
+    # it is also punctuation
+    text = "un archivo hosts.deny que contiene la línea ALL:ALL@ALL"
+    assert tokenize(text, "es") == [
+        "un",
+        "archivo",
+        "hosts.deny",
+        "que",
+        "contiene",
+        "la",
+        "línea",
+        "all:all",
+        "all"
+    ]
+
+    # Make sure not to catch e-mail addresses
+    text = "info@something.example"
+    assert tokenize(text, "en") == [
+        "info",
+        "something.example"
+    ]
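
For the curious, the rule these tests exercise ("a word may end in @ or @s, but a word-internal or standalone @ is punctuation") can be approximated with a regular expression. This is a hypothetical simplification for illustration only; wordfreq's actual tokenizer builds the rule into its Unicode (UAX #29) token expression:

    import re

    # Hypothetical simplified pattern: a run of word characters, optionally
    # ending in "@" or "@s", with nothing word-like allowed to follow.
    WORD_WITH_AT = re.compile(r"\w+(?:@s?)?(?!\w)")

    print(WORD_WITH_AT.findall("l@s niñ@s"))        # ['l@s', 'niñ@s']
    print(WORD_WITH_AT.findall("el distrito 22@"))  # ['el', 'distrito', '22@']
    print(WORD_WITH_AT.findall("info@something"))   # ['info', 'something']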

tests/test_chinese.py (+1 −1)

@@ -59,7 +59,7 @@ def test_tokens():

 def test_combination():
     xiexie_freq = word_frequency('谢谢', 'zh')   # "Thanks"
-    assert word_frequency('谢谢谢谢', 'zh') == pytest.approx(xiexie_freq / 20)
+    assert word_frequency('谢谢谢谢', 'zh') == pytest.approx(xiexie_freq / 20, rel=0.01)


 def test_alternate_codes():
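
This change only loosens an exact floating-point comparison: `pytest.approx(x, rel=0.01)` accepts any value within 1% of x, presumably to absorb small rounding effects in the stored frequencies. A minimal illustration:

    import pytest

    # rel=0.01 sets a 1% relative tolerance around the expected value
    assert 0.0995 == pytest.approx(0.1, rel=0.01)   # within 1%: passes
    assert 0.09 != pytest.approx(0.1, rel=0.01)     # 10% off: not approx-equal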
