This script processes Wikipedia article dumps from https://dumps.wikimedia.org/enwiki/ and gathers word frequency distribution data. It uses wikiextractor to extract raw text from the dumps, then strips punctuation marks and normalizes Unicode dashes and apostrophes. Words containing digits are discarded, and only words that appear in at least 3 different articles are kept.
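The cleaning steps roughly correspond to the sketch below. This is a simplified illustration, not the actual gather_wordfreq.py code; the normalization table and whitespace tokenization are assumptions:

```python
import string
from collections import Counter

# Map common Unicode dashes and apostrophes to ASCII (an assumed, non-exhaustive set).
NORMALIZE = str.maketrans({"\u2013": "-", "\u2014": "-", "\u2018": "'", "\u2019": "'"})

def count_words(articles):
    """articles: iterable of plain-text article strings (e.g. wikiextractor output)."""
    freq = Counter()      # total occurrences of each word
    doc_freq = Counter()  # number of distinct articles each word appears in
    for text in articles:
        seen = set()
        for token in text.translate(NORMALIZE).lower().split():
            word = token.strip(string.punctuation)
            # Discard empty tokens and words containing a digit.
            if word and not any(ch.isdigit() for ch in word):
                freq[word] += 1
                seen.add(word)
        doc_freq.update(seen)
    # Keep only words used in at least 3 different articles.
    return {w: c for w, c in freq.items() if doc_freq[w] >= 3}
```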
The script was inspired by this article, which unfortunately provided very inaccurate data: stray punctuation marks and other artifacts.
The script requires Python 3. On macOS, there is a known bug with Python 3.8, so use Python 3.7 or earlier.
Install requirements:
pip install -r requirements.txt
Download the current Wikipedia dumps for the desired language:
WIKI=enwiki
wget -np -r --accept-regex \
"https:\/\/dumps\.wikimedia\.org\/${WIKI}\/latest\/${WIKI}-latest-pages-articles[0-9]*\.xml.bz2" \
https://dumps.wikimedia.org/${WIKI}/latest/
Note that for enwiki (as of April 2023) this will require about 19 GB of free space.
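Before parsing, a quick sanity check on the downloaded parts can help. This sketch assumes wget's mirror directory layout from the command above:

```python
import glob
import os

wiki = os.environ.get("WIKI", "enwiki")
parts = sorted(glob.glob(f"dumps.wikimedia.org/{wiki}/latest/*.bz2"))
total = sum(os.path.getsize(p) for p in parts)
print(f"{len(parts)} dump parts, {total / 2**30:.1f} GiB total")
```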
Parse dumps and save results:
python ./gather_wordfreq.py dumps.wikimedia.org/${WIKI}/latest/*.bz2 > wordfreq.txt
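The exact output format is defined by gather_wordfreq.py; assuming one word and its count per line, sorted by descending frequency, the top of the list can be inspected with a snippet like:

```python
import itertools

# Print the 20 most frequent words from wordfreq.txt.
# Assumes one "word count" pair per line, sorted by descending count.
with open("wordfreq.txt", encoding="utf-8") as f:
    for line in itertools.islice(f, 20):
        word, count = line.split()
        print(f"{word}: {count}")
```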
The word frequency data for English, Spanish, French, German, Italian, Portuguese, Dutch, Arabic, Polish, Egyptian, Japanese, Russian, Cebuano, Swedish, Ukrainian, Vietnamese, Chinese, Waray, Afrikaans & Swahili are provided in the results directory.
English results:
- Total unique words appearing in at least 3 articles: 2,747,823
- Top 20 most popular words: the, of, in, and, a, to, was, is, on, for, as, with, by, he, that, at, from, his, it, an.