This project visualizes three datasets:
Twitter Corpus (23.7 billion words, 2017-2023)
Corpus of Global Language Use (ISLRN: 951-235-998-601-3; 329.4 billion words, 2013-2019, from the Common Crawl)
GeoWAC (ISLRN: 946-519-559-042-9; 42 billion words; geographically-balanced gigaword corpora for 48 languages)
The per-country aggregates can be found in the docs/data folder as CSV files.
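As a starting point for exploring those files, the following is a minimal sketch that lists each CSV in docs/data along with its header row. The file names and column layout are not documented here, so the code makes no assumptions about them; it only assumes the files are UTF-8 encoded and that the first line of each file is a header. The function name csv_headers is hypothetical and not part of this repository.

```python
import csv
from pathlib import Path


def csv_headers(data_dir="docs/data"):
    """Map each CSV file name in data_dir to its first row (assumed to be a header)."""
    headers = {}
    for path in sorted(Path(data_dir).glob("*.csv")):
        # Assumption: files are UTF-8; adjust the encoding if that is not the case.
        with path.open(newline="", encoding="utf-8") as handle:
            # next(..., []) gives an empty list for an empty file
            headers[path.name] = next(csv.reader(handle), [])
    return headers


if __name__ == "__main__":
    # Run from the repository root so that docs/data resolves correctly.
    for name, columns in csv_headers().items():
        print(name, columns)
```

Run from the repository root, this prints one line per CSV file, which should be enough to see which columns each per-country aggregate provides before loading it into another tool.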
View this project through GitHub Pages: https://jonathandunn.github.io/earthLings/
The full web dataset is now available through this repository:
CGLU: https://www.earthlings.io/download_cglu.html
GeoWAC: https://www.earthlings.io/download_geowac.html
For a description of data collection procedures and the language identification component, see this paper: https://jdunn.name/2020/03/08/mapping-languages-the-corpus-of-global-language-use/
For a description of population-based sampling techniques to create unbiased corpora, see this paper: https://jdunn.name/2020/03/08/geographically-balanced-gigaword-corpora-for-50-language-varieties/
For a study of changes in linguistic diversity during COVID-19, see this paper: https://jdunn.name/2020/10/14/measuring-linguistic-diversity-during-covid-19/
For a demographic and census-based evaluation of these corpora, see this paper: https://jdunn.name/2019/07/22/mapping-languages-and-demographics-with-georeferenced-corpora/
For an overview of dialectal variation and dialect uniqueness values, see this paper: https://jdunn.name/2019/07/22/global-syntactic-variation-in-seven-languages-towards-a-computational-dialectology/
You can also look at my related repositories:
Language ID: https://github.com/jonathandunn/idNet
Web Collection: https://github.com/jonathandunn/common_crawl_corpus