earthLings

Corpus-based language and dialect mapping

This project visualizes three datasets:

Twitter Corpus 
	(23.7 billion words, 2017-2023) 
	
Corpus of Global Language Use
	(ISLRN: 951-235-998-601-3)
	(329.4 billion words, 2013-2019, from the Common Crawl)
	
GeoWAC 
	(ISLRN: 946-519-559-042-9)
	(42 billion words; geographically-balanced gigaword corpora for 48 languages)

The per-country aggregates can be found in the docs/data folder as CSV files.
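
A quick way to see what those files contain is to list each CSV with its header row and row count. The following is a minimal sketch, not part of the project's own tooling: the function name `summarize_csv_files` is hypothetical, and it assumes a local clone of this repository, UTF-8 encoded files, and a header in the first row of each file. No particular file names or column names are assumed.

```python
import csv
from pathlib import Path


def summarize_csv_files(data_dir):
    """Map each CSV file name in data_dir to (header row, number of data rows).

    Hypothetical helper for inspection only; assumes UTF-8 files whose
    first row is a header.
    """
    summary = {}
    for path in sorted(Path(data_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])           # first row, or [] if the file is empty
            row_count = sum(1 for _ in reader)  # rows remaining after the header
        summary[path.name] = (header, row_count)
    return summary
```

From the root of a local clone, `summarize_csv_files("docs/data")` would return one entry per CSV file, which is enough to decide which columns to load for further analysis.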

View this project through GitHub Pages: https://jonathandunn.github.io/earthLings/

The full web dataset is now available for download:

CGLU: https://www.earthlings.io/download_cglu.html

GeoWAC: https://www.earthlings.io/download_geowac.html

For a description of data collection procedures and the language identification component, see this paper: https://jdunn.name/2020/03/08/mapping-languages-the-corpus-of-global-language-use/

For a description of population-based sampling techniques to create unbiased corpora, see this paper: https://jdunn.name/2020/03/08/geographically-balanced-gigaword-corpora-for-50-language-varieties/

For a study of changes in linguistic diversity during COVID-19, see this paper: https://jdunn.name/2020/10/14/measuring-linguistic-diversity-during-covid-19/

For a demographic and census-based evaluation of these corpora, see this paper: https://jdunn.name/2019/07/22/mapping-languages-and-demographics-with-georeferenced-corpora/

For an overview of dialectal variation and dialect uniqueness values, see this paper: https://jdunn.name/2019/07/22/global-syntactic-variation-in-seven-languages-towards-a-computational-dialectology/

You can also look at my related repositories:

Language ID: https://github.com/jonathandunn/idNet

Web Collection: https://github.com/jonathandunn/common_crawl_corpus
