A list of corpora and corpus-related tools

Let's collaborate on building this document.
For "Access", indicate if the corpus is searchable online, needs purchasing, or freely downloadable.
Don't worry about putting resources in alphabetical order! Just add whatever you like, and make sure you are not adding entries that someone else has already listed.

Corpora mentioned in Gries & Newman

Name/link	Access	Summary
The British National Corpus	online or purchase	100 million word collection of samples of written and spoken language from multiple sources, designed to represent a wide cross-section of British English from the late 20th century
BNC Baby	purchase	"Baby" version of the BNC. Contains BNC sampler and Brown corpus.
The Buckeye Corpus	free online for noncommercial use	Audio and text files, with phonetic labels, of interviews with 40 people in Columbus, OH. Formatted for speech analysis software.
TIMIT - The Acoustic-Phonetic Continuous Speech Corpus	purchase	Contains recordings of 630 American speakers of 8 major dialects reading sentences. Includes time-aligned orthographic, phonetic, and word transcriptions. 16-bit, 16 kHz speech waveform files.
ICE - International Corpus of English	free for non-commercial, academic research	26 corpora of national/regional varieties of English. Each ICE corpus contains 500 texts of ~2,000 words each, for a total of ~1 million words. Corpus data is spoken and written English produced after 1989.
The Uppsala Learner English Corpus	free for research and educational purposes	The corpus consists of 1,489 essays written by 440 Swedish university students of English at three different levels, the majority in their first term of full-time studies. The total number of words is 1,221,265, which gives an average essay length of 820 words. A typical first-term essay is somewhat shorter, averaging 777 words.
Corpus of Historical American English (COHA)	searchable for free online (limited number of queries); full-text data available for purchase	Contains more than 400 million words of text from the 1810s to 2009. The corpus is balanced by genre (fiction, magazine, news, etc.) decade by decade.
TalkBank	freely available online	A collection of corpora and transcripts. It offers adult and child language corpora in various media, designed for many different types of studies (aphasia, dementia, second language acquisition, conversation analysis, and sociolinguistics).
Brown corpus	free	Balanced million-word corpus developed in the 1960s. Available at the link to the left and multiple other locations listed in Gries & Newman.
Corpus of Contemporary American English	free	The corpus contains more than 560 million words of text (20 million words each year 1990-2017) and it is equally divided among spoken, fiction, popular magazines, newspapers, and academic texts.
International Computer Archive of Modern and Medieval English Collection	purchase by CD-ROM	Created by an organization of linguists and information scientists who aim to distribute English language material for machine learning, along with compiling an archive of English text corpora for research institutions.
Southern Oral History Program	free	A collection of oral history interviews from the South. Includes several hundred audio files (possibly thousands; the total number is unclear), transcripts of some of the interviews, and the occupation, ethnicity, date of birth, and gender of some of the interviewees.

Additional corpora

Name/link	Access	Summary
Tagged and Cleaned Wikipedia and its Ngram	freely downloadable	A static version (html documents) of the English Wikipedia, downloaded in 2008. Tagged and cleaned, includes n-grams.
Corpus of English Dialogues: 1560-1760	free to download after registration	177 text files containing constructed dialogue or transcribed authentic speech.
Open American National Corpus	free	15 million words divided among written and spoken sources. Sample sections include Switchboard, Slate Magazine, and the ICIC corpus.
National Center for Sign Language and Gesture Resources Corpus	free to browse online, can download annotations	Linguistically annotated ASL data, with multiple synchronized video files showing views from different angles and a close-up of the face, as well as associated linguistic annotations available as XML. Contains 1,866 distinct canonical signs and 11,854 sign tokens.
The Bergen Corpus of London Teenage Language (COLT)	free	Collected in 1993, this corpus consists of the spoken language of 13- to 17-year-old teenagers from different boroughs of London. The complete corpus, half a million words, has been orthographically transcribed and word-class tagged, and is a constituent of the British National Corpus.
Characterising Individual Speakers (CHAINS)	freely downloadable for research purposes	A corpus collected with the aim of facilitating research in speaker identification. Contains novel speech from 36 speakers recorded under a variety of speaking conditions.
Santa Barbara Corpus of Spoken American English	freely downloadable online	249,000 words; includes transcriptions, audio, and timestamps. This corpus contains recordings of naturally occurring spoken interaction from all over the United States, with speakers of different origins, ages, occupations, genders, and ethnic/social backgrounds. The recordings are predominantly face-to-face interaction, but other forms of interaction are also included.
Russian National Corpus	free query access online; subset available for download with restrictions	Corpus of the modern Russian language incorporating over 300 million words. Subcorpora include Deeply Annotated corpus, Parallel corpora (English, German, Ukrainian, Belorussian, and multilingual), Dialectal corpus, Poetry corpus, Educational corpus, and Corpus of spoken Russian.
Sinica Treebank Corpus Sample	purchase	Sinica Treebank 3.0 contains 6 files, 61,087 syntactic tree structures, and 361,834 words. The tree structures were extracted from the Sinica Corpus, and every structure is segmented and parsed. Each segmented word of a tree structure is tagged with its part-of-speech and argument. Sinica Treebank 3.0 is provided free on the website for syntactic and semantic research use. 1,000 syntactic tree structures are available.
Tehran-English Parallel Corpus	free	The corpus consists of approximately 1.22 million sentence fragments of English subtitles and their Persian translations. The corpus is described as "cleaned", but the file itself still contains some formatting that needs to be processed.
GICR: General Internet-Corpus of Russian	not available for download; search requires an account, but it's unclear how to make one	19.8 billion tokens/7 million word forms collected from various Russian-language websites.
Leeds University Russian Internet Corpus	free to search	160-million-word internet corpus of Russian websites.

Tools and software

Name/link	Access	Summary
AntConc Concordancer	free	A freeware corpus analysis toolkit for concordancing and text analysis
CLAWS tagger	purchase	A part-of-speech tagger for English text.
Xaira Searching Software	free	Software for indexing and analyzing large XML resources like corpora
ELAN	free	Tool for the creation of complex annotations on video and audio resources
TYPECRAFT	free	Multilingual Interlinear Glossed Text (IGT) bank.
FreeLing	free	FreeLing is a C++ library providing language analysis functionalities for a variety of languages (English, Spanish, Portuguese, Italian, French, German, Russian, Catalan, Galician, Croatian, Slovene, among others).
jEdit	free	A Java-based text editor that has useful features for formatting texts for corpus-based research. For example, it accepts many different language encodings, allows search and replace over multiple files, and supports search and replace operations using regular expressions.
Voyant tools	free	Web-based text reading and analysis environment, with rich array of data visualizations.
ICECUP	free	ICECUP 3.1 (the ICE Corpus Utility Program) is a state-of-the-art corpus exploration program designed for parsed corpora such as ICE-GB and DCPSE.
ParsPer	free	This parser POS-tags Persian sentences with up to 87% accuracy. It was trained on the Uppsala Persian Dependency Treebank.
Stanford Parser	free	A statistical parser for English and a few other languages, developed at Stanford. It is written in Java and can also be run from the command line.
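
For readers new to concordancing (the core function of tools such as AntConc), here is a minimal keyword-in-context (KWIC) sketch in Python. It is only an illustration of the idea, not how any tool listed above is implemented. The function name `kwic`, the `width` parameter, and the sample sentence are all made up for this example.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of `keyword`.

    `width` is the number of characters of context shown on each side
    (a hypothetical parameter chosen for this sketch).
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

# Made-up sample text; in practice this would be read from a corpus file.
sample = "The corpus is balanced by genre. Each corpus contains 500 texts."
for line in kwic(sample, "corpus", width=20):
    print(line)
```

Dedicated concordancers add sorting, wildcard queries, and frequency lists on top of this basic pattern, so a script like this is mainly useful for quick checks on small plain-text files.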
