corpora_tools_list.md

A list of corpora and corpus-related tools

  • Let's collaborate on building this document.
  • For "Access", indicate if the corpus is searchable online, needs purchasing, or freely downloadable.
  • Don't worry about putting resources in alphabetical order! Just add whatever you like, and make sure you are not adding entries that someone else has already listed.

Corpora mentioned in Gries & Newman

| Name/link | Access | Summary |
| --- | --- | --- |
| The British National Corpus | online or purchase | 100-million-word collection of samples of written and spoken language from multiple sources, designed to represent a wide cross-section of British English from the late 20th century. |
| BNC Baby | purchase | "Baby" version of the BNC. Contains BNC sampler and Brown corpus. |
| The Buckeye Corpus | free online for noncommercial use | Audio and text files, with phonetic labels, of interviews with 40 people in Columbus, OH. Formatted for speech analysis software. |
| TIMIT - The Acoustic-Phonetic Continuous Speech Corpus | purchase | Contains recordings of 630 American speakers of 8 major dialects reading sentences. Includes time-aligned orthographic, phonetic, and word transcriptions. 16-bit, 16 kHz speech waveform files. |
| ICE - International Corpus of English | free for non-commercial, academic research | 26 corpora of national/regional varieties of English. Each ICE corpus contains 500 texts of ~2,000 words each, for a total of ~1 million words. Corpus data is spoken and written English produced after 1989. |
| The Uppsala Learner English Corpus | free for research and educational purposes | The corpus consists of 1,489 essays written by 440 Swedish university students of English at three different levels, the majority in their first term of full-time studies. The total number of words is 1,221,265, which means an average essay length of 820 words. A typical first-term essay is somewhat shorter, averaging 777 words. |
| Corpus of Historical American English (COHA) | searchable for free online (limited number of queries); full-text data available for purchase | Contains more than 400 million words of text from the 1810s to 2009. The corpus is balanced by genre (such as fiction, magazine, and news) decade by decade. |
| TalkBank | freely available online | A collection of corpora and transcripts. It offers adult and child language corpora in various media, designed for many different types of studies (aphasia, dementia, second language acquisition, conversation analysis, and sociolinguistics). |
| Brown corpus | free | Balanced million-word corpus developed in the 1960s. Available at the link to the left and multiple other locations listed in Gries & Newman. |
| Corpus of Contemporary American English | free | The corpus contains more than 560 million words of text (20 million words each year, 1990-2017), equally divided among spoken, fiction, popular magazines, newspapers, and academic texts. |
| International Computer Archive of Modern and Medieval English Collection | purchase by CD-ROM | Created by an organization of linguists and information scientists who aim to distribute English language material for machine learning, along with compiling an archive of English text corpora for research institutions. |
| Southern Oral History Program | free | A collection of oral history interviews from the South, including several hundred (possibly thousands; the total number is unclear) audio files, transcripts of some of the interviews, and the occupation, ethnicity, date of birth, and gender of some of the interviewees. |
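
The size figures quoted in the table above can be cross-checked with simple arithmetic. The snippet below is only a sanity check on numbers already given in this list (Uppsala essay length, ICE corpus size, COCA total); the variable names are made up for illustration, and it says nothing about the current state of the corpora themselves.

```python
# All figures are copied from the table above; variable names are illustrative.
uppsala_words = 1_221_265
uppsala_essays = 1_489
uppsala_mean = uppsala_words / uppsala_essays  # mean essay length in words

# Each ICE corpus: 500 texts of roughly 2,000 words each.
ice_words_per_corpus = 500 * 2_000

# COCA: 20 million words per year, 1990 through 2017 inclusive.
coca_years = len(range(1990, 2018))
coca_words = coca_years * 20_000_000

print(f"Uppsala mean essay length: {uppsala_mean:.0f} words")
print(f"ICE words per corpus: {ice_words_per_corpus:,}")
print(f"COCA total: {coca_words:,} words over {coca_years} years")
```

The three results agree with the rounded figures in the table: about 820 words per essay, about 1 million words per ICE corpus, and 560 million words for COCA.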

Additional corpora

| Name/link | Access | Summary |
| --- | --- | --- |
| Tagged and Cleaned Wikipedia and its Ngram | freely downloadable | A static version (HTML documents) of the English Wikipedia, downloaded in 2008. Tagged and cleaned; includes n-grams. |
| Corpus of English Dialogues: 1560-1760 | free to download after registration | 177 text files containing constructed dialogue or transcribed authentic speech. |
| Open American National Corpus | free | 15 million words divided among written and spoken sources. Sample sections include Switchboard, Slate Magazine, and ICIC corpus. |
| National Center for Sign Language and Gesture Resources Corpus | free to browse online, can download annotations | Linguistically annotated ASL data, with multiple synchronized video files showing views from different angles and a close-up of the face, as well as associated linguistic annotations available as XML. Contains 1,866 distinct canonical signs and 11,854 sign tokens. |
| The Bergen Corpus of London Teenage Language (COLT) | free | Collected in 1993; consists of the spoken language of 13- to 17-year-old teenagers from different boroughs of London. The complete corpus, half a million words, has been orthographically transcribed and word-class tagged, and is a constituent of the British National Corpus. |
| Characterising Individual Speakers (CHAINS) | freely downloadable for research purposes | A corpus collected with the aim of facilitating research in speaker identification. Contains novel speech from 36 speakers recorded under a variety of speaking conditions. |
| Santa Barbara Corpus of Spoken American English | freely downloadable online | 249,000 words; includes transcriptions, audio, and timestamps. This corpus contains recordings of naturally occurring spoken interaction from all over the United States, with speakers of different origins, ages, occupations, genders, and ethnic/social backgrounds. The recordings are predominantly face-to-face interaction, but other forms of interaction are also included. |
| Russian National Corpus | free query access online; subset available for download with restrictions | Corpus of the modern Russian language incorporating over 300 million words. Subcorpora include Deeply Annotated corpus, Parallel corpora (English, German, Ukrainian, Belorussian, and multilingual), Dialectal corpus, Poetry corpus, Educational corpus, and Corpus of spoken Russian. |
| Sinica Treebank Corpus Sample | purchase | Sinica Treebank 3.0 contains 6 files, 61,087 syntactic tree structures, and 361,834 words. The tree structures were extracted from the Sinica Corpus, and every structure is segmented and parsed. Each segmented word of a tree structure is tagged with its part-of-speech and argument. Sinica Treebank 3.0 is provided free on the website for syntactic and semantic research use. 1,000 syntactic tree structures are available. |
| Tehran-English Parallel Corpus | free | The corpus consists of approximately 1.22 million sentence fragments of English subtitles and their Persian translations. The corpus is described as "cleaned", but the file itself still contains some formatting that needs to be processed. |
| GICR: General Internet-Corpus of Russian | not available for download; search requires an account, but it's unclear how to make one | 19.8 billion tokens/7 million word forms collected from various Russian-language websites. |
| Leeds University Russian Internet Corpus | free to search | 160-million-word internet corpus of Russian websites. |

Tools and software

| Name/link | Access | Summary |
| --- | --- | --- |
| AntConc Concordancer | free | A freeware corpus analysis toolkit for concordancing and text analysis. |
| CLAWS tagger | purchase | A part-of-speech tagger for English text. |
| Xaira Searching Software | free | Software for indexing and analyzing large XML resources like corpora. |
| ELAN | free | Tool for the creation of complex annotations on video and audio resources. |
| TYPECRAFT | free | Multilingual Interlinear Glossed Text (IGT) bank. |
| FreeLing | free | FreeLing is a C++ library providing language analysis functionalities for a variety of languages (English, Spanish, Portuguese, Italian, French, German, Russian, Catalan, Galician, Croatian, Slovene, among others). |
| jEdit | free | A Java-based text editor that has useful features for formatting texts for corpus-based research. For example, it accepts many different language encodings, allows for search and replace over multiple files, and features search and replace operations using regular expressions. |
| Voyant tools | free | Web-based text reading and analysis environment, with a rich array of data visualizations. |
| ICECUP | free | ICECUP 3.1 (the ICE Corpus Utility Program) is a state-of-the-art corpus exploration program designed for parsed corpora such as ICE-GB and DCPSE. |
| ParsPer | free | This parser POS-tags Persian sentences with up to 87% accuracy. It was trained on the Uppsala Persian Dependency Treebank. |
| Stanford Parser | free | A statistical parser for English and a few other languages, developed at Stanford. It is written in Java and can also be run from the command line. |
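
For readers new to the term, "concordancing" (as in the AntConc entry above) means listing every occurrence of a search word together with its surrounding context, often called keyword-in-context (KWIC). The sketch below is a minimal illustration in plain Python, assuming a short in-memory string; the function name `kwic` and its `width` parameter are hypothetical and do not reflect how AntConc or any other tool in this list is implemented.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of keyword.

    Hypothetical helper for illustration only; not part of any listed tool.
    """
    lines = []
    # Whole-word, case-insensitive match; re.escape guards special characters.
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

sample = "The corpus is balanced by genre. Each corpus contains 500 texts."
for line in kwic(sample, "corpus", width=20):
    print(line)
```

A dedicated concordancer adds sorting, frequency lists, and support for large files, so for real corpus work the tools in the table are the better choice.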