Skip to content

A structured list of text corpora, created for use with a corpus downloader.

Notifications You must be signed in to change notification settings

JonathanReeve/corpus-list

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

16 Commits
 
 
 
 

Repository files navigation

Corpus List

A structured list of textual corpora, created for use with corpus downloader. This list is read by the Python module and command-line program corpus-downloader, originally developed for use in DHBox.

How to Add Your Corpus

Do you own or maintain a textual corpus? Please fork this repository and add it to the list. Here are a list of current fields and their descriptions:

shortname: a short and easy-to-type name for your corpus, e.g. shc for the corpus “Shakespeare His Contemporaries.” title: the full name for your corpus, e.g. “Shakespeare His Contemporaries.” categories: a disciplinary label you can assign to your corpus. If if it’s mostly of interest to classics scholars, write “classics.” Current values include “literature,” “classics,” and “history,” but feel free to add your own. languages: The ISO 649-2 language code for the language(s) of your corpus. Examples include deu (German), eng (English), enm (Middle English), and fra (French). For a more complete list, check here. text: information about the actual text(s) of your corpus. markup: how the text is encoded. E.g. TXT (plain text), HTML (HTML files), or TEI (TEI XML). url: the URL for your textual corpus, e.g. http://www.folgerdigitaltexts.org/download/txt/FolgerDigitalTexts_TXT_Complete.zip file-format: the file format of the URL above. In the above example, zip.

- shortname: perseus-c-greek
  title: Perseus Canonical Greek
  categories: classics
  authors: multiple
  languages: grc
  text: 
    markup: TEI
    url: https://github.com/PerseusDL/canonical-greekLit.git
    file-format: git

About

A structured list of text corpora, created for use with a corpus downloader.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published