The Archive for Danish Literature, ADL, comes to you via a collaboration between
As of writing, the corpus comprises 498 volumes with a total of 165512 pages of Danish literature. The whole corpus has been encoded using TEI, but only about two-thirds of the pages have been subject to OCR and text encoding. This repository contains all those texts.
We also describe our data and, in particular, our encoding practices, and we give information on how we envisage submissions could be structured.
- The ADL work
- Connecting works with metadata
- Submission of documents connecting text to facsimile
- Connecting text to facsimile
- Workflows and requirements for new documents
As you might have noticed, all the texts are in an XML format called Text Encoding Initiative (TEI). For many purposes, if not all, that is a good format.
If you want to extract texts from the files, you can use the scripts
The first one (get_titles.xsl) creates a list of works inside a TEI file.
xsltproc get_titles.xsl texts/hcaeventyr01val.xml
workid57967;Eventyr, fortalte for Børn. Første Samling. Første Hefte. 1885.
workid58084;Fyrtøiet
workid59091;Lille Claus og store Claus
workid61051;Prindsessen paa Ærten
workid61317;Den lille Idas Blomster
workid62461;Eventyr, fortalte for Børn. Første Samling. Andet Hefte. 1885.
workid62544;Tommelise
workid64209;Den uartige Dreng
workid64656;Reisekammeraten
...
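The output is one `workid;title` pair per line, which makes it easy to post-process in the shell. A minimal sketch, assuming the first semicolon on a line always separates the work id from the title; the function name `split_titles` is our own illustration and not part of this repository:

```shell
# Hypothetical helper (not part of the repository): read "workid;title"
# lines on stdin and print the id and the title separated by a tab.
split_titles() {
    while IFS= read -r line || [ -n "$line" ]; do
        # Skip lines that contain no semicolon
        case "$line" in
            *";"*) ;;
            *) continue ;;
        esac
        id=${line%%;*}      # everything before the first semicolon
        title=${line#*;}    # everything after the first semicolon
        printf '%s\t%s\n' "$id" "$title"
    done
}
```

It could be used as `xsltproc get_titles.xsl texts/hcaeventyr01val.xml | split_titles`. Titles that themselves contain a semicolon are kept whole, since only the first semicolon is treated as the separator.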
The second script (get_the_text.xsl) creates one text file per title in the TEI file.
Finally, you can adapt the shell script extract_stuff.sh to do both things directly.
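We do not reproduce extract_stuff.sh here. As a rough sketch of the same idea, a loop over all TEI files might look like the following. It assumes that the TEI files live in a directory such as texts/, that get_titles.xsl is in the current directory, and that xsltproc is installed; the function name `list_all_titles` and the output layout are our own assumptions.

```shell
# Hypothetical helper (not the repository's extract_stuff.sh): list the
# works in every TEI file of a directory, one "file;workid;title" per line.
# Assumes get_titles.xsl is in the current directory.
list_all_titles() {
    dir=${1:-texts}
    for f in "$dir"/*.xml; do
        [ -e "$f" ] || continue    # the directory has no .xml files
        # Prefix each "workid;title" line with the file it came from
        xsltproc get_titles.xsl "$f" | while IFS= read -r line; do
            printf '%s;%s\n' "$(basename "$f")" "$line"
        done
    done
}
```

Calling `list_all_titles texts` would then give a single list covering the whole corpus, which can be filtered with standard tools such as grep.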
Projects with relevant scope can contribute documents to ADL, provided that
- Copyright issues are resolved
- They are accepted by DSL and KB
- The XML is valid TEI
A contribution can be submitted as a branch and pull request, as is the usual practice on GitHub.