mduering edited this page Jul 1, 2019 · 2 revisions

FAQs

The corpus

newspapers as historical source + lit. references for CH and LU newspapers

Which newspapers do you have in your corpus?

In a nutshell, the impresso corpus contains the historical newspaper collections of the Swiss National Library, the National Library of Luxembourg, the Neue Zürcher Zeitung, Le Temps, the Valais State Archives and the Swiss Economic Archives. We recommend that you take a closer look at our overview of newspapers.

Why can’t I see everything? How do I get access to the full corpus?

For legal reasons we can only show a subset of the newspapers. To gain access to the whole collection, you need to sign a non-disclosure agreement (NDA), which is available for download here. We will provide you with a user account once we have received the signed NDA back from you.

Can I download everything? Do you have an API?

impresso users can download text and metadata for a maximum of 10,000 articles in the form of a .csv file. This allows, for example, further processing such as topic modeling on personally curated corpora. For advanced users we provide access via an API. If this is of interest to you, please contact us at info@impresso-project.ch.
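As an illustration of what further processing of such a .csv file could look like, here is a minimal Python sketch. The column names (`uid`, `newspaper`, `date`, `content`), the comma delimiter and the sample rows are assumptions made up for this example; check the header of your own exported file and adapt accordingly.

```python
import csv
import io
from collections import Counter

# Made-up stand-in for an exported file. The real export's column names,
# delimiter and content may differ; these rows are purely illustrative.
SAMPLE_EXPORT = """uid,newspaper,date,content
a1,paper_a,1914-08-01,First example article text
a2,paper_a,1914-08-02,Second example article text
a3,paper_b,1914-08-01,Third example article text
"""


def load_articles(handle):
    """Read exported rows into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


def articles_per_newspaper(articles):
    """Count how many articles each newspaper contributes."""
    return Counter(row["newspaper"] for row in articles)


def tokenize(text):
    """Lowercase and split on whitespace, a common first step before topic modeling."""
    return text.lower().split()


# With a real file you would use: open("my_export.csv", newline="", encoding="utf-8")
articles = load_articles(io.StringIO(SAMPLE_EXPORT))
counts = articles_per_newspaper(articles)
documents = [tokenize(row["content"]) for row in articles]
```

The resulting `documents` list (one token list per article) is the kind of input most topic modeling libraries expect, while `counts` gives a quick overview of how a curated corpus is distributed across titles.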


Computational processing of historical newspapers

What is OCR?

What is Topic Modeling?

And how was it applied to this collection?

What is a Named Entity?

Named entities are entities that can be referred to by name, that is, identifiable persons, institutions and locations. The important criterion here is the name: it differentiates, for instance, a common noun such as “pope” from the mention of a particular named entity such as “Pope Francis”. The automated recognition of named entities (NER) works very well for born-digital texts but poses challenges when applied to historical, often imperfect text. NER automatically detects mentions of, for example, a person in a text. In a second step we try to link each mention to a large database of already identified entities. This allows us to link mentions of a person named “Winston Churchill” across the corpus to the former British prime minister. The improved automated recognition of named entities in historical texts is one of impresso’s research objectives.
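To make the two steps (recognition, then linking) more concrete, here is a deliberately naive Python sketch. It is not how impresso processes the corpus: the capitalization heuristic, the tiny `KNOWLEDGE_BASE` dictionary and its identifiers are all hypothetical and only illustrate the idea.

```python
import re

# Hypothetical miniature "database" of already identified entities:
# surface form -> made-up entity identifier.
KNOWLEDGE_BASE = {
    "Winston Churchill": "person:winston_churchill",
    "Pope Francis": "person:pope_francis",
}

# Naive recognition heuristic: two or more consecutive capitalized words.
# Real NER systems are far more sophisticated, especially for noisy OCR text.
MENTION_PATTERN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def recognize(text):
    """Step 1: detect candidate mentions of named entities in a text."""
    return MENTION_PATTERN.findall(text)


def link(mention):
    """Step 2: look the mention up in the knowledge base (None if unknown)."""
    return KNOWLEDGE_BASE.get(mention)


text = "Yesterday the pope spoke in Rome, while Winston Churchill travelled to Paris."
mentions = recognize(text)
linked = {mention: link(mention) for mention in mentions}
```

Note that the common noun “pope” is not picked up, whereas “Winston Churchill” is detected and linked to an identifier. The single-word place names are missed by this heuristic, which hints at why robust recognition is a research problem in its own right.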


Errors and Feedback

I have noticed mistakes in the texts and among the entities. Where do they come from?

We use state-of-the-art tools to improve the quality of the OCR and to identify persons, locations and institutions. Inevitably, these tools sometimes fail and make mistakes, which we need to remain aware of. But we believe that, despite these imperfections, the opportunities offered by the automated enrichment of historical texts far outweigh the downsides.

I have a problem / I would like to report an error / I would like to give feedback

The impresso interface remains under active development and we will add new features in the coming months. We always look forward to hearing from you and learning how you have made use of impresso’s tools. To leave us feedback, please click on the black envelope on the lower right of the interface. We will get back to you soon after.


About impresso

Who is behind impresso?

The impresso project is a Swiss-Luxembourgish research project dedicated to the computational enrichment of historical newspapers and the development of new workflows for (digital) historians. The core team consists of computational linguists, designers/developers and historians based at the DHLAB of the École polytechnique fédérale de Lausanne (EPFL), the Institute of Computational Linguistics at the University of Zurich and the Luxembourg Centre for Contemporary and Digital History (C2DH). The project is funded by the Swiss National Science Foundation (grant CRSII5_173719). Take a look at our project homepage for more details.