
Technical documentation

zlodej_papiru edited this page Dec 19, 2023 · 2 revisions

Dead Web Resources Database - Technical Documentation

The specialized public Dead Web Resources Database presents, in structured form, data about extinct web resources archived by the Webarchive of the National Library of the Czech Republic. The database is integrated into the curatorial application Seeder and displays data obtained through the Extinct Websites application. Both applications are open source; their source code and documentation are freely available in GitHub repositories. Information about web resources that the Extinct Websites application identifies as dead is transferred to Seeder via the API and presented to the public through the Dead Web Resources Database - https://www.webarchiv.cz/mrtve-weby.

Main components of the dedicated database

  • Seeder - curatorial application for managing resources, harvests and the Webarchive website
  • Extinct Websites - application that automates the identification and description of dead web resources

Seeder

The Dead Web Resources Database is integrated into the curatorial application Seeder, which is used to manage web resources, license agreements, the register of web publishers, the harvesting schedule and the administration of the Webarchive of the National Library of the Czech Republic. The application was developed to meet the needs of the Webarchive. It is written in Python using the Django framework and stores its data in a PostgreSQL database.

Data from Extinct Websites is regularly uploaded to the Seeder application through a custom REST API. The data is then presented on the Webarchive website as statistics, tables, and an interactive chart built with the Chart.js library.

Technical documentation for Seeder

Extinct Websites

The Dead Web Resources Database displays data collected by the Extinct Websites application, which serves as an automated solution for identifying and describing dead websites. The application stores the data in its own internal database and makes it available to curators, who further process, interpret and classify the content. The Extinct Websites application identifies dead sites using status codes that sort the sites into groups; these groups drive further automated processes, such as validating metadata from live sites, the WHOIS database, or historical metadata.

Technical documentation for Extinct Websites

API description

A basic description of the API is available on the Extinct Websites wiki.

To connect to the Dead Web Resources Database, the type parameter must be set to seeder, i.e.:

http://url-aplikace/api/v2/?type=seeder
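As a minimal sketch, the request URL above can be assembled in Python (the language Seeder is written in) using only the standard library. The helper name build_seeder_api_url is hypothetical, and url-aplikace remains a placeholder for the real host. The format of the API response is not described on this page, so the sketch stops at building the URL.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def build_seeder_api_url(base_url: str) -> str:
    """Return the API v2 endpoint URL with type=seeder for a given base URL.

    Hypothetical helper: only the path /api/v2/ and the type=seeder
    parameter come from the documentation; everything else is illustrative.
    """
    parts = urlsplit(base_url)
    query = urlencode({"type": "seeder"})
    # Replace any existing path and query with the documented endpoint.
    return urlunsplit((parts.scheme, parts.netloc, "/api/v2/", query, ""))


# "url-aplikace" is the placeholder host used in the documentation.
print(build_seeder_api_url("http://url-aplikace"))
```

Building the query string with urlencode rather than by hand keeps the parameter correctly escaped if further parameters are added later.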

Dead Web Resources Database

Dead Web Resources Database: https://www.webarchiv.cz/mrtve-weby

The aim of the Dead Web Resources Database is to report on the disappearance of web content and to provide statistics that show how the disappearance of web resources develops over time. The database interprets data obtained from the Extinct Websites application, whose purpose is to establish long-term, regular tracking of disappearing web resources and to record the relevant metadata. The creation of the database was preceded by research into the concept of dead web resources and its methodology, described in the article About dead web resources. How to identify and track defunct web content? The research shows that, given the changing nature of the web, the dead web cannot be strictly defined. When interpreting data from the database, it is therefore important to remember that no website can be definitively labelled as dead. A so-called Deadness Index has been proposed to determine the state of web resources. It is computed automatically, and its individual values are continuously adjusted according to needs arising from practice.

The database contains several parameters based on the API of our Extinct Websites application:

  • URL - a list of URLs of resources linked to archive copies of the Webarchive
  • Date of death detected - the date we detected the death of the web resource
  • Status code - the last detected HTTP status code
  • The date from which the web resource was recorded
  • Deadness index - an index we use to evaluate the threat to the resource, composed of several parameters (page metadata, page content, information from the WHOIS database)
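
The parameters listed above could be modelled as a single record type. The following is only an illustrative sketch: the class name, field names and field types (in particular a numeric Deadness Index) are assumptions and are not taken from the Seeder or Extinct Websites source code, and the example values are invented.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DeadResourceRecord:
    """Hypothetical record mirroring the parameters listed above."""

    url: str                            # URL of the tracked web resource
    death_detected_on: Optional[date]   # date the death was detected, if any
    status_code: Optional[int]          # last detected HTTP status code
    tracked_since: date                 # date from which the resource was recorded
    deadness_index: float               # assumed numeric; real type is not documented


# Invented example values, for illustration only.
example = DeadResourceRecord(
    url="http://example.com/",
    death_detected_on=date(2023, 5, 1),
    status_code=404,
    tracked_since=date(2019, 1, 15),
    deadness_index=3.0,
)
print(example.url, example.status_code)
```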

An important display element is a graph showing the death of sites over time. As its headline figures, the website also shows the number of dead websites identified so far and their percentage of the total number of monitored websites. Users are also offered two tables, both of which can be saved as CSV files: a list of all dead sites and a list of all tracked sites. Because the current statistics are generated dynamically, the statistics on website deaths are also saved for each year.
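
The headline figures, the per-year statistics and the CSV export described above can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not the Seeder implementation: the function names, the column names and the sample values are hypothetical.

```python
import csv
import io
from collections import Counter
from datetime import date
from typing import Dict, Iterable, List


def dead_share(dead_count: int, tracked_count: int) -> float:
    """Percentage of dead websites among all monitored websites."""
    if tracked_count == 0:
        return 0.0  # avoid division by zero when nothing is tracked yet
    return 100.0 * dead_count / tracked_count


def deaths_per_year(death_dates: Iterable[date]) -> Dict[int, int]:
    """Count detected deaths per calendar year, sorted by year."""
    counts = Counter(d.year for d in death_dates)
    return dict(sorted(counts.items()))


def to_csv(rows: List[dict], fieldnames: List[str]) -> str:
    """Serialize table rows to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Invented sample values, for illustration only.
print(dead_share(25, 200))
print(deaths_per_year([date(2022, 1, 1), date(2023, 3, 3), date(2023, 6, 1)]))
print(to_csv([{"url": "http://example.com/", "status_code": 404}],
             ["url", "status_code"]))
```

Storing the result of deaths_per_year once a year would give a fixed snapshot alongside the dynamically generated current statistics.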