warc_downloader

This project is a Python script that Archive-It partners can use to download their WARC files and associated metadata.

Overview

This script uses Archive-It's Web Archiving Systems API (WASAPI) and Partner API to download WARC files and associated metadata. The code was developed as part of a Professional Experience project at the UBC iSchool for use by UBC Library Digital Initiatives, with the goal of digitally preserving WARC files captured using Archive-It.
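As a rough illustration of what a WASAPI file listing request might look like, the sketch below builds a query URL for one collection. The endpoint URL and the parameter names (`collection`, `page_size`, `crawl-time-after`, `crawl-time-before`) are assumptions based on Archive-It's public WASAPI conventions, and `build_wasapi_query` is a hypothetical helper, not a function taken from warc_downloader.py. No request is sent; a real call would also need your Archive-It credentials.

```python
from urllib.parse import urlencode

# Assumed Archive-It WASAPI endpoint; confirm against Archive-It's own documentation.
WASAPI_URL = "https://partner.archive-it.org/wasapi/v1/webdata"


def build_wasapi_query(collection, start=None, end=None, page_size=100):
    """Return a WASAPI URL that lists WARC files for one collection.

    start and end are YYYY-MM-DD strings; both are optional.
    """
    params = {"collection": collection, "page_size": page_size}
    if start:
        params["crawl-time-after"] = start   # assumed parameter name
    if end:
        params["crawl-time-before"] = end    # assumed parameter name
    return f"{WASAPI_URL}?{urlencode(params)}"


# All files crawled during 2019: the end date is the first day of the next year.
url = build_wasapi_query(12345, start="2019-01-01", end="2020-01-01")
print(url)
```

The collection number 12345 is a placeholder. The date window mirrors the prompts described under Execution, where the end date is not inclusive.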

Because the files will be preserved in Archivematica, the script organizes downloads in the following Submission Information Package (SIP) structure:

  • ARCHIVEIT_COLLECTION-<collection number>_JOB-<crawl ID>
    • metadata
      • submissionDocumentation
        • <host-list csv>: list of host names and summary data from the hosts report
        • <mimetype-list csv>: list of MIME types and summary data from the file types report
        • <seed-list csv>: list of seed URLs and summary data from the seed report
    • objects
      • <WARC file(s)>

Each package contains one crawl's WARC files and administrative metadata. At present, descriptive metadata is not downloaded by this script.
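The package layout can be sketched in a few lines of Python. This is a minimal illustration, not the script's actual code: `make_sip_dirs` is a hypothetical helper, and it assumes the usual Archivematica transfer layout in which `metadata` and `objects` are sibling folders and the report CSVs live under `metadata/submissionDocumentation`.

```python
import tempfile
from pathlib import Path


def make_sip_dirs(base_dir, collection, crawl_id):
    """Create the folders for one crawl's package and return the package root."""
    sip = Path(base_dir) / f"ARCHIVEIT_COLLECTION-{collection}_JOB-{crawl_id}"
    # Report CSVs (hosts, MIME types, seeds) are assumed to go here.
    (sip / "metadata" / "submissionDocumentation").mkdir(parents=True, exist_ok=True)
    # WARC files go here.
    (sip / "objects").mkdir(parents=True, exist_ok=True)
    return sip


# Try it in a throwaway directory with placeholder collection and crawl IDs.
with tempfile.TemporaryDirectory() as tmp:
    sip = make_sip_dirs(tmp, 12345, 678910)
    print(sip.name)
    print(sorted(p.name for p in sip.iterdir()))
```

Using `exist_ok=True` means a rerun for the same crawl reuses the existing folders instead of failing.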

Prerequisites

  1. Python 3
  2. pipenv

Dependencies

Project Files

| Filename | Description |
| --- | --- |
| warc_downloader.py | Main script |
| Pipfile | Pipfile containing dependencies |
| credentials.env | Example file – edit with your Archive-It credentials |

Setup

  1. Clone or download this repository
  2. Run pipenv install within the project folder
  3. Edit credentials.env, replacing sampleUsername and samplePassword with your Archive-It credentials
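Before running the script, it can help to confirm that the placeholders were actually replaced. The sketch below parses `KEY=VALUE` lines and flags leftover sample values. It assumes credentials.env uses that common format, and the key names `AIT_USERNAME` and `AIT_PASSWORD` are hypothetical; check the example file in the repository for the real ones.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def has_placeholders(creds):
    """Return True if any value is still one of the sample placeholders."""
    return any(v in {"sampleUsername", "samplePassword"} for v in creds.values())


# Hypothetical file contents; the real key names may differ.
example = """
# Archive-It credentials
AIT_USERNAME=sampleUsername
AIT_PASSWORD=samplePassword
"""
creds = parse_env(example)
print(has_placeholders(creds))
```

If the check returns True, the file has not been edited yet and login would be attempted with the sample values.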

Execution

  1. Run pipenv run python warc_downloader.py
  2. Follow the prompts provided:
| Prompt | Notes |
| --- | --- |
| Enter collection number: | Enter the collection number from which to download WARC files. |
| Would you like to narrow further by date? Enter y or n: | y to provide a date range for which WARC files to download, n to proceed with current results. If a collection has more than 100 files, the initial query will only return 100 files, and you will be required to narrow the results by date. |
| Enter a start date (YYYY-MM-DD): | Enter the earliest date for which to retrieve WARC files. |
| Enter an end date (YYYY-MM-DD): | Enter the latest date for which to retrieve WARC files. Note that the end date is not inclusive. For example, to get all files from 2019, use start date 2019-01-01 and end date 2020-01-01. |
| Download files? Enter y or n: | y to download files, n to exit. |
  3. As the files download, watch for any output in red text. The script reports file corruption (an MD5 checksum that did not match) and missing metadata files.
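The corruption check mentioned in the last step amounts to comparing a downloaded file's MD5 digest with the checksum reported for it. The sketch below shows one way to do that with the standard library. `md5_of_file` and `is_intact` are hypothetical names, and the script's own implementation may differ.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected_md5):
    """Return True if the file's MD5 matches the expected checksum."""
    return md5_of_file(path) == expected_md5.lower()


# Demonstration with a small stand-in file rather than a real WARC.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "example.warc.gz"
    sample.write_bytes(b"hello")
    ok = is_intact(sample, "5d41402abc4b2a76b9719d911017c592")
    bad = is_intact(sample, "0" * 32)
print(ok, bad)
```

Reading in chunks keeps memory use flat, which matters because WARC files are often around a gigabyte each.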