Tech Group Goals for Dec. 17, 2016 Guerilla Archiving

In each heading, describe the goal and the desired deliverables. List in priority sequence, if possible.

Workflow & Scripts for the EPA Publications Collection

It now looks like the IA will harvest all of the EPA Publications (listed here) through its regular webcrawling process. The webcrawler will not retrieve metadata, but that metadata is recorded in the extracted text files and possibly in the PDFs as well.

Anujan Murugesu has written a code snippet that will do most of the work:

import re
import requests

# Capture the document key that follows "Dockey=" in each link.
docpattern = re.compile(r"Dockey=(\w+)")

# Fetch the publication index page and pull out every document key.
dataSite = requests.get("https://nepis.epa.gov/EPA/html/pubs/pubtitle.html").text
matches = docpattern.findall(dataSite)

# Write one key per line to docKey.txt and the matching PDF URL to links.txt.
with open("docKey.txt", "w") as docFile, open("links.txt", "w") as linkFile:
    for a in matches:
        fileLink = "https://nepis.epa.gov/Exe/ZyPDF.cgi/" + a + ".PDF?Dockey=" + a + ".pdf"
        docFile.write(a + "\n")
        linkFile.write(fileLink + "\n")

Deliverables

Verification: do the PDFs themselves contain metadata?
Code: write a scraper/crawler that can extract metadata from text & PDF files and save them to a simple DB for the use of future researchers. (mostly done already!)
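
The "simple DB" half of this deliverable could be sketched as follows. This is a minimal example under stated assumptions, not an agreed design: it uses SQLite from the Python standard library, and the table layout and the function names (open_db, save_metadata, load_metadata) are hypothetical. It stores arbitrary field/value pairs per document key, since we do not yet know which metadata fields the text files and PDFs actually contain.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (and if needed create) a one-table key/value store for metadata."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS metadata (
               dockey TEXT NOT NULL,
               field  TEXT NOT NULL,
               value  TEXT,
               PRIMARY KEY (dockey, field)
           )"""
    )
    return conn


def save_metadata(conn, dockey, fields):
    """Insert or overwrite every field/value pair for one document."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO metadata (dockey, field, value) "
            "VALUES (?, ?, ?)",
            [(dockey, key, str(value)) for key, value in fields.items()],
        )


def load_metadata(conn, dockey):
    """Return all stored fields for one document as a dict of strings."""
    rows = conn.execute(
        "SELECT field, value FROM metadata WHERE dockey = ?", (dockey,)
    )
    return dict(rows.fetchall())
```

Calling open_db("epa_pubs.sqlite") (a placeholder filename) would persist the table to disk. A key/value layout is chosen here only because it tolerates documents with differing fields; a fixed-column schema may be better once the fields are known.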

Priority

The document collection is very high priority, but the IA will almost certainly scrape it themselves. Flagging this collection as worthwhile and rendering it useful by developing a database are extremely important, but not time-sensitive.

Scraper for the EPA EIS Collection

Environmental Impact Statements (EIS) are a crucial tool in environmental regulation. They are accessible here, and leaving all search criteria blank returns a full listing. Documents are accessible at landing pages two clicks down, which look like this. These landing pages contain metadata which we should scrape.

Deliverables

Code: write a scraper/crawler that can extract metadata from text & PDF files & save them to a simple DB for future researchers.

Priority

EIS are a high-value document source, but it looks like they will also be retrieved by the official IA crawl. So this is not time-sensitive.

Scraper for Environmental Compliance Documents

The Environmental Compliance History database is online here. These documents are more complex to search, for two reasons:

Not all docs in this DB are of the same type.
The query results interface is sort of fancy and slow, hard to read, and contains lots of metadata.

Deliverables

Code: A crawler/scraper that can crawl through the interface and capture as much information as possible, including metadata. As the IA is unlikely to capture this info, we should ALSO attempt to store all results as WARC files which can be passed on to the IA.
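
To make the WARC part of this deliverable concrete, here is a minimal sketch of what one WARC/1.0 "resource" record looks like on disk, written with the standard library only. The function name warc_resource_record and the example URL are hypothetical. A real capture would more likely use an established WARC-writing library and store full HTTP request/response records; this only illustrates the record layout (version line, named header fields, blank line, content block, two trailing CRLFs).

```python
import uuid
from datetime import datetime, timezone


def warc_resource_record(uri, payload, content_type="application/octet-stream"):
    """Return one WARC/1.0 'resource' record as bytes.

    `payload` is the raw body that was retrieved from `uri`.
    """
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    # Header lines end with CRLF; an empty line separates them from the block.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # Every record is terminated by two CRLFs after the content block.
    return head + payload + b"\r\n\r\n"


def append_record(path, record):
    """Append one record to an (uncompressed) .warc file."""
    with open(path, "ab") as fh:
        fh.write(record)
```

For example, append_record("compliance.warc", warc_resource_record("https://example.org/result?id=1", html_bytes, "text/html")) would add one captured page to a file, where the filename, URL, and html_bytes are placeholders.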

Priority

This is high-priority data which the IA will likely not be able to obtain, so this is a high priority for us. However, it’s harder than the first two.

FTP Crawling

Many of the publications and datasets published online by the EPA are served over FTP, e.g., as seen here. Can we retrieve these?

Deliverables

Code: A proof-of-concept FTP crawler for this dataset
Text: Assessment: guide for other collaborators dealing with the same problem
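
A proof-of-concept crawler could start from something like the sketch below, which uses ftplib from the Python standard library to list every file under a directory. It assumes the server supports the MLSD command (not all FTP servers do) and allows anonymous login; the function names walk_ftp and list_remote_files are hypothetical. It only lists paths and does not download anything.

```python
from ftplib import FTP


def walk_ftp(ftp, path="/"):
    """Yield the full path of every file under `path`, recursing into subdirs.

    `ftp` is a connected ftplib.FTP object (or anything with a compatible
    mlsd() method). Requires server support for MLSD.
    """
    for name, facts in ftp.mlsd(path):
        kind = facts.get("type")
        full = path.rstrip("/") + "/" + name
        if kind == "dir":
            yield from walk_ftp(ftp, full)
        elif kind == "file":
            yield full
        # "cdir"/"pdir" entries (the directory itself and its parent)
        # and any other types are skipped.


def list_remote_files(host, root="/"):
    """Connect anonymously to `host` and return all file paths under `root`."""
    ftp = FTP(host, timeout=30)
    try:
        ftp.login()  # no arguments means anonymous login
        return list(walk_ftp(ftp, root))
    finally:
        ftp.close()
```

The resulting list of paths could then be written to a file, in the same spirit as links.txt above, and handed to whatever tool does the actual downloading.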

Priority

Medium priority. This problem is one that must be solved eventually.

Developing Guidelines for Storage, Server Space, Mirroring, etc.

We have received many offers of server space donations, and many suggestions about the appropriate way to store this data (e.g., different protocols, mirroring strategies, etc.). Meanwhile, a number of independent parties are setting up their own archives and mirrors of datasets. As a distributed network of volunteers, we need to figure out ways to make these archives robust, accessible, and durable. This is a hard problem!

It is unlikely that U of T will take a leadership role in the actual archiving of data, so our role at this event is to make some first steps in identifying best practices for this part of the project. If anyone has substantial skills in this area, the librarian team could use your help developing some guidelines and standards for our data storage, serving, mirroring, and archiving practices, and for developing webs of trust in a distributed network.

Deliverables

Text: Brief report assessing available options for serving large amounts of data in a sustainable, future-proof, government-proof way.

Priority

Moderate, but if people with this expertise are present, this should probably be their highest priority.

Protocols for Storage of Quantitative Datasets

The retrieval and storage of quantitative datasets is a complex problem, and currently in flux. The IA is hoping to get direct access to DB dumps, or possibly even VM clones of the web interfaces to the currently available datasets, but no concrete arrangements have been made and there is currently no guarantee that their efforts will be successful.

We therefore need to answer a number of very urgent questions:

How do we assess the priority of particular datasets, and target the highest-priority datasets first?
What mechanisms are available for dataset capture (e.g., reverse-engineering via web interfaces, partial downloads, and glue code; or, alternatively, walking into an EPA office and requesting digital copies of the whole thing)? What are the risks and benefits of each?
What kind of metadata needs to be associated with each DB we retrieve (e.g., timestamps, provenance, etc.)?

These are complex questions which we will not resolve at our event. However, we would like to make some progress.
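
As one possible starting point for the metadata question above, a retrieved dataset could be accompanied by a small provenance record. The sketch below is an assumption about what such a record might hold (source URL, UTC retrieval timestamp, capture method, size, and SHA-256 checksum); the field names and the functions provenance_record and write_sidecar are hypothetical and would need review by the library team.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(data, source_url, method):
    """Describe one retrieved dataset (given as bytes) for later verification."""
    return {
        "source_url": source_url,
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        "capture_method": method,  # free text, e.g. "web download"
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def write_sidecar(record, path):
    """Save the record as a JSON file stored next to the dataset."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(record, fh, indent=2, sort_keys=True)
```

The checksum lets anyone holding a mirror confirm that their copy matches what was originally retrieved, which bears on the web-of-trust problem raised in the storage section.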

Deliverables

Text: A preliminary rubric outlining answers to these questions, and suggesting some paths towards answers. We’ll need help from some library people on this one.
Code: If possible, we would like to test-run a database retrieval and identify low-hanging fruit as well as potential difficulties. We would choose a dataset from the EPA Dataset Gateway or a particular dataset identified as high priority.

Priority

Development of the protocols and standards is a high priority. Test-driving code is more of a wishlist.
