Under each heading, we describe the goal and the desired deliverables, listed in priority order where possible.
It now looks like the Internet Archive (IA) will harvest all of the EPA Publications (listed here) through its regular web-crawling process. The crawler will not retrieve metadata, but that metadata is recorded in extracted text files and possibly in the PDFs as well.
Anujan Murugesu has written a code snippet that will do most of the work:
import re
import requests

# Collect every Dockey ID from the EPA publications listing page.
docpattern = re.compile(r"Dockey=(\w+)")
dataSite = requests.get("https://nepis.epa.gov/EPA/html/pubs/pubtitle.html").text
matches = docpattern.findall(dataSite)

# Write each document key and its PDF download link to disk, one per line.
with open("docKey.txt", "w") as docFile, open("links.txt", "w") as linkFile:
    for a in matches:
        fileLink = "https://nepis.epa.gov/Exe/ZyPDF.cgi/" + a + ".PDF?Dockey=" + a + ".pdf"
        docFile.write(a + "\n")
        linkFile.write(fileLink + "\n")
- Verification: do the PDFs themselves contain metadata?
- Code: write a scraper/crawler that can extract metadata from the text & PDF files and save it to a simple DB for future researchers (mostly done already! a sketch of the metadata-to-DB step follows below).
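If the PDFs do turn out to carry embedded metadata, the storage step could look roughly like the sketch below. This is an illustration only, not the finished deliverable: it assumes the PDFs have already been downloaded to paths listed in a hypothetical pdfPaths.txt, and uses pypdf and SQLite as one possible toolchain.

import sqlite3
from pypdf import PdfReader

# Assumed layout: the downloader has saved the PDFs locally and listed their paths,
# one per line, in pdfPaths.txt (a hypothetical file name).
conn = sqlite3.connect("epa_pubs.db")
conn.execute("CREATE TABLE IF NOT EXISTS pdf_metadata"
             " (path TEXT, title TEXT, author TEXT, created TEXT)")

with open("pdfPaths.txt") as paths:
    for path in (line.strip() for line in paths if line.strip()):
        meta = PdfReader(path).metadata or {}  # embedded document-info dictionary
        conn.execute("INSERT INTO pdf_metadata VALUES (?, ?, ?, ?)",
                     (path, meta.get("/Title"), meta.get("/Author"),
                      meta.get("/CreationDate")))

conn.commit()
conn.close()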
The document collection is very high priority, but the IA will almost certainly scrape it themselves. Flagging this collection as worthwhile and making it useful by developing a database are extremely important, but not time-sensitive.
Environmental Impact Statements (EIS) are a crucial tool in environmental regulation. They are accessible here; leaving all search criteria blank returns a full listing. The documents themselves sit on landing pages two clicks down, which look like this. These landing pages contain metadata that we should scrape.
- Code: write a scraper/crawler that can extract metadata from the landing pages and their text & PDF files, and save it to a simple DB for future researchers (a sketch of the landing-page part follows below).
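A starting point for the landing-page part might look like the sketch below. It assumes the metadata is presented as label/value table rows and uses BeautifulSoup; the real selectors need to be confirmed against an actual EIS landing page.

import requests
from bs4 import BeautifulSoup

def scrape_landing_page(url):
    """Return whatever label/value metadata pairs the landing page exposes."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    metadata = {}
    # Assumption: metadata appears as two-cell table rows (label, value).
    for row in soup.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        if len(cells) == 2 and cells[0]:
            metadata[cells[0]] = cells[1]
    return metadata

# The landing-page URLs themselves would come from crawling the blank-search results.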
EIS are a high-value document source, but it looks like they will also be retrieved by the official IA crawl. So this is not time-sensitive.
The Environmental Compliance History database is online here. These documents are more complex to search, for two reasons:
- Not all docs in this DB are of the same type.
- The query results interface is somewhat fancy and slow, hard to read, and contains lots of metadata.
- Code: A crawler/scraper that can crawl through the interface and capture as much information as possible, including metadata. As the IA is unlikely to capture this info, we should ALSO attempt to store all results as WARC files, which can be passed on to the IA (one possible approach is sketched below).
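For the WARC piece, one option (our assumption, not a settled choice) is the warcio library, which can record HTTP traffic made through requests straight into a WARC file:

from warcio.capture_http import capture_http
import requests  # warcio asks that requests be imported after capture_http

# Every response fetched inside this block is appended to the WARC file, which can
# later be handed to the IA. The list of result-page URLs is a placeholder.
with capture_http("echo_results.warc.gz"):
    for url in open("resultPages.txt").read().split():
        requests.get(url)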
This is high-priority data which the IA will likely not be able to obtain, so it is a high priority for us. However, it’s harder than the first two.
Many of the publications and datasets published online by the EPA are served over FTP, e.g., as seen here. Can we retrieve these?
- Code: A proof-of-concept FTP crawler for this dataset (a sketch follows after this list)
- Text: An assessment and guide for other collaborators dealing with the same problem
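A minimal proof of concept could simply walk the directory tree with the standard library’s ftplib, as sketched below; the host name is an assumption and should be pointed at the actual server linked above.

from ftplib import FTP, error_perm

def crawl(ftp, path, files):
    """Recursively collect every file path under `path` on the connected server."""
    ftp.cwd(path)
    for name in ftp.nlst():
        if name in (".", ".."):
            continue
        child = path.rstrip("/") + "/" + name
        try:
            ftp.cwd(child)            # only succeeds if `child` is a directory
            crawl(ftp, child, files)
            ftp.cwd(path)
        except error_perm:
            files.append(child)       # not a directory, so record it as a file

ftp = FTP("ftp.epa.gov")  # assumed host; adjust to the dataset in question
ftp.login()               # anonymous login
files = []
crawl(ftp, "/", files)
ftp.quit()
print("\n".join(files))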
Medium priority. This is a problem that must be solved eventually.
We have received many offers of server space donations, and many suggestions about the appropriate way to store this data (e.g., different protocols, mirroring strategies, etc.). Meanwhile, a number of independent parties are setting up their own archives and mirrors of datasets. As a distributed network of volunteers, we need to figure out ways to make these archives robust, accessible, and durable. This is a hard problem!
It is unlikely that U of T will take a leadership role in the actual archiving of data, so our role at this event is to take some first steps in identifying best practices for this part of the project. If anyone has substantial skills in this area, the librarian team could use your help developing guidelines and standards for our data storage, serving, mirroring, and archiving practices, and for developing webs of trust in a distributed network.
- Text: Brief report assessing available options for serving large amounts of data in a sustainable, future-proof, government-proof way.
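As one small, concrete input to that report (an illustration, not an agreed standard), the sketch below shows how a mirror could publish a SHA-256 manifest of its archive directory so that other mirrors can verify their copies against it.

import hashlib
import json
import os

def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    manifest = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
            manifest[os.path.relpath(path, root)] = digest.hexdigest()
    return manifest

# "archive/" is a placeholder for wherever a mirror keeps its copy of the data.
with open("manifest.json", "w") as out:
    json.dump(build_manifest("archive/"), out, indent=2, sort_keys=True)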
Moderate priority, but if the relevant expertise is present, this should probably be the highest priority for those individuals.
The retrieval and storage of quantitative datasets is a complex problem, and one currently in flux. The IA is hoping to get direct access to DB dumps, or possibly even VM clones of the web interfaces to the currently available datasets, but no concrete arrangements have been made and there is no guarantee that their efforts will succeed.
We therefore need to answer a number of very urgent questions:
- How do we assess the priority of particular datasets, and target the highest-priority datasets first?
- What mechanisms are available for dataset capture (e.g., reverse-engineering via web interfaces, partial downloads, and glue code; or, alternatively, walking into an EPA office and requesting digital copies of the whole thing)? What are the risks and benefits of each?
- What kind of metadata needs to be associated with each DB we retrieve (e.g., timestamps, provenance, etc.)?
These are complex questions which we will not resolve at our event. However, we would like to make some progress:
- Text: A preliminary rubric addressing these questions and suggesting some paths towards answers. We’ll need help from some library people on this one.
- Code: If possible, we would like to test-run a database retrieval and identify low-hanging fruit as well as potential difficulties. We would choose a dataset from the EPA Dataset Gateway or a particular dataset identified as high priority (a minimal sketch of such a test run appears below).
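A test run might look like the sketch below: stream a single dataset file to disk and record basic provenance alongside it. The URL is a placeholder, and the provenance fields are only a suggestion pending answers to the questions above.

import datetime
import hashlib
import json
import requests

DATASET_URL = "https://example.gov/path/to/dataset.csv"  # placeholder, not a real dataset
OUT_FILE = "dataset.csv"

# Stream the file in chunks so large datasets don't have to fit in memory,
# computing a checksum as we go.
sha256 = hashlib.sha256()
with requests.get(DATASET_URL, stream=True) as response, open(OUT_FILE, "wb") as out:
    response.raise_for_status()
    for chunk in response.iter_content(chunk_size=1 << 20):
        out.write(chunk)
        sha256.update(chunk)

# Record where the data came from, when, and what exactly we got.
with open(OUT_FILE + ".provenance.json", "w") as meta:
    json.dump({
        "source_url": DATASET_URL,
        "retrieved_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "sha256": sha256.hexdigest(),
    }, meta, indent=2)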
Development of the protocols and standards is a high priority; test-driving code is more of a wish-list item.