Design of Maps
Thinking out loud to design how maps will work. Here's a brief ASCII drawing of how some of the datasets play out...
- US Census 2010 (Access SQL as Table Definitions)
--|_ Alabama
--|_ Arkansas
--|_ ...
--|_ Minnesota (Zip, 2.98 GB)
----|_ Summary File 1 (Folder)
------|_ Table 1 (.csv, no Header), Table 2 (.csv), Table 3 (.csv), Table 4 (.csv)
--|_ ...
--|_ Texas
The US Census dataset is massive. It is also probably not necessary to get the entire thing if it can be broken out by state. This is another area where it would be useful to consider how things are mapped... could it (or should it) pull anything less than a state? It could technically pull by tables, but it needs to download the whole file anyway. I'm tempted, for now, to say that the model here is
$miner extract uscensus2010
$miner extract uscensus2010 --only minnesota //or something like this
If $miner extract uscensus2010 is executed, a miner map would need to do the following (a rough sketch follows this list):
- Download the primary Access database file, convert it into SQL, use it to create a relational database
- Download the Zip of every state's Summary File 1
- Unzip each one (into its own folder, maybe "uscensus/state_name" for each state)
- Iterate over each state and each csv table and insert it into the database (after some basic cleaning to fix the silly limitations of Access format)
- Delete the Summary File
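Here is a minimal sketch of those steps, assuming Python and SQLite. The base URL, state list, and file names are placeholders rather than the real Census endpoints, and the Access-to-SQL conversion is only noted, not implemented.

```python
# Rough sketch of the uscensus2010 map. URLs, file names, and the state list are
# placeholders, not the real Census endpoints.
import csv
import os
import sqlite3
import urllib.request
import zipfile

STATES = ["alabama", "arkansas", "minnesota", "texas"]  # ...all states in practice
BASE_URL = "http://example.com/census2010"              # placeholder download location

def extract_uscensus2010(db_path="uscensus2010.db"):
    conn = sqlite3.connect(db_path)
    # Step 1 (not shown): convert the Access table definitions to CREATE TABLE
    # statements and run them here so the relational schema exists before loading.
    for state in STATES:
        zip_path = f"{state}_sf1.zip"
        state_dir = os.path.join("uscensus", state)       # e.g. uscensus/minnesota
        urllib.request.urlretrieve(f"{BASE_URL}/{zip_path}", zip_path)
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(state_dir)                       # unzip Summary File 1
        # Iterate over each headerless csv table and insert its rows,
        # after whatever cleaning the Access-format quirks require.
        for name in sorted(os.listdir(state_dir)):
            if not name.endswith(".csv"):
                continue
            table = os.path.splitext(name)[0]
            with open(os.path.join(state_dir, name), newline="") as f:
                for row in csv.reader(f):
                    marks = ",".join("?" * len(row))
                    conn.execute(f'INSERT INTO "{table}" VALUES ({marks})', row)
        conn.commit()
        os.remove(zip_path)                                # delete the Summary File zip
    conn.close()
```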
- NYCPolicePenalties (.csv, w/Header)
If $miner extract nycpolicepenalties is executed, a miner map would need to do the following (sketch after the list):
- Download this .csv file from the NYC Open Data site
- Grab the first line of the .csv file and know it is the header, create new database and table
- Iterate over this .csv file and insert it into a database
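A minimal sketch of that map, again assuming Python and SQLite. The URL below is a placeholder, not the real NYC Open Data endpoint, and the helper is written generically so the same pattern can be reused later.

```python
# Sketch of the nycpolicepenalties map: one headered csv becomes one table.
import csv
import sqlite3
import urllib.request

def load_csv_into_table(url, table, db_path):
    """Download a headered csv and load it into a new table."""
    csv_path = f"{table}.csv"
    urllib.request.urlretrieve(url, csv_path)
    conn = sqlite3.connect(db_path)
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)                      # first line is the header -> column names
        cols = ",".join(f'"{c}"' for c in header)
        conn.execute(f'CREATE TABLE "{table}" ({cols})')
        marks = ",".join("?" * len(header))
        conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)
    conn.commit()
    conn.close()

def extract_nycpolicepenalties():
    load_csv_into_table(
        "https://data.cityofnewyork.us/placeholder.csv",   # placeholder URL
        "nycpolicepenalties",
        "nycpolicepenalties.db",
    )
```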
This one is a relatively simple map, but it sparks the idea of having macro maps or build formulas. For example, what if there were a formula for every NYC Open Data dataset? Could a macro map/build formula simply execute something like $miner extract nycopendata and have that unpack and install every single open dataset?
The same could work for the US Census data. It does NOT make sense to define a separate formula for each piece, though, because the model is the same for every state in the US Census, just as it would be for every dataset in the NYC Open Data catalog (the data are different, but all come in a similarly formatted .csv file, I think?). Wouldn't it be easier to define one map and then allow sub-extraction requests?
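One way to picture that, as a sketch only: a catalog of datasets plus a macro map that loops over it, reusing the load_csv_into_table helper from the sketch above. The dataset slugs and URLs here are invented, not real NYC Open Data identifiers.

```python
# Hypothetical macro map / build formula: one definition, many sub-extractions.
# Slugs and URLs are invented for illustration; the real catalog would be much longer.
NYC_OPEN_DATA = {
    "nycpolicepenalties": "https://data.cityofnewyork.us/placeholder-penalties.csv",
    "nycrestaurantgrades": "https://data.cityofnewyork.us/placeholder-grades.csv",
}

def extract_nycopendata(only=None):
    # `$miner extract nycopendata` installs every dataset;
    # `$miner extract nycopendata --only nycpolicepenalties` is a sub-extraction request.
    targets = [only] if only else list(NYC_OPEN_DATA)
    for slug in targets:
        # load_csv_into_table is the single-csv loader from the previous sketch.
        load_csv_into_table(NYC_OPEN_DATA[slug], slug, "nycopendata.db")
```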
- Hospital_Comparison (Folder, 52 docs, 125 MB)
--|_ Agency_for_Healthcare_Research_and_Quality_Measures (.csv, w/Header)
--|_ ...
--|_ Hospital_Outcomes_of_Care_Measures (.csv, w/Header)
--|_ ...
--|_ Use of Medical Imaging Measures (.csv, w/Header)
This model differs slightly from the models above. Here we need to (a sketch follows the list):
- Download the hospital comparison data
- Unzip the data into a folder "hospital_comparison/"
- Create a single database for all of this data together
- Iterate over each .csv file in the folder
- For each .csv file, grab the header, create a table using the header to determine column names
- Insert data into the table
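Roughly, assuming Python and SQLite again; the download URL is a placeholder, and the folder layout follows the listing above.

```python
# Sketch of the hospital_comparison map: one database, one table per csv in the folder.
import csv
import os
import sqlite3
import urllib.request
import zipfile

def extract_hospital_comparison(db_path="hospital_comparison.db"):
    # Placeholder URL for the hospital comparison download.
    urllib.request.urlretrieve("http://example.com/hospital_comparison.zip", "hc.zip")
    with zipfile.ZipFile("hc.zip") as zf:
        zf.extractall("hospital_comparison")               # unzip into hospital_comparison/
    conn = sqlite3.connect(db_path)
    for name in sorted(os.listdir("hospital_comparison")):
        if not name.endswith(".csv"):
            continue
        table = os.path.splitext(name)[0].replace(" ", "_")
        with open(os.path.join("hospital_comparison", name), newline="") as f:
            reader = csv.reader(f)
            header = next(reader)                           # header row -> column names
            cols = ",".join(f'"{c}"' for c in header)
            conn.execute(f'CREATE TABLE "{table}" ({cols})')
            marks = ",".join("?" * len(header))
            conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)
    conn.commit()
    conn.close()
```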
The difference here from the NYC Open Data dataset above is that this one has multiple .csv files to process, but they all work in the same way. This and the NYC Open Data dataset represent the benefits of open data -- these maps should be relatively simple and could look almost exactly the same, aside from the extra iteration over all the files in a folder (which could easily be done for the NYC Open Data dataset too; the iterator would just run once).
- manifest.csv (.csv, w/Header)
This is a MASSIVE file and it doesn't actually comprise the whole dataset. To get the whole dataset, we'd actually need to download the 2012 Extract. These would maybe be two separate tables in one database? I haven't looked closely enough here to tell yet.
- MainEnronEmailFolder
--|_ DELETIONS.txt
--|_ maildir
----|_ ...
----|_ hernandez-j
------|_ discussion_threads
--------|_ 234 (text file of email)
--------|_ 409 (text file of email)
--------|_ ... (text files of emails)
------|_ sent_mail
--------|_ 256 (text file of email)
------|_ sent
------|_ all_documents
------|_ deleted_items
----|_ scott-s
------|_ discussion_threads
--------|_ 690 (text file of email)
--------|_ 103 (text file of email)
--------|_ ... (text files of emails)
----|_ ward-k
------|_ citizens_utilities (list of emails, custom folder)
------|_ pnm (list of emails, custom folder)
------|_ bhp (list of emails, custom folder)
The Enron Email dataset definitely throws a wrench into the neat cogs we've set up so far. Each email is stored as a separate text file inside folders whose names carry semantic information we'll want to keep. I can imagine doing this in two ways:
- Build a relational database where user, file location, and emails become relationally linked tables.
- Use a doc store (probably easier to run interesting queries against?) and store metadata for each email (less clear on this one).
Either way, it breaks the relatively neat Map model we've been setting up because we need to process the data very differently. Perhaps this will be the exception to the rule and we can just write a separate processor for this, but I think it's worth thinking through what it will mean.
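To make the first option concrete, here is a minimal sketch, assuming Python and SQLite and the maildir layout shown above. The schema and parsing are simplified guesses, not a worked-out design.

```python
# Sketch of the relational option: users, folders, and emails as linked tables,
# so folder names (discussion_threads, sent_mail, custom folders) stay queryable.
import os
import sqlite3
from email import message_from_string

def extract_enron(maildir="MainEnronEmailFolder/maildir", db_path="enron.db"):
    conn = sqlite3.connect(db_path)
    conn.executescript("""
        CREATE TABLE users   (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE folders (id INTEGER PRIMARY KEY, user_id INTEGER, name TEXT);
        CREATE TABLE emails  (id INTEGER PRIMARY KEY, folder_id INTEGER,
                              sender TEXT, subject TEXT, body TEXT);
    """)
    for user in sorted(os.listdir(maildir)):                 # e.g. hernandez-j
        user_dir = os.path.join(maildir, user)
        if not os.path.isdir(user_dir):
            continue
        user_id = conn.execute("INSERT INTO users (name) VALUES (?)", (user,)).lastrowid
        for folder in sorted(os.listdir(user_dir)):           # e.g. discussion_threads
            folder_dir = os.path.join(user_dir, folder)
            if not os.path.isdir(folder_dir):
                continue
            folder_id = conn.execute(
                "INSERT INTO folders (user_id, name) VALUES (?, ?)",
                (user_id, folder)).lastrowid
            for fname in sorted(os.listdir(folder_dir)):      # e.g. 234, 409
                path = os.path.join(folder_dir, fname)
                if not os.path.isfile(path):
                    continue                                   # skip nested subfolders for now
                with open(path, errors="replace") as f:
                    msg = message_from_string(f.read())
                conn.execute(
                    "INSERT INTO emails (folder_id, sender, subject, body) "
                    "VALUES (?, ?, ?, ?)",
                    (folder_id, msg.get("From"), msg.get("Subject"), msg.get_payload()))
    conn.commit()
    conn.close()
```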
A growing trend is doing textual analysis of literature. It'd be great to be able to easily install a load of book files. How would we do that with Project Gutenberg in this model?