Design of Maps
Thinking out loud to design how maps will work. Here's a brief ASCII drawing of how some of the datasets play out...
- US Census 2010 (Access SQL as Table Definitions)
--|_ Alabama
--|_ Arkansas
--|_ ...
--|_ Minnesota (Zip, 2.98 GB)
----|_ Summary File 1 (Folder)
------|_ Table 1 (.csv, no Header), Table 2 (.csv), Table 3 (.csv), Table 4 (.csv)
--|_ ...
--|_ Texas
The US Census dataset is massive. It is also probably not necessary to get the entire thing if it can be broken out by state. This is another area where it would be useful to consider how things are mapped... could it (or should it) pull anything less than a state? It could technically pull by tables, but it needs to download the whole file anyway. I'm tempted, for now, to say that the model here is
$miner extract uscensus2010
$miner extract uscensus2010 --only minnesota //or something like this
If $miner extract uscensus2010 is executed, a miner map would need to do the following (a rough sketch follows this list):
- Download the primary Access database file, convert it into SQL, use it to create a relational database
- Download the Zip of every state's Summary File 1
- Unzip each one (into its own folder, maybe "uscensus/state_name" for each state)
- Iterate over each state and each csv table and insert it into the database (after some basic cleaning to fix the silly limitations of Access format)
- Delete the Summary File
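Here is a minimal sketch of those steps, assuming Python and SQLite. The base URL, state list, and file names are placeholders rather than the real Census endpoints, and the Access-to-SQL conversion is only noted, not implemented.

```python
# Rough sketch of the uscensus2010 map. URLs, file names, and the state list are
# placeholders, not the real Census endpoints.
import csv
import os
import sqlite3
import urllib.request
import zipfile

STATES = ["alabama", "arkansas", "minnesota", "texas"]  # ...all states in practice
BASE_URL = "http://example.com/census2010"              # placeholder download location

def extract_uscensus2010(db_path="uscensus2010.db"):
    conn = sqlite3.connect(db_path)
    # Step 1 (not shown): convert the Access table definitions to CREATE TABLE
    # statements and run them here so the relational schema exists before loading.
    for state in STATES:
        zip_path = f"{state}_sf1.zip"
        state_dir = os.path.join("uscensus", state)       # e.g. uscensus/minnesota
        urllib.request.urlretrieve(f"{BASE_URL}/{zip_path}", zip_path)
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(state_dir)                       # unzip Summary File 1
        # Iterate over each headerless csv table and insert its rows,
        # after whatever cleaning the Access-format quirks require.
        for name in sorted(os.listdir(state_dir)):
            if not name.endswith(".csv"):
                continue
            table = os.path.splitext(name)[0]
            with open(os.path.join(state_dir, name), newline="") as f:
                for row in csv.reader(f):
                    marks = ",".join("?" * len(row))
                    conn.execute(f'INSERT INTO "{table}" VALUES ({marks})', row)
        conn.commit()
        os.remove(zip_path)                                # delete the Summary File zip
    conn.close()
```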
- NYCPolicePenalties (.csv, w/Header)
If $miner extract nycpolicepenalties is executed, a miner map would need to do the following (sketch after the list):
- Download this .csv file from the NYC Open Data site
- Grab the first line of the .csv file and know it is the header, create new database and table
- Iterate over this .csv file and insert it into a database
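A minimal sketch of that map, again assuming Python and SQLite. The URL below is a placeholder, not the real NYC Open Data endpoint, and the helper is written generically so the same pattern can be reused later.

```python
# Sketch of the nycpolicepenalties map: one headered csv becomes one table.
import csv
import sqlite3
import urllib.request

def load_csv_into_table(url, table, db_path):
    """Download a headered csv and load it into a new table."""
    csv_path = f"{table}.csv"
    urllib.request.urlretrieve(url, csv_path)
    conn = sqlite3.connect(db_path)
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)                      # first line is the header -> column names
        cols = ",".join(f'"{c}"' for c in header)
        conn.execute(f'CREATE TABLE "{table}" ({cols})')
        marks = ",".join("?" * len(header))
        conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)
    conn.commit()
    conn.close()

def extract_nycpolicepenalties():
    load_csv_into_table(
        "https://data.cityofnewyork.us/placeholder.csv",   # placeholder URL
        "nycpolicepenalties",
        "nycpolicepenalties.db",
    )
```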
This one is a relatively simple map, but it sparks the idea of having macro maps or build formulas. For example, what if there were a formula for every NYC Open Data dataset? Could a macro map/build formula simply execute something like $miner extract nycopendata and have that unpack and install every single open dataset?
The same could work for the US Census data. It does NOT make sense to define a separate formula for each piece, though, because the model is the same for every state in the US Census, just as it would be for every dataset in the NYC Open Data catalog (the data are different, but all come in a similarly formatted .csv file, I think?). Wouldn't it be easier to define one map and then allow sub-extraction requests?
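One way to picture that, as a sketch only: a catalog of datasets plus a macro map that loops over it, reusing the load_csv_into_table helper from the sketch above. The dataset slugs and URLs here are invented, not real NYC Open Data identifiers.

```python
# Hypothetical macro map / build formula: one definition, many sub-extractions.
# Slugs and URLs are invented for illustration; the real catalog would be much longer.
NYC_OPEN_DATA = {
    "nycpolicepenalties": "https://data.cityofnewyork.us/placeholder-penalties.csv",
    "nycrestaurantgrades": "https://data.cityofnewyork.us/placeholder-grades.csv",
}

def extract_nycopendata(only=None):
    # `$miner extract nycopendata` installs every dataset;
    # `$miner extract nycopendata --only nycpolicepenalties` is a sub-extraction request.
    targets = [only] if only else list(NYC_OPEN_DATA)
    for slug in targets:
        # load_csv_into_table is the single-csv loader from the previous sketch.
        load_csv_into_table(NYC_OPEN_DATA[slug], slug, "nycopendata.db")
```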
- Hospital_Comparison (Folder, 52 docs, 125 MB)
--|_ Agency_for_Healthcare_Research_and_Quality_Measures (.csv, w/Header)
--|_ ...
--|_ Hospital_Outcomes_of_Care_Measures (.csv, w/Header)
--|_ ...
--|_ Use of Medical Imaging Measures (.csv, w/Header)
This model differs slightly from the models above. Here we need to (a sketch follows the list):
- Download the hospital comparison data
- Unzip the data into a folder "hospital_comparison/"
- Create a single database for all of this data together
- Iterate over each .csv file in the folder
- For each .csv file, grab the header, create a table using the header to determine column names
- Insert data into the table
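Roughly, assuming Python and SQLite again; the download URL is a placeholder, and the folder layout follows the listing above.

```python
# Sketch of the hospital_comparison map: one database, one table per csv in the folder.
import csv
import os
import sqlite3
import urllib.request
import zipfile

def extract_hospital_comparison(db_path="hospital_comparison.db"):
    # Placeholder URL for the hospital comparison download.
    urllib.request.urlretrieve("http://example.com/hospital_comparison.zip", "hc.zip")
    with zipfile.ZipFile("hc.zip") as zf:
        zf.extractall("hospital_comparison")               # unzip into hospital_comparison/
    conn = sqlite3.connect(db_path)
    for name in sorted(os.listdir("hospital_comparison")):
        if not name.endswith(".csv"):
            continue
        table = os.path.splitext(name)[0].replace(" ", "_")
        with open(os.path.join("hospital_comparison", name), newline="") as f:
            reader = csv.reader(f)
            header = next(reader)                           # header row -> column names
            cols = ",".join(f'"{c}"' for c in header)
            conn.execute(f'CREATE TABLE "{table}" ({cols})')
            marks = ",".join("?" * len(header))
            conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)
    conn.commit()
    conn.close()
```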
The difference here from the NYC Open Data dataset above is that this one has multiple .csv files to process, but they all work in the same way. This and the NYC Open Data dataset represent the benefits of open data -- these maps should be relatively simple and could look almost exactly the same, aside from the extra iteration over all the files in a folder (which could easily be done for the NYC Open Data dataset too; the iterator would just run once).
- manifest.csv (.csv, w/Header)
This is a MASSIVE file and it doesn't actually comprise the whole dataset. To get the whole dataset, we'd actually need to download the 2012 Extract. These would maybe be two separate tables in one database? I haven't looked closely enough here to tell yet.
- MainEnronEmailFolder
--|_ DELETIONS.txt
--|_ maildir
----|_ ...
----|_ hernandez-j
------|_ discussion_threads
--------|_ 234 (text file of email)
--------|_ 409 (text file of email)
--------|_ ... (text files of emails)
------|_ sent_mail
--------|_ 256 (text file of email)
------|_ sent
------|_ all_documents
------|_ deleted_items
----|_ scott-s
------|_ discussion_threads
--------|_ 690 (text file of email)
--------|_ 103 (text file of email)
--------|_ ... (text files of emails)
----|_ ward-k
------|_ citizens_utilities (list of emails, custom folder)
------|_ pnm (list of emails, custom folder)
------|_ bhp (list of emails, custom folder)
The Enron Email dataset definitely throws a wrench into the neat cogs we've set up so far. Each email is stored as a separate text file inside folders whose names carry semantic information we'll want to keep. I can imagine doing this in two ways:
- Build a relational database where user, file location, and emails become relationally linked tables.
- Use a doc store (probably easier to run interesting queries against?) and store metadata for each email (less clear on this one).
Either way, it breaks the relatively neat Map model we've been setting up because we need to process the data very differently. Perhaps this will be the exception to the rule and we can just write a separate processor for this, but I think it's worth thinking through what it will mean.
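To make the first option concrete, here is a minimal sketch, assuming Python and SQLite and the maildir layout shown above. The schema and parsing are simplified guesses, not a worked-out design.

```python
# Sketch of the relational option: users, folders, and emails as linked tables,
# so folder names (discussion_threads, sent_mail, custom folders) stay queryable.
import os
import sqlite3
from email import message_from_string

def extract_enron(maildir="MainEnronEmailFolder/maildir", db_path="enron.db"):
    conn = sqlite3.connect(db_path)
    conn.executescript("""
        CREATE TABLE users   (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE folders (id INTEGER PRIMARY KEY, user_id INTEGER, name TEXT);
        CREATE TABLE emails  (id INTEGER PRIMARY KEY, folder_id INTEGER,
                              sender TEXT, subject TEXT, body TEXT);
    """)
    for user in sorted(os.listdir(maildir)):                 # e.g. hernandez-j
        user_dir = os.path.join(maildir, user)
        if not os.path.isdir(user_dir):
            continue
        user_id = conn.execute("INSERT INTO users (name) VALUES (?)", (user,)).lastrowid
        for folder in sorted(os.listdir(user_dir)):           # e.g. discussion_threads
            folder_dir = os.path.join(user_dir, folder)
            if not os.path.isdir(folder_dir):
                continue
            folder_id = conn.execute(
                "INSERT INTO folders (user_id, name) VALUES (?, ?)",
                (user_id, folder)).lastrowid
            for fname in sorted(os.listdir(folder_dir)):      # e.g. 234, 409
                path = os.path.join(folder_dir, fname)
                if not os.path.isfile(path):
                    continue                                   # skip nested subfolders for now
                with open(path, errors="replace") as f:
                    msg = message_from_string(f.read())
                conn.execute(
                    "INSERT INTO emails (folder_id, sender, subject, body) "
                    "VALUES (?, ?, ?, ?)",
                    (folder_id, msg.get("From"), msg.get("Subject"), msg.get_payload()))
    conn.commit()
    conn.close()
```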
A growing trend is doing textual analysis of literature. It'd be great to be able to easily install a load of book files. How would we do that with Project Gutenberg in this model?