% Title
% Daniel Wheeler
% 2014-04-27
Simulation and Metadata Management
Daniel Wheeler • April 29, 2014 • Diffusion Workshop
Automate
scientific/academic code developer
run/manage simulations (code monkey)
an epic Pythonista (according to OSRC)
FiPy developer
interested in reproducible research, see
@wd15dan
A declarative metadata standard
that you can use to tell a Linux VM how to download your data, execute your computational analysis, and spin up an interface to a literate computing environment with the analysis preloaded. Then we can provide buttons on scientific papers that say "run this analysis on Rackspace! or Amazon! Estimated cost: $25".
Automated integration tests for papers
where you provide the metadata to run your analysis while you're working on your paper and a service automatically pulls down your analysis source and data, runs it, and generates your figures for you to check. Then when the paper is ready to submit, the journal takes your metadata format and verifies it themselves, and passes it on to reviewers with a little "reproducible!" tick mark.
ideas by *C. Titus Brown*
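To make the idea of a declarative description concrete, here is a minimal sketch of what such a record might contain and how a service could check it before running anything. The field names (`data_url`, `command`, `interface`, `expected_figures`), the placeholder URL and the `validate` helper are all hypothetical; no existing standard is implied.

```python
import json

# Hypothetical field names; they do not come from any existing standard.
REQUIRED_FIELDS = {"data_url", "command", "interface", "expected_figures"}


def validate(metadata):
    """Return a sorted list of required fields missing from the metadata."""
    return sorted(REQUIRED_FIELDS - set(metadata))


paper_metadata = {
    "data_url": "https://example.org/data.tar.gz",  # placeholder URL
    "command": ["python", "script.py", "params.json"],
    "interface": "notebook",
    "expected_figures": ["figure1.png"],
}

# An empty list means the description is complete enough to attempt a run.
missing = validate(paper_metadata)

# The record is plain data, so it serializes to JSON without extra work.
serialized = json.dumps(paper_metadata, sort_keys=True, indent=2)
```

Keeping the description as plain data rather than as a script is what would let a journal or reviewer inspect it before executing anything.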
maintains history of workflow changes
but not workflow usage
already integrated into the scientific development process
$ git init
$ git add file.txt
$ git commit -m "add file.txt"
$ edit file.txt
$ git commit -am "edit file.txt"
$ git log --oneline --abbrev=12
e00433e69a43 edit file.txt
12e3c2618143 add file.txt
$ git push github master
Manage Complexity
provide a **unique ID (SHA checksum)** for every workflow execution
capture **metadata**, not data
**not** workflow control or version control
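One possible way to derive a checksum-style ID from metadata alone is sketched below. It hashes a canonical JSON form of the metadata with SHA-1. The `execution_id` helper and the example records are hypothetical, and this is not a description of how any particular tool generates its labels.

```python
import hashlib
import json


def execution_id(metadata, length=12):
    """Derive a short hexadecimal ID from a metadata dictionary.

    The metadata is serialized with sorted keys so that the same content
    always gives the same ID, whatever order the keys were inserted in.
    """
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha1(canonical.encode("utf-8")).hexdigest()
    return digest[:length]


# Hypothetical metadata for two executions; the values are placeholders.
run_a = {
    "parameters": {"wait": 3},
    "main_file": "script.py",
    "timestamp": "2014-04-21T16:07:52",
}
run_b = dict(run_a, timestamp="2014-04-21T16:08:10")

id_a = execution_id(run_a)
id_b = execution_id(run_b)  # differs from id_a because the timestamp differs
```

Including the timestamp in the hashed metadata is what makes two executions with identical parameters get different IDs.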
partial solution: **Sumatra**, a simulation management tool (not workflow)
**doesn't change my workflow**
records the **metadata** (not the data): parameters, environment, data location, time stamps, commit message, duration, data hash
generates **unique ID** for each simulation
$ smt init smt-demo
$ smt configure --executable=python --main=script.py
$ # python script.py params.json
$ smt run --tag=demo --reason="create demo record" params.json wait=3
Record label for this run: '0c50797f1e3f'
No data produced.
Created Django record store using SQLite
$ smt list --long
------------------------------------------------------------------------
Label : 6c9c7cd2bbc2
Timestamp : 2014-04-21 16:07:52.100838
Reason : create demo record
Outcome :
Duration : 3.26091217995
Repository : GitRepository at /home/wd15/git/diffusion-worksho ...
Main_File : script.py
Version : 08d04df6a9b561eb146d3a7461f763869fdc48a7
Script_Arguments : <parameters>
Executable : Python (version: 2.7.6) at /home/wd15/anaconda/bi ...
Parameters : {
: "wait": 3
: }
Input_Data : []
Launch_Mode : serial
Output_Data : []
User : Daniel Wheeler <daniel.wheeler2@gmail.com>
Tags : demo
Repeats : None
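The kind of record listed above can be approximated by hand, which helps show what "metadata, not data" means in practice. The sketch below wraps a function call and collects a few of the same fields. It is an illustration only, not Sumatra's implementation: `run_and_record`, `simulate`, the random 12-character label and the use of `repr(result)` as a stand-in for output files are all assumptions.

```python
import hashlib
import platform
import sys
import time
import uuid
from datetime import datetime


def run_and_record(func, parameters, reason="", tags=()):
    """Call func(**parameters) and return (result, metadata record)."""
    started = datetime.now()
    tic = time.perf_counter()
    result = func(**parameters)
    duration = time.perf_counter() - tic
    record = {
        "label": uuid.uuid4().hex[:12],  # assumption: random 12-character label
        "timestamp": started.isoformat(),
        "reason": reason,
        "duration": duration,
        "executable": sys.executable,
        "python_version": platform.python_version(),
        "parameters": dict(parameters),
        "tags": list(tags),
        # The hash of the result's text form stands in for a hash of output files.
        "data_hash": hashlib.sha1(repr(result).encode("utf-8")).hexdigest(),
    }
    return result, record


def simulate(wait):
    """Hypothetical stand-in for script.py: sleep, then return a value."""
    time.sleep(wait)
    return {"waited": wait}


result, record = run_and_record(
    simulate, {"wait": 0.01}, reason="create demo record", tags=["demo"]
)
```

The record holds only a hash of the output, so it stays small no matter how large the simulation data is.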
high level data manipulation
quickly mix parameters, metadata and output data in a dataframe
save Sumatra records as an HDF5 file
disseminate instantly using [nbviewer.ipython.org](http://nbviewer.ipython.org/)
$ smt export
$ ipython
>>> import json, pandas
>>> with open('.smt/records_export.json') as f:
...     data = json.load(f)
>>> df = pandas.DataFrame(data)
>>> custom_df = df[['label', 'duration', 'tags']]
>>> custom_df
label duration tags
0 6c9c7cd2bbc2 3.260912 [demo]
1 db8610f0c51f 3.248754 [demo]
2 0fdaf12e0cb2 3.247553 [demo]
...
>>> custom_df.to_hdf('records.h5', key='records')
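Mixing parameters with metadata mostly comes down to flattening each nested record into one row per run. The sketch below shows that step with the standard library only. The record layout is a simplified assumption (the real export has more fields and may nest parameters differently), the `wait` value for the second record is assumed, and the `flatten` helper and `param_` prefix are hypothetical.

```python
import json

# Simplified, hypothetical export: a list of records with nested parameters.
exported = json.loads("""
[
  {"label": "6c9c7cd2bbc2", "duration": 3.260912, "tags": ["demo"],
   "parameters": {"wait": 3}},
  {"label": "db8610f0c51f", "duration": 3.248754, "tags": ["demo"],
   "parameters": {"wait": 3}}
]
""")


def flatten(record, prefix="param_"):
    """Return one flat row: metadata fields plus one column per parameter."""
    row = {key: value for key, value in record.items() if key != "parameters"}
    for name, value in record.get("parameters", {}).items():
        row[prefix + name] = value
    return row


rows = [flatten(record) for record in exported]
# rows is a list of flat dictionaries, one per run, which can be passed to
# pandas.DataFrame(rows) to get one column per field.
```

Prefixing the parameter columns keeps them from colliding with metadata fields that happen to share a name.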
cloud service for Sumatra
integrated with GitHub, Buildbot and a VM provider
**sumatra-server 0.1.0** is out!
slides: [wd15.github.io/diffusion-workshop-2014](http://wd15.github.io/diffusion-workshop-2014/)
parallel demo: [github.com/wd15/smt-demo](https://github.com/wd15/smt-demo)