This example creates and tests an SVM classifier for the digits dataset using scikit-learn.
The original experiment consists of four steps:
- Data Collection: the digits dataset is collected using scikit-learn and the data is split into training and test sets.
- Classification: an SVM classifier is trained on the training set.
- Prediction: the SVM classifier is tested using the test set.
- Confusion Matrix: a confusion matrix is created using the prediction data.
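The four steps above can be sketched in a single script. This is only an illustration, not the experiment's actual code: the original is split across four scripts, and the split ratio, the absence of shuffling, and the gamma value used here are assumptions borrowed from scikit-learn's own digits classification example. The numbers it prints may therefore differ from the experiment's output.

```python
from sklearn import datasets, metrics, svm
from sklearn.model_selection import train_test_split

# Data collection: load the 8x8 digit images and flatten each one
# into a vector of 64 features.
digits = datasets.load_digits()
X = digits.images.reshape(len(digits.images), -1)
y = digits.target

# Assumption: hold out half of the samples for testing, without shuffling.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, shuffle=False
)

# Classification: fit an SVM classifier (gamma=0.001 is an assumed value).
clf = svm.SVC(gamma=0.001)
clf.fit(X_train, y_train)

# Prediction: classify the held-out test set.
predicted = clf.predict(X_test)

# Confusion matrix: rows are true digits, columns are predicted digits.
cm = metrics.confusion_matrix(y_test, predicted)
print("Confusion matrix:")
print(cm)
```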
The confusion matrix is written to standard output as:
Confusion matrix:
[[87 0 0 0 1 0 0 0 0 0]
[ 0 88 1 0 0 0 0 0 1 1]
[ 0 0 85 1 0 0 0 0 0 0]
[ 0 0 0 79 0 3 0 4 5 0]
[ 0 0 0 0 88 0 0 0 0 4]
[ 0 0 0 0 0 88 1 0 0 2]
[ 0 1 0 0 0 0 90 0 0 0]
[ 0 0 0 0 0 1 0 88 0 0]
[ 0 0 0 0 0 0 0 0 88 0]
[ 0 0 0 1 0 1 0 0 0 90]]
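To read the matrix above, it can help to derive a few summary numbers from it. The following sketch copies the printed values into a plain Python list and computes the overall accuracy and the digit with the lowest recall. It assumes that rows are true digits and columns are predicted digits, which is scikit-learn's convention when the expected labels are passed first; the experiment's scripts are not shown here, so treat that orientation as an assumption.

```python
# Confusion matrix copied from the output shown above
# (assumed layout: rows = true digit, columns = predicted digit).
confusion = [
    [87, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 88, 1, 0, 0, 0, 0, 0, 1, 1],
    [0, 0, 85, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 79, 0, 3, 0, 4, 5, 0],
    [0, 0, 0, 0, 88, 0, 0, 0, 0, 4],
    [0, 0, 0, 0, 0, 88, 1, 0, 0, 2],
    [0, 1, 0, 0, 0, 0, 90, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 88, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 88, 0],
    [0, 0, 0, 1, 0, 1, 0, 0, 0, 90],
]

# Overall accuracy: correctly classified samples (the diagonal)
# divided by all test samples.
total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
accuracy = correct / total

# Per-digit recall: diagonal entry divided by its row sum.
recall = [row[i] / sum(row) for i, row in enumerate(confusion)]
worst = min(range(len(recall)), key=recall.__getitem__)

print(f"accuracy: {correct}/{total} = {accuracy:.3f}")
# accuracy: 871/899 = 0.969
print(f"lowest recall: digit {worst} ({recall[worst]:.3f})")
# lowest recall: digit 3 (0.868)
```

Under that assumption, most of the errors come from the digit 3, which is confused with 5, 7, and 8.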
To run this experiment without ReproZip, you will need to install scikit-learn and run each script with Python, in the aforementioned order.
The ReproZip package is available here (20.0 MB).
All the steps of the experiment can be reproduced as follows:
$ reprounzip vagrant setup digits_sklearn.rpz digits/
$ reprounzip vagrant run digits/
Optionally, you can also reproduce each step individually:
$ reprounzip vagrant run digits/ get_data
$ reprounzip vagrant run digits/ build_classifier
$ reprounzip vagrant run digits/ predict
$ reprounzip vagrant run digits/ evaluate
The VisTrails Workflow
The digits-sklearn experiment shows how you can extend the original pipeline to further analyze the results, or reuse it in your own research.
Recall that ReproZip automatically generates a VisTrails workflow for the experiment, provided that reprounzip-vistrails is installed. This workflow is located at digits/vistrails.vt.
You can replace it with the one we provide here to see how the workflow can be extended to enhance the reproducibility experience. After opening it, replace the value of the directory parameter (from module Directory) with the full path of digits/, and then run the workflow by pressing Execute. You will see that the workflow was extended to provide visualization for the predictions and the confusion matrix.
If you are using our demo VM image, you can also run the original scripts directly, without ReproZip:
$ vagrant ssh
$ workon digits-sklearn
$ cd reprozip-examples/digits-sklearn/
$ python 01_getdata.py
$ python 02_classifier.py
$ python 03_predict.py
$ python 04_confusion_matrix.py