
How to Get Data Sets for Python Tests


What problem is being solved here?

If you just want to run tests on your laptop, or some single machine,

you want to run a large variety of tests, sourcing data from your laptop, so you need to get the datasets onto it. A smaller number of tests source data from hdfs inside the 0xdata network, but those are mostly for testing hdfs, not general h2o algo behavior. Some tests always source data directly from s3n; those need to run on ec2 and are also not interesting here (the goal was to run tests on your laptop). You'll be interested in the tests in the testdir_single_jvm, testdir_multi_jvm, and testdir_hosts directories.

If you want to run tests on remote machines,

i.e. the 161-180 machines: they are already set up with the 'home-0xdiag-datasets' bucket. But since h2o 'putfile' sources data from the machine running the python, you still need to get the datasets onto that machine too (typically your laptop). The majority of tests use the h2o 'putfile' method (it's not parallel, so not the fastest, but it's good for files up to 4GB or so).

All the tests can run on remote machines, using the -cj argument to point at a config .json, or by running in a directory that has a pytest_config-<username>.json matching your username.

Tests are "binned" somewhat by typical run environment

Some tests are meant to run only in certain environments, like testdir_0xdata_only or testdir_ec2_only. Some are environment-specific and take hours to run, like testdir_0xdata_slow or testdir_ec2_slow.

Michal has some tests he runs for his own checking in testdir_ec2. Those don't necessarily conform to the import_parse() standard, and may use methods that call the h2o json api more directly.

Other testdirs, or repos with tests, may be created for specific needs like benchmarking or performance testing. For instance, h2o_perf is a repo with some testdirs for KMeans testing across a number of machine configurations. In general, they should follow the guidelines here for managing datasets, but it's not required.

If you want to write tests,

You need to know the current preferred mechanism for specifying how to get a dataset into h2o from python. Then you need to copy the dataset into the right place in s3, and onto all the local 161-180 machines and the two ec2 machines that run single-node for jenkins (or tell Kevin to).

Once that has been done, and your test works running locally, on 0xdata machines remotely, and on ec2 single- or multi-machine, you can say you've created a new test with a new dataset that fits into the overall testing framework. Or you can use existing datasets, or create synthetic datasets on the fly.

You can push new datasets into h2o's smalldata, or into the datasets repo. But those have gotten big enough that you really shouldn't add anything large to them anymore.

git pull origin master

Your normal git clone of h2o gets the java source code for h2o, but also gets the python tests in h2o/py and "small" datasets in h2o/smalldata.

git clone datasets

This gets you another repo containing larger datasets. It should be placed parallel to the h2o repo above (i.e. as a sibling directory).

Other special buckets like 'home-0xdiag-datasets'

We use the phrase 'buckets' like Amazon S3, even though it's not necessarily S3: a bucket is just a "place" that marks the start of a directory tree that has datasets. So far, the tests only have granularity for one of these buckets, called 'home-0xdiag-datasets'. This directory used to live everywhere as the fixed path '/home/0xdiag/datasets'. Now it doesn't need to be at a fixed path. It is also an s3 bucket called 'home-0xdiag-datasets'.

All new "buckets" should be named using Amazon-legal bucket names. Notably: no underscores! Feel free to create new buckets, and add information on syncing them at the top of this page. The more granularity the better, if it means people can selectively download things to limit bandwidth and storage demands.
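For instance, a rough check of the naming rule (the regex below is my own approximation of the S3 rules, not something from the test framework):

    import re

    def looks_like_legal_bucket_name(name):
        # Approximation of S3 bucket naming rules: 3-63 chars, lowercase
        # letters, digits, hyphens and dots, starting and ending with a
        # letter or digit. Notably: no underscores.
        return bool(re.match(r'^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$', name))

    assert looks_like_legal_bucket_name('home-0xdiag-datasets')
    assert not looks_like_legal_bucket_name('home_0xdiag_datasets')  # underscores are illegal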

home-0xdiag-datasets is currently not a publicly accessible s3 bucket.

s3cmd is available here and has nice command line stuff for s3 (get/put/sync etc): http://s3tools.org/s3cmd

Cloudberry explorer is a gui tool for windows, available here: http://www.cloudberrylab.com/free-amazon-s3-explorer-cloudfront-IAM.aspx

I use DragonDisk on ubuntu for a gui tool: http://www.dragondisk.com/

I'm using S3 Explorer on linux (it has mac/windows versions also), and it's really great for copying buckets. Multi-threaded, so really fast (there are multi-threaded python scripts out there for bucket copy, but this gui lets you pick and choose things within a bucket to copy, i.e. better features). Almost might be worth paying for!: http://www.bucketexplorer.com/be-download.html

If you've got a Mac, I think 3Hub just went to free this year?: http://www.3hubapp.com/

Or you can use the Amazon browser based tools.

There are other s3 tools. Google for 's3 browser' or 's3 explorer'

You probably don't care about all tests yet

The 'home-0xdiag-datasets' bucket on s3 has a lot of stuff. Some of it isn't used by current python tests. The directories on the 161-180 machines have a smaller set. (I'm currently paring down the s3 bucket and moving the other files to another bucket.)

So it may be best to copy from /home/0xdiag/home-0xdiag-datasets on existing 161-180 machines for now.

Looking at the size of subdirectories in home-0xdiag-datasets (from a du):

18990304	./standard
54913032	./mnist
1067408	./libsvm
29452	./ncaa
40690168	./billions
33466380	./manyfiles-nflx-gz

Minimally you'll want

s3://home-0xdiag-datasets/standard  
    equivalently you can cd to your local home-0xdiag-datasets directory and:
    scp -r 0xdiag@192.168.1.180:/home/0xdiag/home-0xdiag-datasets/standard .
s3://home-0xdiag-datasets/libsvm
    equivalently you can cd to your local home-0xdiag-datasets directory and:
    scp -r 0xdiag@192.168.1.180:/home/0xdiag/home-0xdiag-datasets/libsvm .

others:

s3://home-0xdiag-datasets/mnist
s3://home-0xdiag-datasets/allstate
s3://home-0xdiag-datasets/ncaa

The directory 'manyfiles-nflx-gz' has many randomly permuted versions of one file (gzipped files). It is more efficiently created locally by taking one file (file_1.dat.gz) and replicating it locally under incrementing numbers in the file name (matching the names at s3).

Now you might be tempted to download just one file and copy it, but:

The problem is that, because of things like enums forcing cols to NA at certain limits, we really want the "exact" randomly-generated files to be used the same everywhere (for consistency of test and debug).

The filenames are important, so they would need to be created exactly the same. But it's just a file copy, file_1.dat.gz to file_300.dat.gz. The s3 bucket has more than 300 files, but only 300 are needed for the tests.

s3://home-0xdiag-datasets/manyfiles-nflx-gz

This is a case where it may be better to copy the directory from the local machines (higher bandwidth compared to downloading from S3).
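If you do recreate manyfiles-nflx-gz locally from file_1.dat.gz as described above, a minimal sketch (make sure the resulting names match what's in the s3 bucket exactly):

    import shutil

    # run inside your local home-0xdiag-datasets/manyfiles-nflx-gz directory;
    # assumes file_1.dat.gz is already there
    for i in range(2, 301):
        shutil.copy('file_1.dat.gz', 'file_%d.dat.gz' % i)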

Synthetic datasets created by tests

Tests can also synthetically generate smallish datasets in a local directory called 'syn_datasets'. This directory is cleaned at the start of any test that generates synthetic datasets. On a test failure, you can look there to see the last file created and, for instance, test it standalone in a browser.
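A minimal sketch of that pattern (it assumes the test framework has already built a cloud, and uses the h2i alias for h2o_import2 as in the examples below):

    import os, random
    import h2o_import2 as h2i

    SYNDATASETS_DIR = './syn_datasets'
    if not os.path.exists(SYNDATASETS_DIR):
        os.makedirs(SYNDATASETS_DIR)

    # write a small random csv, then upload it with schema='put'.
    # no bucket is needed, since a relative path is used.
    csvPathname = SYNDATASETS_DIR + '/syn_10x4.csv'
    with open(csvPathname, 'w') as f:
        for r in range(10):
            f.write(','.join([str(random.randint(0, 9)) for c in range(4)]) + '\n')

    parseResult = h2i.import_parse(path=csvPathname, schema='put')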

Side notes on the 'sandbox' directory when running python tests.

You can get the exact parameters used in h2o by looking either at the python stdout, or better yet, at sandbox/commands.log after a failure. That has a trace of all the urls sent to the json api, non-encoded so it's human readable (unlike the browser).

Also, sandbox has stdout and stderr logs for each node in an h2o cloud. They have embedded 0-n numbers so you can tell which stdout/stderr files are pairs. They are the redirected h2o stdout/stderr from the remote nodes, captured locally. In general, they should match the h2o logs in the ice root, if those are enabled. The python tests 'grep' these logs for java exceptions, stack traces, and assertion errors at various places in the tests.
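A rough standalone equivalent of that grep (the exact sandbox filenames and patterns the framework uses live in the h2o python package; this is just an illustration):

    import glob, re

    trouble = re.compile(r'Exception|AssertionError|\tat .*\.java:\d+')
    logPaths = sorted(glob.glob('sandbox/*stdout*') + glob.glob('sandbox/*stderr*'))
    for logPath in logPaths:
        with open(logPath) as f:
            for lineNum, line in enumerate(f, 1):
                if trouble.search(line):
                    print("%s:%d: %s" % (logPath, lineNum, line.rstrip()))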

How python tests resolve 'how to find a dataset'

It's possible to send things to h2o using json requests that match what the browser does. In that case h2o typically resolves locations with absolute pathnames. But for tests, we want a test to support a variety of configurations and operational behaviors while still being just one test file.

To accomplish this, the tests have all been written using a single method, import_parse(), that does a variety of path resolutions before passing a request on to h2o. This method is in h2o_import2.py (in h2o/py).

So you can break the question into two parts:

  • What information do the tests provide to a python package (h2o_import2.import_parse())
  • How is that info used?

The 3 key bits of info are: "bucket", "path" and "schema".

  • "bucket" is a name in a filesystem that's found through a variety of means. It should be unique for all possible filesystems, within the bounds of the finding algorithms (The finding algorithms don't just scan your entire disk). It must meet Amazon S3 naming rules: no underscores, for instance. "bucket" is optional, allowing for use of absolute or relative paths from current directory (synthetically generated datasets use this)
  • "path" is an offset within a bucket. That gets you a a file or files. It can have regex in the last part of it (the base). The "last part" may be the only part, if there are no "/" separators in the path.
  • "schema" is how the data is uploaded to h2o: put, local, hdfs, maprfs, s3, s3n are legal choices. Note you never pass something that looks like a URI. Just "bucket" and "path"

Other parameters can be passed to import_parse() to affect the python framework, or to be passed on to h2o. By convention, kwargs is used for the h2o params, as in the example below.
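For example (reusing parameters that appear in the examples further down; separator=44 is a comma):

    import h2o_import2 as h2i

    # bucket/path/schema steer the python framework;
    # separator is passed through to h2o via kwargs
    parseResult = h2i.import_parse(bucket='smalldata', path='iris/iris2.csv',
        schema='put', separator=44)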

Looking inside h2o_import2.py

The first def below can be used to return a full pathname, using the same resolution a local file gets with import_parse() (if returnFullPath=True). Normally it returns a tuple that is used by import_only()/import_parse().

It should be rare that you need to know a full path. An example might be where you want to preprocess a file with a gzip or cat util in python, before h2o gets it.

def find_folder_and_filename(bucket, pathWithRegex, schema=None, returnFullPath=False):
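A hedged sketch of that preprocessing case (the gunzip step is just an illustration; iris_wheader.csv.gz lives in smalldata):

    import gzip, shutil
    import h2o_import2 as h2i

    # resolve the full local path the same way import_parse() would
    fullPath = h2i.find_folder_and_filename('smalldata', 'iris/iris_wheader.csv.gz',
        returnFullPath=True)

    # preprocess before h2o sees it (here: just gunzip it)
    unzippedPath = './iris_wheader_unzipped.csv'
    with gzip.open(fullPath, 'rb') as fIn:
        with open(unzippedPath, 'wb') as fOut:
            shutil.copyfileobj(fIn, fOut)

    parseResult = h2i.import_parse(path=unzippedPath, schema='put')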

import_only() and parse_only() are used by import_parse(). There are rare cases where they may be used individually, but it should be rare (see tests for examples).

def import_only(node=None, schema='local', bucket=None, path=None,
def parse_only(node=None, pattern=None, hex_key=None,

This is the one everything should use: import_parse(). You'll see it in a test .py file like this:

parseResult = h2i.import_parse(bucket='smalldata', path='iris/iris2.csv', schema='put')

in h2o_import2.py you can see more params:

def import_parse(node=None, schema='local', bucket=None, path=None,
    src_key=None, hex_key=None,
    timeoutSecs=30, retryDelaySecs=0.5, initialDelaySecs=0.5, pollTimeoutSecs=180, noise=None,
    benchmarkLogging=None, noPoll=False, doSummary=True, **kwargs):

These also exist in h2o_import2.py, and may be useful for key management (a small usage sketch follows). They're not needed for understanding import_parse() and how dataset locations are resolved.

def find_key(pattern=None):
def delete_keys(node=None, pattern=None, timeoutSecs=30):
def delete_keys_at_all_nodes(node=None, pattern=None, timeoutSecs=30):
def delete_keys_from_import_result(node=None, pattern=None, importResult=None, timeoutSecs=30):
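For example (the patterns here are just illustrative; see the header_from_file example further down for find_key in context):

    import h2o_import2 as h2i

    # find a key by (partial) name after an import_only()
    headerKey = h2i.find_key('syn_header.csv')

    # clean up keys matching a pattern on every node in the cloud
    h2i.delete_keys_at_all_nodes(pattern='syn_datasets')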

schema='put' or schema='local' (default)

These are the two schemas people are generally concerned about, since the datasets come from the local machine that is executing the python.

Since 'git' handles the 'datasets' bucket and the smalldata bucket inside h2o, the only bucket people have questions about is 'home-0xdiag-datasets'

You can copy this from /home/0xdiag/home-0xdiag-datasets on any of the 161-180 machines. (currently, 8/31/13, it's a symbolic link to /home/0xdiag/datasets)

Or you can copy from the 0xdata s3 bucket called 'home-0xdiag-datasets'

The ideal tool for getting/putting stuff from/to s3 is 's3cmd'. Install it on your system, configure it with your secret AWS keys, and you can get/put/sync. Or you can use one of the various s3 browser tools available for your system. (I use 'dragondisk' on ubuntu. Cloudberry S3 explorer is nice for Windows.)

Now: where to copy it. We support a variety of ways of resolving where it's found on your local machine (a rough sketch of the lookup order follows the list):

  • You can copy it anywhere and put a symlink in your home directory called 'home-0xdiag-datasets'
  • You can copy it anywhere and create an environment variable called H2O_BUCKETS_ROOT that points to the directory in which 'home-0xdiag-datasets' exists.
  • You can copy it somewhere on the path above the directory where you execute the python, preferably next to the 'h2o' dir from your git clone.
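Roughly, that lookup looks like this (a sketch only; the real logic is in h2o_import2.py and handles more cases):

    import os

    def find_bucket_root(bucket='home-0xdiag-datasets'):
        # 1. a symlink (or directory) named after the bucket in your home dir
        candidate = os.path.join(os.path.expanduser('~'), bucket)
        if os.path.exists(candidate):
            return candidate
        # 2. H2O_BUCKETS_ROOT points at the directory containing the bucket
        bucketsRoot = os.environ.get('H2O_BUCKETS_ROOT')
        if bucketsRoot and os.path.exists(os.path.join(bucketsRoot, bucket)):
            return os.path.join(bucketsRoot, bucket)
        # 3. walk up from where the python test is executing
        here = os.getcwd()
        while True:
            candidate = os.path.join(here, bucket)
            if os.path.exists(candidate):
                return candidate
            parent = os.path.dirname(here)
            if parent == here:
                raise Exception("Couldn't find bucket %r" % bucket)
            here = parent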

Issues with datasets on remote machines

H2O can get data from hdfs, s3, s3n/hdfs, or import it from the local filesystem on a machine (this requires identical files and locations on all machines in a cluster, i.e. the data must be duplicated on each machine).

So if you execute python tests that drive h2o on remote machines, and the tests use import folder (schema='local'), you want to be sure the remote local filesystems have the data in a place that matches its location on your local system.

Or you can rely on the .json to redirect the path. For instance, most of the .jsons in use at 0xdata use username=0xdiag on the remote machines. The remote bucket root is redirected in the .json like this:

"username": "0xdiag",
"h2o_remote_buckets_root": "/home/0xdiag",

(other config state exists, but is not shown) As a result, 'home-0xdiag-datasets' must exist at /home/0xdiag on the remote machines.

You can override this resolution for all cases, by setting an environment variable locally: H2O_REMOTE_BUCKETS_ROOT

Running on ec2.

Running single-machine tests, like those in testdir_single_jvm and testdir_multi_jvm, resolves datasets just like on your local machine, so those ec2 machines need to be set up like your local machine.

When multi-machine tests are run, and a .json is used, we allow two additional configs in the .json

'redirect_import_folder_to_s3_path': true,

or

'redirect_import_folder_to_s3n_path': true,

This redirects an import folder from the local filesystem to look for the matching bucket name in s3/s3n. It allows tests written with import folder to run on multi-machine configs in ec2 without creating a copy of the big datasets locally (they get them from s3). It tests s3 ingest as well.
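In other words, with one of those flags set, something written for the local filesystem, e.g.:

    parseResult = h2i.import_parse(path='standard/covtype.data',
        bucket='home-0xdiag-datasets', schema='local')

ends up importing from the matching s3/s3n location (roughly s3n://home-0xdiag-datasets/standard/covtype.data) instead of from local disk.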

HDFS testing

There is currently no hdfs ingest testing at ec2. It's 0xdata network only.

There are three hdfs clusters

192.168.1.161 (namenode) is a CDH4, three-datanode cluster
192.168.1.176 (namenode) is now a nine-node CDH3 cluster
192.168.1.171 is a five-node mapr cluster.

To check out the admin pages on them:

192.168.1.161:7180 (admin/admin)
192.168.1.176:7180  (admin/admin)
https://192.168.1.171:8443 (mapr/mapr) (https is required)

On all three, there are files in /datasets. Normally the permissions should be for hdfs or mapr. All new users are put in the hdfs, hduser, and mapr groups, so they should have access.

If you create new files in those filesystems, it's best to match the user/file permissions of existing files. You may want to become user hduser if you create files there from the command line.

Tests resolve to these clusters with schema='hdfs'

Examples of supported cases with import_parse()

add this to the top of a python test

import h2o_import2 as h2i

or expand an existing one

import h2o, h2o_import2 as h2i

Then the parse/algo step should be changed to look like this (this is an RF example, but it's the same idea for your GLMs):

It only looks subtly different compared to what people have used before. I've removed some old methods (the old h2o_import.py is gone, as are the run* methods that combined putfile plus algo; find_dataset goes away, as does parseFile).

For small files, the default timeout is probably fine and you don't need to specify it. I've set the default schema to 'local', so all putfiles (uploads) should use schema='put' (actually it's nice to always specify the schema anyhow, for clarity).

    parseResult = h2i.import_parse(
        bucket='datasets', path='UCI/UCI-large/covtype/covtype.data',
        schema='put', timeoutSecs=30)

    h2o_cmd.runRFOnly(parseResult=parseResult, trees=6, timeoutSecs=35, retryDelaySecs=0.5)

While it doesn't seem like a big deal here, there are other issues that get managed better behind the scenes with this.

The theory here is that you never use a 'parse' method directly, just one of these kinds of lines. (Remember, source files are deleted after parse now, so you always have to get the source again anyhow.)

Here are examples. Note that absolute paths can be specified without a bucket, but that's discouraged. An environment variable at the OS level, H2O_BUCKETS_ROOT, can be set and will allow your bucket to be anywhere on your system, not just looked up from where the python runs.

examples:

    parseResult = h2i.import_parse(path="testdir_multi_jvm/syn_sphere_gen.csv", schema='put')
    parseResult = h2i.import_parse(bucket='my-bucket2', path='dir2/syn_sphere_gen2.csv', schema='put')
    parseResult = h2i.import_parse(bucket='my-bucket3', path='dir3/syn_sphere_gen3.csv', schema='local')
    parseResult = h2i.import_parse(path='/home/kevin/my-bucket2/dir2/syn_sphere_gen2.csv', schema='local')
    parseResult = h2i.import_parse(bucket='my-bucket3', path='/dir3/syn_sphere_gen3.csv', schema='local')

    parseResult = h2i.import_parse(path="testdir_multi_jvm/syn[1-2].csv", schema='local')
    parseResult = h2i.import_parse(path="testdir_multi_jvm/syn[1-2].csv")
    parseResult = h2i.import_only(path="testdir_multi_jvm/syn_test/syn_header.csv")

    parseResult = h2i.import_parse(path="standard/covtype.data", bucket="home-0xdiag-datasets")

Using "header_from_file" can be finessed in one step, but better to use two import steps like this. I'll leave out the subtle issues. 'import_only' is the sub-step (no parse) of import_parse.

    h2i.import_only(path="testdir_multi_jvm/syn_test/syn_header.csv")
    headerKey = h2i.find_key('syn_header.csv')
    # comma 44 is separator
    parseResult = h2i.import_parse(path="testdir_multi_jvm/syn_test/syn[1-2].csv", header=1, header_from_file=headerKey, separator=44)

S3/S3N/HDFS are straightforward. Note that schema='local' is redirected magically to s3n when running multi-machine at ec2 for jenkins, due to node.redirect_import_folder_to_s3_path or node.redirect_import_folder_to_s3n_path created by build_cloud_with_hosts from michal's standard .json.

    parseResult = h2i.import_parse(path='standard/benign.csv', bucket='home-0xdiag-datasets', schema='s3n', timeoutSecs=60)
    parseResult = h2i.import_parse(path='leads.csv', bucket='datasets', schema="hdfs", timeoutSecs=60)
    parseResult = h2i.import_parse(path='/datasets/leads.csv', schema="hdfs", timeoutSecs=60)
    parseResult = h2i.import_parse(path='datasets/leads.csv', schema="hdfs", timeoutSecs=60)
    parseResult = h2i.import_parse(path='standard/benign.csv', bucket='home-0xdiag-datasets', schema='s3', timeoutSecs=60)

There should be enough messaging for it to be obvious if you give a bad param. Tell me if that's not the case, and I'll enhance it.

The layer is pretty lightweight; there are just some subtle issues due to regexes, finding buckets, hdfs/s3 quirks, etc.

So can't you just tell me what datasets are used?

Well, running all the tests takes hours. As a shortcut, I'm running something that lists the "first" dataset each test tries to use; that generally gives you a feel for what's discussed above. These are the results so far for testdir_single_jvm.

datasets bucket, part of datasets repo

put://home/kevin/datasets/bench/covtype/h2o/train.csv
put://home/kevin/datasets/logreg/1mx10_hastie_10_2.data.gz
put://home/kevin/datasets/tweedie/AutoClaim.csv
put://home/kevin/datasets/UCI/UCI-large/covtype/covtype.data

smalldata bucket, part of h2o repo

put://home/kevin/h2o/smalldata/badchars.csv
put://home/kevin/h2o/smalldata/covtype/covtype.20k.data
put://home/kevin/h2o/smalldata/datagen1.csv
put://home/kevin/h2o/smalldata/fail1_100x11000.csv.gz
put://home/kevin/h2o/smalldata/hex-443.parsetmp_1_0_0_0.data
put://home/kevin/h2o/smalldata/hhp_107_01.data.gz
put://home/kevin/h2o/smalldata/iris/iris2.csv
put://home/kevin/h2o/smalldata/iris/iris.csv
put://home/kevin/h2o/smalldata/iris/iris_wheader.csv.gz
put://home/kevin/h2o/smalldata/logreg/benign.csv
put://home/kevin/h2o/smalldata/logreg/logreg_trisum_int_cat_10000x10.csv
put://home/kevin/h2o/smalldata/logreg/syn_2659x1049.csv
put://home/kevin/h2o/smalldata/logreg/umass_statdata/cgd.dat
put://home/kevin/h2o/smalldata/logreg/umass_statdata/clslowbwt.dat
put://home/kevin/h2o/smalldata/mixed_causes_NA.csv
put://home/kevin/h2o/smalldata/parity_128_4_2_quad.data
put://home/kevin/h2o/smalldata/poisson/Goalies.csv
put://home/kevin/h2o/smalldata/poker/poker1000
put://home/kevin/h2o/smalldata/poker/poker-hand-testing.data
put://home/kevin/h2o/smalldata/tnc3_10.csv
put://home/kevin/h2o/smalldata/tnc3.csv
put://home/kevin/h2o/smalldata/Twitter2DB.txt

'home-0xdiag-datasets' bucket. import folder and put

local://home/kevin/home-0xdiag-datasets/mnist/*
local://home/kevin/home-0xdiag-datasets/mnist/mnist_reals_testing.csv.gz
local://home/kevin/home-0xdiag-datasets/mnist/mnist_testing.csv.gz
local://home/kevin/home-0xdiag-datasets/standard/benign.csv
local://home/kevin/home-0xdiag-datasets/standard/covtype20x.data
local://home/kevin/home-0xdiag-datasets/standard/covtype.data

put://home/kevin/home-0xdiag-datasets/mnist/mnist_reals_testing.csv.gz
put://home/kevin/home-0xdiag-datasets/mnist/mnist_testing.csv.gz
put://home/kevin/home-0xdiag-datasets/ncaa/Players.csv
put://home/kevin/home-0xdiag-datasets/standard/covtype20x.data
put://home/kevin/home-0xdiag-datasets/standard/covtype.data
put://home/kevin/home-0xdiag-datasets/standard/covtype.shuffled.data

synthetically created on the fly by the test. import folder and put

local://home/kevin/h2o/py/testdir_single_jvm/syn_datasets/*
local://home/kevin/h2o/py/testdir_single_jvm/syn_datasets/*300x100*

put://home/kevin/h2o/py/testdir_single_jvm/syn_datasets/cuse.dat_stripped.csv
put://home/kevin/h2o/py/testdir_single_jvm/syn_datasets/p_43
put://home/kevin/h2o/py/testdir_single_jvm/syn_datasets/parity_128_4_10_quad.data
put://home/kevin/h2o/py/testdir_single_jvm/syn_datasets/parsetmp_1_0_0_0.data
put://home/kevin/h2o/py/testdir_single_jvm/syn_datasets/syn_1713683899168453911_100x11.csv
put://home/kevin/h2o/py/testdir_single_jvm/syn_datasets/syn_1mx8_6019110937119320453.csv
put://home/kevin/h2o/py/testdir_single_jvm/syn_datasets/syn_2253683537843653304_100x11000.csv
put://home/kevin/h2o/py/testdir_single_jvm/syn_datasets/syn_3420966812574800993_50x5000.csv_10x.gz
put://home/kevin/h2o/py/testdir_single_jvm/syn_datasets/syn_3513133353091456167_10x65000.csv
put://home/kevin/h2o/py/testdir_single_jvm/syn_datasets/syn_4346382207036615703_100x50.csv
put://home/kevin/h2o/py/testdir_single_jvm/syn_datasets/syn_4774810011763907704_10x1000000.csv
put://home/kevin/h2o/py/testdir_single_jvm/syn_datasets/syn_5067146759646474139_100x10000.csv
put://home/kevin/h2o/py/testdir_single_jvm/syn_datasets/syn_5173181272107538320_100x50.csv
put://home/kevin/h2o/py/testdir_single_jvm/syn_datasets/syn_5741016246728130996_10000x100.csv
put://home/kevin/h2o/py/testdir_single_jvm/syn_datasets/syn_5904529305084132861_10000x50.csv
put://home/kevin/h2o/py/testdir_single_jvm/syn_datasets/syn_6158176150442020712_100x1.csv
put://home/kevin/h2o/py/testdir_single_jvm/syn_datasets/syn_677592106248814013_10x5000.csv_200x.gz
put://home/kevin/h2o/py/testdir_single_jvm/syn_datasets/syn_7730400065972767396_0_100_10000.csv
put://home/kevin/h2o/py/testdir_single_jvm/syn_datasets/syn_7760876082903418477_100x11.csv
put://home/kevin/h2o/py/testdir_single_jvm/syn_datasets/syn_binary_100000x1.csv
put://home/kevin/h2o/py/testdir_single_jvm/syn_datasets/syn_binary_10000x100.csv
put://home/kevin/h2o/py/testdir_single_jvm/syn_datasets/syn_binary_100x3000.svm
put://home/kevin/h2o/py/testdir_single_jvm/syn_datasets/syn_binary_500000x1.csv
put://home/kevin/h2o/py/testdir_single_jvm/syn_datasets/syn__Bins100_Dmin-1_Dmax1_Cmin-1_Cmax1_Imin-1_Imax1_Ddistunique_pos_neg_4164910064226341101_500x30.csv
put://home/kevin/h2o/py/testdir_single_jvm/syn_datasets/syn_enums_150x1.csv
put://home/kevin/h2o/py/testdir_single_jvm/syn_datasets/syn_enums_200x1.csv
put://home/kevin/h2o/py/testdir_single_jvm/syn_datasets/syn_enums_3x1.csv
put://home/kevin/h2o/py/testdir_single_jvm/syn_datasets/syn_ints.csv
put://home/kevin/h2o/py/testdir_single_jvm/syn_datasets/syn_rooz_enums_4000x999.csv
put://home/kevin/h2o/py/testdir_single_jvm/syn_datasets/syn_twovalues.csv