
THIS PAGE IS OBSOLETE

Current URLs that do things

Substitute your IP address for the 192.168.1.17 shown (and the port if different). All URLs are prefixed with an H2O IP address plus HTTP port, like http://192.168.1.174:55318 or http://localhost:55318

/StoreView (or StoreView.json)
/Remote?Node=/192.168.1.17:54328
/GetQuery
/Get?Key=  (or Get.json)
/GetFile
/Put
/PutValue?Value=&Key=&RF= (or PutValue.json)
/PutFile (or PutFile.json)
/Upload
/Timeline (or Timeline.json)
/ImportQuery
/ImportFolder?Folder=&RF= (or ImportFolder.json)
/ImportUrl?Url=&Key=&RF= (or ImportUrl.json)
/ProgressView
/Network (or Network.json)
/Shutdown (or Shutdown.json)
/Parse?Key= (or Parse.json)
/GLM
/RF
/RFView
/RFTreeView
/LR

To get stacktraces of the nodes

/DbgJStack (or DbgJstack.json)
/DebugView
/DebugView?Prefix=
/Jstack (or Jstack.json)
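For example, here is a minimal sketch (not from the H2O docs; the node address is an assumption) of calling one of these endpoints with the ".json" suffix from Python's standard library:

# Minimal sketch: fetch a .json endpoint from a running H2O node and decode it.
# The host/port below are assumptions; substitute your own node address.
import json
import urllib.request

H2O = "http://localhost:54321"   # assumed H2O node address

with urllib.request.urlopen(H2O + "/StoreView.json") as resp:
    store = json.loads(resp.read().decode("utf-8"))

print(store)   # whatever the StoreView.json response contains for this build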

Keys

H2O maintains persistent objects in its own private "ice" host filesystem structure. All datasets are loaded into H2O as keys before they can be manipulated. H2O operations can create more keys as results. Keys live on particular nodes, but they are accessible from all nodes. There is finer granularity of these objects below the abstraction presented to the user. The manipulation of these key abstractions is central to how multiple nodes cooperate on a single task: they share source data and results through keys.

The main thing for the user to understand is that keys form a flat name space at the user-level API. There is no need to know where the data actually lives on physical nodes when working at the API level.

JSON vs. URL descriptions

H2O has a browser interface that operates on top of the JSON interface. The browser interface sometimes hides some of the steps done in pure JSON-style interaction. Most of the JSON API can be used as a browser URL by adding the suffix ".json". Strictly speaking, only encoded URLs are legal in a browser, since different browsers may or may not support characters like "/", "," or ":". In this documentation we currently show unencoded URLs; your browser may let you use them directly without encoding. Additionally, the JSON API is sometimes shown as a URL, which can work in a browser (results of a ".json" URL are returned to the browser as text).
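If you are building the URLs programmatically, here is a minimal sketch of encoding a key value (the host/port and endpoint usage are only illustrative):

# Minimal sketch: URL-encode a query parameter value that contains "/" and ":"
# before handing it to the JSON API. The key below is an example from this page.
from urllib.parse import quote

key = "nfs://home/0xdiag/datasets/covtype.data"
url = "http://192.168.1.174:55318/Inspect?Key=" + quote(key, safe="")
print(url)
# http://192.168.1.174:55318/Inspect?Key=nfs%3A%2F%2Fhome%2F0xdiag%2Fdatasets%2Fcovtype.data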

Files to/from H2O

A PutFile will create a key for you if you don't specify one. You can put from a normal network path, or from hdfs://192.168.1.151/ to go to the 0xdata HDFS through the 192.168.1.151 namenode. homer.0xdata.loc can also be used for the namenode.

A GetFile currently sends the file to the browser, which opens the normal "Save As" popup window. You can then specify where to save the file locally.

Current restriction on key creation: write-once, read many.

All PutFile and Parse operations, which create H2O keys, need to specify a new, previously unused key name as the destination. You will not get the desired overwrite behavior if you reuse a previously used key name.

For example, if you PutFile a file, then modify the file outside H2O somehow, and then PutFile again with the same destination key name, the operation will complete without error, but you will not have the new contents of the file. Similarly, if you Parse a file to a key and then Parse a different source to the same key, you will not get an updated result.

When using Exec, a default key called "Result" is used for the Exec result value. This can be a single-row, single-column key, or one with many rows. This is one case where a user-accessible key can be written multiple times.
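A minimal sketch of respecting the write-once restriction by generating a fresh key name for each load (the node address is an assumption; the PutValue parameters are the ones listed at the top of this page):

# Minimal sketch: avoid reusing key names by generating a new, unused key
# per load, since overwriting an existing key does not behave as expected.
import urllib.request
from urllib.parse import quote
from uuid import uuid4

H2O = "http://localhost:54321"            # assumed H2O node address
key = "mydata-" + uuid4().hex             # new, previously unused key name
url = f"{H2O}/PutValue?Value={quote('some value', safe='')}&Key={quote(key, safe='')}"

with urllib.request.urlopen(url) as resp:
    print(resp.status, key)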

ImportFolder

The normal sequence for using ImportFolder is:

  1. Importfolder

  2. Inspect

  3. Parse

  4. RF+RFView, or GLM. RF can be started with RFQueryBuild, if in a browser.

ImportFolder creates keys differently than PutFile. In the browser, this is easy, because the "right" form is always provided to you: you just click.

The json interface requires you to match what happens in the browser.

ImportFolder can be a little confusing because it creates/uses key names that are generated from the full path to a file.

Warning on unix systems: if you use symbolic links for the path to the folder, H2O will resolve that to the real pathname. If the real path is not the same on all nodes, ImportFolder won't work. (This may be an issue in complicated file systems).

This URL gives you a list of the files you can inspect or parse from the Folder. The key names are created from the full path name to the file in the Folder, with nfs:// prepended. RF stands for "replication factor" and doesn't need to be specified.

http://192.168.0.31:55321/ImportFolder?Folder=/home/0xdiag/datasets&RF=

The JSON version returns two lists: one with files that "failed" for some reason (usually because the file is not the same on all nodes), and one ("ok") with files that can be loaded.

In this example return data, note that plain file path names are returned. These path names can't be used by themselves in any further JSON URLs; they need to be extended at the beginning or end, as shown in the next section.

SUGGESTION: rather than file names, which are not usable in H2O here, maybe the json should return strings with "nfs://" prepended, which makes them usable.

{
"failed":[],
"ok":[
    "/home/0xdiag/datasets/new-poker-hand.full.311M.txt.gz",
    "/home/0xdiag/datasets/covtype20x.data",
    "/home/0xdiag/datasets/billion_rows.csv.gz",
    "/home/0xdiag/datasets/michal/train.csv",
    "/home/0xdiag/datasets/michal/test.csv",
    "/home/0xdiag/datasets/covtype.data",
    "/home/0xdiag/datasets/make200",
    "/home/0xdiag/datasets/covtype200x.data"],
"imported":8
}
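A minimal sketch of driving this from the JSON side and turning the returned paths into usable key names (host/port and folder are assumptions; the nfs:// prefix rule is described above):

# Minimal sketch: list a folder via ImportFolder.json, then build H2O key
# names from the returned paths by prepending "nfs:/" (so "/home/..." becomes
# "nfs://home/...").
import json
import urllib.request
from urllib.parse import quote

H2O = "http://192.168.0.31:55321"         # assumed H2O node address
folder = "/home/0xdiag/datasets"

with urllib.request.urlopen(f"{H2O}/ImportFolder.json?Folder={quote(folder, safe='')}") as resp:
    listing = json.loads(resp.read().decode("utf-8"))

keys = ["nfs:/" + path for path in listing["ok"]]   # usable as Key= in Inspect/Parse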

ImportFolder itself doesn't create any keys. If you do it in a browser, you'll get a link that's right for parsing each file (if you click), plus some info about the file:

http://192.168.0.31:55321/Inspect?Key=nfs://home/0xdiag/datasets/covtype.data

This asks H2O to specify a default key name for the parsed data (Key2)

http://192.168.0.31:55321/Parse?Key=nfs://home/0xdiag/datasets/covtype.data

Usually you will prefer to specify short key names.

To better match what you get with PutFile, which doesn't include the full path name, you will probably want to drop the full path name from Key2. The .hex suffix is required for Key2.

http://192.168.0.31:55321/Parse?Key=nfs://home/0xdiag/datasets/covtype.data&Key2=covtype.data.hex

If you don't specify Key2, it is created by H2O this way, from the first Key:

http://192.168.0.31:55321/Parse?Key=nfs://home/0xdiag/datasets/covtype.data&Key2=//home/0xdiag/datasets/covtype.hex

H2O will drop any .gz suffix or any other .suffix at the end of Key (just the last one) when creating Key2.

http://192.168.0.31:55321/Parse?Key=nfs://home/0xdiag/datasets/billion_rows.csv.gz&Key2=//home/0xdiag/datasets/billion_rows.csv.hex

Or another with .data dropped. This is for referencing covtype.data in the Folder. The created key will be called //home/0xdiag/datasets/covtype.hex. (Key2 is not required to be specified directly; H2O will create the name with its rules.)

http://192.168.0.31:55321/Inspect?Key=nfs://home/0xdiag/datasets/covtype.data

If the file has already been parsed, the browser will point you to the parsed result rather than the file:

http://192.168.0.31:55321/Inspect?Key=//home/0xdiag/datasets/covtype.hex&Key2=//home/0xdiag/datasets/covtype.hex

Here, ".csv" is dropped when H2O creates the Key2 name:

http://192.168.0.31:55321/Parse?Key=nfs://home/0xdiag/datasets/michal/train.csv&Key2=//home/0xdiag/datasets/michal/train.hex
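To summarize the default naming, here is a minimal sketch (my own inference from the examples above, not H2O source) of how Key2 is derived from an nfs:// Key when Key2 is not specified:

# Minimal sketch (inferred rule, not H2O source): the default Key2 drops only
# the last .suffix of the Key, appends .hex, and turns the nfs:// prefix into //.
def default_key2(key):
    path = key[len("nfs:/"):]              # "nfs://home/..." -> "/home/..."
    base = path.rsplit(".", 1)[0]          # drop only the last .suffix
    return "/" + base + ".hex"             # e.g. "//home/.../covtype.hex"

assert default_key2("nfs://home/0xdiag/datasets/covtype.data") == "//home/0xdiag/datasets/covtype.hex"
assert default_key2("nfs://home/0xdiag/datasets/billion_rows.csv.gz") == "//home/0xdiag/datasets/billion_rows.csv.hex"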

Current issues to watch out for:

H2O caches keys. Reloading a file with the same key name will likely not pick up changed contents of the file. Workaround: always use a new, unique key when loading a file.

Random Forest

RF and RFView both get some shared parameters that need to be identical. Just listing current parameters for now:

Tip: you can use the browser history to see what the H2O browser does behind the scenes, since only the RFView URL will be visible after using the RFBuildQuery page and clicking "create the confusion matrix".

Tip: you can use http://meyerweb.com/eric/tools/dencoder to decode a JavaScript-encoded URL back to the original. The encoded URLs are visible in the H2O browser, but for JSON the non-encoded versions should be passed.

classWt and ignore can refer to column names or numbers? (numbers should be 0-based like GLM?)

Can pass the full list to both random_forest and random_forest_view. They will sort out what they need. (extras are ignored?)

ISSUE: have to add detail on which have defaults, and optional vs required.

Current RF parameters. For the matching ones, they have to be identical to RFView? (is this ever not true?)

They may be different. If they are not present, they are taken from the model itself.

RF?
classWt=& ... weights of classes. list of assignments to class names. Affects tree building.
class=2&
ntree=&
modelKey=&
OOBEE=true& ...out of bag evaluation
gini=1&
depth=&
binLimit=&
parallel=&  ...debug only? ...
ignore=&   ...comma separated list of columns to ignore
sample=&
seed=&
features=1& ... number of features to consider
stratify=1 ... flag turning stratified sampling on/off&
strata=1:5 ... Optional. Comma separated list of cid:n, cid is index of a class and n is the absolute number of samples.
Key=chess_2x2_500_int.hex
Unbalanced datasets

Currently H2O has 3 mechanisms to help improve per-class error rates on unbalanced datasets:

  1. Weighting by output class when creating trees. The weights multiply the effects of rows with that output class, "as if" there were more copies of that row.

  2. Weighting by output class during voting, while creating the confusion matrix or predicted results. If there are trees with different values, weighting a class will make it act "as if" more trees existed that voted for that value.

  3. Stratified Sampling: Used when sampling data points for building trees. With stratified sampling, each class is sampled independently, so the number of samples from each class can be controlled. The user can request a specific number of samples per class (strata) to override the default values. Default values are computed so that each node has the same ratio of classes as in the whole dataset. If some class has too few samples in the whole dataset (< 1/512 of the majority class), its default strata is the total number of samples of that class. Stratified sampling can cause samples to be duplicated across several nodes when there are not enough data points to satisfy the requested (or default) strata on some nodes.

Stratified sampling can decrease overall error rate as it effectively undersamples majority classes.
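A minimal sketch of encoding the class-weight and strata arguments (the formats come from the parameter descriptions above; the class names, indexes, and counts are made-up placeholders):

# Minimal sketch: build the classWt and strata query-parameter strings.
# classWt maps class names to weights ("B=1,W=2"); strata maps a class index
# to an absolute sample count per class ("0:100,1:100").
class_weights = {"B": 1, "W": 2}
strata_counts = {0: 100, 1: 100}

classWt = ",".join(f"{name}={w}" for name, w in class_weights.items())
strata = ",".join(f"{cid}:{n}" for cid, n in strata_counts.items())

print(classWt)   # B=1,W=2
print(strata)    # 0:100,1:100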

Current RFView variables

Ignores can be specified, but if they are not, those from the model are used. Best to let it forward from the model. (ISSUE: if we can save the ignores, why not forward the rest of the information in the model also?)

RFView?
classWt=B=1,W=2&  (Note: not passed from model. Affects tree voting)
class=2&
ntree=50&
modelKey=model&
OOBEE=true&
ignore=&   
dataKey=chess_2x2_500_int.hex

RFView should only get these parameters?

classWt=B=1,W=2&  (Note: not passed from model. Affects tree voting)
modelKey=model&
OOBEE=true&  (or false)
dataKey=chess_2x2_500_int.hex
Examples

The H2O browser uses repeated ignores, which are accumulated. This could have been done with a single comma-separated list of values to ignore:

http://192.168.1.174:55318/RF.json?Key=poker-hand-testing.hex&classWt=&class=10&ntree=&gini=1&depth=3&binLimit=&sample=&seed=&features=&OOBEE=true&ignore=0&ignore=1&ignore=2&ignore=3&ignore=4&ignore=5&ignore=6&ignore=7&ignore=8&modelKey=

http://192.168.1.174:55318/RFView.json?ntree=50&modelKey=model&class=10&classWt=&OOBEE=true&dataKey=poker-hand-testing.hex
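A minimal sketch of issuing the same pair of requests from Python (host/port and parameter values mirror the example URLs above; empty parameters are simply omitted, and a comma-separated ignore list is used instead of the repeated form):

# Minimal sketch: kick off RF.json, then fetch the matching RFView.json.
import json
import urllib.request
from urllib.parse import urlencode

H2O = "http://192.168.1.174:55318"        # assumed H2O node address

rf = {"Key": "poker-hand-testing.hex", "class": 10, "ntree": 50,
      "OOBEE": "true", "ignore": "0,1,2,3,4,5,6,7,8", "modelKey": "model"}
with urllib.request.urlopen(f"{H2O}/RF.json?{urlencode(rf)}") as resp:
    print(json.loads(resp.read().decode("utf-8")))

view = {"ntree": 50, "modelKey": "model", "class": 10,
        "OOBEE": "true", "dataKey": "poker-hand-testing.hex"}
with urllib.request.urlopen(f"{H2O}/RFView.json?{urlencode(view)}") as resp:
    print(json.loads(resp.read().decode("utf-8")))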
Looking at individual trees

n=9 and n=10 select individual trees. Numbering starts with n=0.

ISSUE: apparently json not supported yet

http://192.168.1.174:55318/RFTreeView?modelKey=model&n=9&dataKey=poker-hand-testing.hex&class=10
http://192.168.1.174:55318/RFTreeView?modelKey=model&n=10&dataKey=poker-hand-testing.hex&class=10

Linear Regression

http://192.168.0.37:54321/LR?Key=____aad91-3d73-436e-bd44-86e356950633

&colA= and &colB= can be added to specify two feature columns. The two columns should cover more than one point; otherwise, NaNs will be returned.
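A minimal sketch of the same call from Python (the node address and key are the placeholders from the URL above; the column indexes are made up):

# Minimal sketch: run LR on two feature columns via colA/colB.
import urllib.request
from urllib.parse import urlencode

H2O = "http://192.168.0.37:54321"
params = {"Key": "____aad91-3d73-436e-bd44-86e356950633", "colA": 0, "colB": 1}
with urllib.request.urlopen(f"{H2O}/LR?{urlencode(params)}") as resp:
    print(resp.read()[:200])              # preview of the returned page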

Logistic Regression

To run logistic/linear regression via the REST API, go to /GLM and supply the following arguments:

  • Key - (required) - key of the .hex data to run regression on
  • Y - (required) - column id / name of the response variable; if you run logistic regression, its values currently MUST be from {0,1}
  • X - (optional) - list of columns used as an input vector, either names or column indexes (or mixture of both), separated by ','. Ranges can be provided by supplying two values separated by ':', e.g. X=1:3,5:10 is equivalent to X=1,2,3,5,6,7,8,9,10. Default is all columns except for Y.
  • -X - (optional) - negative X, these columns will be subtracted from the set of selected columns. Thus, selected columns = X \ -X. Default is empty set.
  • family - (optional) - either "binomial" for logistic regression or "gaussian" for linear regression. Default is gaussian.
  • xval - (optional) - cross validation factor. Supply 10 for 10-fold cross validation. Default is 0 (no cross validation). Currently supported only for logistic regression.
  • threshold - (optional) - decision threshold used for prediction error rate computation. Only used by logistic regression. Default is 0.5.
  • norm - (optional) - norm for regularized regression. Possible values "L1", "L2", "NONE". Default is "NONE". If regularization is used, the data will be normalized so that it has zero mean and unit variance.
  • lambda - (optional) - argument affecting the regularization. Default is 0.1.
  • rho - (optional) - used only with L1. Enhanced Lagrangian parameter. Default is 1.0.
  • alpha - (optional) - used only with L1. Over-relaxation parameter (typical values for alpha are between 1.0 and 1.8). Default is 1.0.

example URLs:

http://127.0.0.1:54321/GLM?Key=____prostate.hex&X=6,3&Y=1&family=binomial 

runs logistic regression using columns #6 and #3 as the X vector and column #1 as the response variable

http://127.0.0.1:54321/GLM?Key=____prostate.hex&Y=CAPSULE&family=binomial&xval=10&threshold=0.4

All columns are used for the regression except for Y (CAPSULE). 10-fold cross validation will be performed using decision threshold 0.4.
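A minimal sketch of issuing that request from Python (assuming /GLM also accepts the ".json" suffix, per the pattern described earlier; everything else mirrors the example URL):

# Minimal sketch: logistic regression on prostate.hex with 10-fold cross
# validation and a 0.4 decision threshold, mirroring the example URL above.
import json
import urllib.request
from urllib.parse import urlencode

params = {"Key": "____prostate.hex", "Y": "CAPSULE", "family": "binomial",
          "xval": 10, "threshold": 0.4}
url = "http://127.0.0.1:54321/GLM.json?" + urlencode(params)
with urllib.request.urlopen(url) as resp:
    glm = json.loads(resp.read().decode("utf-8"))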
