srisatish edited this page Sep 2, 2013 · 5 revisions

Example coding tests and tasks for future hackers at 0xdata and for open-source collaborators. All contributions will be recognized and your name will be added as a collaborator.

(TBD - Attach JIRA-### to this)

1. "Python REST Quantile GitHub Task Description

Objective:

Add a quantile algorithm to H2O and call it from Python using the REST API.

  • Implement a quantile algorithm in H2O by building on top of H2O’s existing MapReduce. Calculate for the following splits: 5%, 10%, 15%, 85%, 90%, 95%. The algorithm should require only a fixed number of passes over the data (ideally 1) to calculate N quantiles. Calculating an approximation is fine (provide a reference to the algorithm you chose).

  • Add a REST API to call the new algorithm. This should provide a JSON response.

  • Write a Python script to access the new algorithm via HTTP and parse the JSON.

  • Run the H2O quantile algorithm on a big dataset (say, 100M rows) on distributed H2O (at least three nodes), driving it from Python through the REST API. (Write a script that generates a big random dataset, and provide the script and the random seed for repeatability.)

  • Print the quantile JSON response from Python.

  • Extra credit: Add a web page for use by the browser. (This can be painful. This is really extra credit!)
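
As one possible reading of the "fixed number of passes" requirement above, here is a minimal sketch of a histogram-based approximation written in plain Python. It does not use H2O's MapReduce classes; `map_chunk`, `reduce_counts`, and `quantiles_from_histogram` are hypothetical names chosen for illustration. It assumes the column's range `[lo, hi]` is already known (for example from a summary, or from one extra pass), and its error is bounded by the bin width.

```python
from functools import reduce

# The six splits requested in the task.
PROBS = [0.05, 0.10, 0.15, 0.85, 0.90, 0.95]

def map_chunk(chunk, lo, hi, nbins):
    """Map step: count how many values of one chunk fall into each bin."""
    counts = [0] * nbins
    width = (hi - lo) / nbins
    for x in chunk:
        # Clamp the index so that x == lo and x == hi land in valid bins.
        i = min(max(int((x - lo) / width), 0), nbins - 1)
        counts[i] += 1
    return counts

def reduce_counts(a, b):
    """Reduce step: two histograms merge by element-wise addition."""
    return [x + y for x, y in zip(a, b)]

def quantiles_from_histogram(counts, lo, hi, probs):
    """Read approximate quantiles off the cumulative bin counts."""
    total = sum(counts)
    width = (hi - lo) / len(counts)
    results = []
    for p in probs:
        target = p * total
        cum = 0
        for i, c in enumerate(counts):
            if c > 0 and cum + c >= target:
                # Interpolate linearly inside the bin that holds the target rank.
                frac = (target - cum) / c
                results.append(lo + (i + frac) * width)
                break
            cum += c
    return results

def approximate_quantiles(chunks, lo, hi, nbins=1000, probs=PROBS):
    """One pass over the data: map every chunk, then merge the histograms."""
    merged = reduce(reduce_counts, (map_chunk(c, lo, hi, nbins) for c in chunks))
    return quantiles_from_histogram(merged, lo, hi, probs)
```

Because the per-chunk histograms merge by simple addition, the reduce step is associative and commutative, which is what lets the work be spread across nodes. Other published sketches (for example, mergeable quantile summaries) give tighter error bounds without needing the range up front.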

Notes:

  • The quantile algorithm should take as input an already-parsed dataset in memory (a HEX key). The output can be rolled up into a JSON response directly, if you like. It’s not a requirement to store the output in the Distributed Key/Value store.

  • The REST API port default is 54321. (This is the same as the browser port.)"
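
To make the client side of the notes above concrete, here is a minimal Python sketch that builds a request URL on the default port 54321, fetches it, and parses the JSON. The endpoint name `Quantiles.json`, the `source_key` parameter, and the response shape `{"quantiles": {...}}` are all assumptions made for illustration; the real names depend on how the new REST API is defined.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

H2O_PORT = 54321  # default REST port given in the notes

def build_url(host, endpoint, params, port=H2O_PORT):
    """Build a GET URL for an H2O REST endpoint."""
    return "http://%s:%d/%s?%s" % (host, port, endpoint, urlencode(params))

def parse_quantiles(body):
    """Parse a JSON body of the ASSUMED shape {"quantiles": {"0.05": 1.2, ...}}."""
    doc = json.loads(body)
    return {float(p): float(v) for p, v in doc["quantiles"].items()}

def fetch_quantiles(host, source_key):
    """Call the (hypothetical) quantile endpoint and return {probability: value}."""
    url = build_url(host, "Quantiles.json", {"source_key": source_key})
    with urlopen(url) as resp:
        return parse_quantiles(resp.read().decode("utf-8"))

if __name__ == "__main__":
    # Requires a running H2O node that exposes the endpoint assumed above.
    result = fetch_quantiles("localhost", "data.hex")
    print(json.dumps(result, indent=2, sort_keys=True))
```

Keeping URL construction and JSON parsing in separate functions means both can be checked without a running cluster.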

2. RESTShuffleRows GitHub Task Description

Objective:

Add a shuffle algorithm to H2O and call it using the REST API. Use the FluidVector API (in java/water/fvec), not the ValueArray API.

(See java/water/api/DRF2.java as a starting point.)

  • Implement a shuffle algorithm in H2O. This should take an input HEX key as the source dataset and produce an output HEX key as the destination dataset. It is OK to copy the data to the output dataset; don’t worry about trying to shuffle in place.

  • Add a REST API to call the new algorithm. Define and implement an appropriate JSON response.

  • Run shuffle on a big dataset (say, 10M rows) using distributed H2O (at least three nodes). Drive this using the REST API. (Write a script to generate a big data set and provide the script for repeatability.)

  • Use the (Beta/Fluid Vecs!)->Inspect Web UI menu item to confirm your changes visually.

  • Extra credit: Add a web page for use by the browser. (This can be painful. This is really extra credit!)
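
The dataset-generation step in the list above could look like the following sketch. It writes seeded Gaussian columns as CSV so that the same seed always reproduces the same file. The column names (`C1`, `C2`, ...), the Gaussian distribution, and the function names are illustrative choices, not requirements from the task.

```python
import csv
import random

def generate_rows(n_rows, n_cols, seed):
    """Yield n_rows rows of n_cols Gaussian values, reproducibly for a given seed."""
    rng = random.Random(seed)  # private generator, so the seed fully fixes the output
    for _ in range(n_rows):
        yield [rng.gauss(0.0, 1.0) for _ in range(n_cols)]

def write_csv(out, n_rows, n_cols, seed):
    """Write a header line and the generated rows to an open text file object."""
    writer = csv.writer(out)
    writer.writerow(["C%d" % (i + 1) for i in range(n_cols)])
    writer.writerows(generate_rows(n_rows, n_cols, seed))

if __name__ == "__main__":
    # 10M rows as in the task; lower this for a quick local try.
    with open("big.csv", "w", newline="") as f:
        write_csv(f, 10000000, 5, seed=42)
```

Rows are produced lazily by a generator, so memory use stays flat regardless of the row count.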

Notes:

  • The shuffle algorithm should take as input an already-parsed dataset in memory (a HEX key). The output should be a new HEX key which corresponds to a shuffled copy of the original dataset.

  • The REST API port default is 54321. (This is the same as the browser port.)
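
To illustrate the copy-based shuffle described in the notes above, here is a small Python sketch of one technique that distributes well: give every row a random sort key, then emit the rows in key order. This is plain Python rather than the FluidVector API, and `shuffle_copy` is a hypothetical name. In a cluster, the random keys could be generated per chunk and the ordering done by a distributed sort, which is an assumption about one possible design, not a description of H2O internals.

```python
import random

def shuffle_copy(rows, seed):
    """Return a shuffled copy of rows; the input list is left untouched."""
    rng = random.Random(seed)
    # Pair each row index with a random key; the index breaks the (rare) ties.
    keyed = [(rng.random(), i) for i in range(len(rows))]
    keyed.sort()
    # Copy rows into the output in key order.
    return [rows[i] for _, i in keyed]

if __name__ == "__main__":
    source = [[i, i * i] for i in range(10)]
    shuffled = shuffle_copy(source, seed=123)
    print(shuffled)
```

Since the keys are independent and identically distributed, every ordering is equally likely, and passing a seed keeps test runs repeatable.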

3. Implement Distributed Matrix Multiplication on H2O

Add a matrix multiplication algorithm to H2O and call it using the REST API. Use the FluidVector API (in java/water/fvec), not the ValueArray API.

  • Add a REST API to call the new algorithm. Define and implement an appropriate JSON response.

  • Run MatrixMultiply on a big dataset (say, 10M rows) using distributed H2O (at least three nodes). Drive this using the REST API. (Write a script to generate a big data set and provide the script for repeatability.)

  • Use the (Beta/Fluid Vecs!)->Inspect Web UI menu item to confirm your changes visually.

  • Extra credit: Add a web page for use by the browser. (This can be painful. This is really extra credit!)
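
The task does not prescribe a multiplication strategy. One simple option, sketched below in plain Python, splits the left matrix A into row blocks (standing in for chunks held on different nodes), multiplies each block by a full copy of the right matrix B, and concatenates the partial results in order. This assumes B is small enough to send to every node; the function names are hypothetical and no H2O classes are used.

```python
def multiply_block(a_rows, b):
    """Multiply one block of rows of A by the whole matrix B."""
    n_inner = len(b)
    n_cols = len(b[0])
    out = []
    for row in a_rows:
        if len(row) != n_inner:
            raise ValueError("row length does not match the number of rows in B")
        out.append([sum(row[k] * b[k][j] for k in range(n_inner))
                    for j in range(n_cols)])
    return out

def split_rows(a, n_chunks):
    """Split A into at most n_chunks contiguous row blocks."""
    size = max(1, -(-len(a) // n_chunks))  # ceiling division, at least 1
    return [a[i:i + size] for i in range(0, len(a), size)]

def chunked_matmul(a, b, n_chunks=3):
    """Compute A x B block by block; each block could run on a different node."""
    result = []
    for chunk in split_rows(a, n_chunks):
        result.extend(multiply_block(chunk, b))
    return result
```

With a tall A (10M rows) and a narrow B this needs no communication between row blocks. If B is also large, a blocked scheme that partitions both matrices would be needed instead.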
