Create dataset for InfoSphere Streams benchmark

Before you begin, prepare the dataset by following the steps in [Preprocess Enron Email Dataset](Preprocess Enron Email Dataset).

Overview

The StreamsPrepareDataset project creates the dataset for the Streams benchmark. It reads the email dataset prepared in the previous step and creates a file that stores the emails in InfoSphere Streams binary format.

Prerequisites:

Avro C++: 1.7.4

Installation Guide: http://avro.apache.org/docs/1.7.4/api/cpp/html/index.html

Make sure the include files are located at /usr/local/include and the shared libraries at /usr/local/lib.

Boost 1.54.0 (required by Avro)

Installation Guide: http://www.boost.org/doc/libs/1_54_0/doc/html/bbv2/installation.html

Make sure the include files are located at /usr/local/include and the shared libraries at /usr/local/lib.

Avro Email Schema File

Copy the directory emailavro from StreamsPrepareDataset to /usr/local/include.

Alternatively, regenerate email.hh with the Avro C++ compiler: avrogencpp email.avsc

The schema file email.avsc is located in StreamsPrepareDataset.
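
The prerequisite locations above can be checked before building. The following is a minimal sketch, not part of the project: the function name `check_prereqs` and the `PREFIX` variable are hypothetical, and the library file name `libavrocpp.so` is an assumption based on the usual name of the Avro C++ shared library on Linux.

```shell
#!/bin/sh
# Sketch: confirm the prerequisites are where the build expects them.
# PREFIX defaults to /usr/local, the location named in the prerequisites.
PREFIX="${PREFIX:-/usr/local}"

check_prereqs() {
    missing=0
    # Header directories for Avro C++, Boost, and the email schema.
    for path in \
        "$PREFIX/include/avro" \
        "$PREFIX/include/boost" \
        "$PREFIX/include/emailavro"
    do
        if [ ! -d "$path" ]; then
            echo "missing directory: $path"
            missing=1
        fi
    done
    # Assumption: the Avro C++ shared library is named libavrocpp.so*.
    if ! ls "$PREFIX"/lib/libavrocpp.so* >/dev/null 2>&1; then
        echo "missing library: $PREFIX/lib/libavrocpp.so"
        missing=1
    fi
    return "$missing"
}
```

Calling `check_prereqs` prints one line per missing item and returns a non-zero status if anything is absent, so it can guard the build step.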

Compilation

To build the application:

Go to the root directory of StreamsPrepareDataset.
Run make all at the command line.

Set up

Before you can run the application, copy the dataset generated in the previous step to the StreamsPrepareDataset/data directory.

Execution

Make sure a Streams instance is created and running.
To submit the job to the Streams instance: streamtool submitjob output/Main/Distributed/Main.adl filename=<input_file_name>
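
The build, set-up, and submission steps can be chained in one shell function. This is a sketch under stated assumptions rather than a script shipped with the project: the function name `prepare_and_submit` and the `RUN` variable are hypothetical, and it assumes that the `filename` parameter takes the name of the file inside the data directory rather than a full path. It also assumes a Streams instance is already running.

```shell
#!/bin/sh
# Sketch of the build, set-up, and submit steps as one function.
# Set RUN=echo to print each command instead of executing it (dry run).
RUN="${RUN:-}"

prepare_and_submit() {
    project_dir="$1"   # path to the StreamsPrepareDataset checkout
    dataset="$2"       # file produced by the preprocessing step

    # Build the application from the project root.
    (cd "$project_dir" && $RUN make all) || return 1

    # Copy the dataset into the project's data directory.
    $RUN cp "$dataset" "$project_dir/data/" || return 1

    # Assumption: filename is the file's name inside data/, not a full path.
    (cd "$project_dir" && $RUN streamtool submitjob \
        output/Main/Distributed/Main.adl "filename=$(basename "$dataset")")
}
```

With `RUN=echo` set, calling the function with the project directory and the dataset path prints the three commands without running them, which is a quick way to review them before submitting the job.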

Next Step:

[Running InfoSphere Streams benchmark](Running InfoSphere Streams benchmark )
