Create dataset for InfoSphere Streams benchmark
Before you begin, make sure you have prepared the dataset by following the steps in [Preprocess Enron Email Dataset](Preprocess Enron Email Dataset).
The StreamsPrepareDataset project creates the dataset for the Streams benchmark. It reads the email dataset prepared in the previous step and creates a file that stores the emails in InfoSphere Streams binary format.
- Avro C++: 1.7.4
  Installation Guide: http://avro.apache.org/docs/1.7.4/api/cpp/html/index.html
  Make sure the include files are located at /usr/local/include and the shared libraries at /usr/local/lib.
- Boost 1.54.0 (required by Avro)
  Installation Guide: http://www.boost.org/doc/libs/1_54_0/doc/html/bbv2/installation.html
  Make sure the include files are located at /usr/local/include and the shared libraries at /usr/local/lib.
- Avro Email Schema File
  Create a folder emailavro under /usr/local/include.
  Copy email.hh from the StreamsAvroOperators/emailavro folder to /usr/local/include/emailavro.
  Alternatively, you can regenerate email.hh with the Avro C++ compiler:
  - Go to the StreamsAvroOperators/ folder.
  - Run the following command:
    avrogencpp -i email.avsc -o email.hh
  An email.hh file is generated, which you need to copy to the /usr/local/include/emailavro directory.
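The header installation and a quick check of the expected layout can be sketched as a small shell script. This is only an illustration: the function names `install_email_header` and `check_prereqs` are hypothetical, and the assumption that Avro C++ and Boost place their headers in `avro` and `boost` subdirectories of the include directory is not taken from the project itself. Only header locations are checked.

```shell
#!/bin/sh
# Sketch: install email.hh and verify the prerequisite header layout.
# The prefix defaults to /usr/local, as in the instructions above.
# Function names and probed subdirectories are assumptions for illustration.

# Copy email.hh into PREFIX/include/emailavro, creating the folder if needed.
#   $1: path to email.hh
#   $2: install prefix (optional, defaults to /usr/local)
install_email_header() {
    src=$1
    prefix=${2:-/usr/local}
    mkdir -p "$prefix/include/emailavro" &&
        cp "$src" "$prefix/include/emailavro/email.hh"
}

# Print each expected path that is missing under PREFIX.
# Returns 0 if everything is present, 1 otherwise.
#   $1: install prefix (optional, defaults to /usr/local)
check_prereqs() {
    prefix=${1:-/usr/local}
    missing=0
    for path in \
        "$prefix/include/avro" \
        "$prefix/include/boost" \
        "$prefix/include/emailavro/email.hh"
    do
        if [ ! -e "$path" ]; then
            echo "missing: $path"
            missing=1
        fi
    done
    return "$missing"
}
```

Writing under /usr/local usually requires elevated privileges, so a script using these functions would typically be run by a user with write access to that prefix. Passing a different prefix as the last argument allows a dry run in a scratch directory first.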
To build the application:
- Go to the root directory of StreamsPrepareDataset
- Run the following command:
make all
Before you can run the application, copy the dataset generated in the preprocessing step ([Preprocess Enron Email Dataset](Preprocess Enron Email Dataset)) to the StreamsPrepareDataset/data directory.
- Make sure a Streams instance is created and running.
- To submit the job to the Streams instance:
streamtool submitjob -i <instanceName> output/Main/Distributed/Main.adl -P filename=<input filename in data dir>
Note: give the input filename without its file extension. The filename parameter accepts only the bare filename.
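Because the filename parameter must be given without an extension, it can help to derive it from the file in the data directory instead of typing it by hand. The following is a minimal sketch: the helper names `param_name` and `submit_command`, the instance name `myinstance`, and the file `data/emails.bin` are all hypothetical. It only prints the command and does not call streamtool.

```shell
#!/bin/sh
# Sketch: build the submitjob command line from a dataset file path.
# Helper names, the instance name, and the example file are hypothetical.

# Strip the directory part and the last extension from a path.
#   $1: path to the dataset file
param_name() {
    base=$(basename "$1")
    echo "${base%.*}"
}

# Print the streamtool command for an instance and a dataset file.
#   $1: Streams instance name
#   $2: path to the dataset file
submit_command() {
    instance=$1
    input=$2
    echo "streamtool submitjob -i $instance output/Main/Distributed/Main.adl -P filename=$(param_name "$input")"
}

# Example:
#   submit_command myinstance data/emails.bin
# prints:
#   streamtool submitjob -i myinstance output/Main/Distributed/Main.adl -P filename=emails
```

Only the last extension is removed, so a name with several dots keeps all but the final one. Review the printed command before running it against a live instance.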
[Running InfoSphere Streams benchmark ](Running InfoSphere Streams benchmark )