Running Apache Storm benchmark
Before you begin, make sure you compiled the application and created the required dataset: [Create dataset for Apache Storm benchmark](Create dataset for Apache Storm benchmark)
The Apache Storm benchmark contains the following topologies:

- `EnronTopology`: complete application benchmark
- `BareboneTopology`: same as `EnronTopology`, but the filter, modify, and metrics bolts are unity bolts
- `TrivialTopology1`: same as `BareboneTopology`, but the filter and modify bolts are removed
- `TrivialTopology2`: same as `TrivialTopology1`, but the serialization and deserialization bolts are removed. This topology requires an unserialized but compressed dataset (create it using `com.ibm.streamsx.storm.email.benchmark.testing.CreateCompressedDatasetSequential`)
- `RestrictedTopology`: same as `TrivialTopology2`, but without compression and decompression. This topology requires an uncompressed dataset (create it using `com.ibm.streamsx.storm.email.benchmark.testing.CreateSerializedDatasetSequential`)
- Each spout requires its own dedicated input file
- The first spout gets `name0.ext`, the second `name1.ext`, and so on. The dataset naming convention is therefore `name<n>.ext`, where n ranges from 0 to m-1 and m is the parallelism of the spout
- The dataset must be present on NFS
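The naming rule above can be sanity-checked with a small shell loop. This is only a sketch: `name`, `.ext`, and the temporary directory are placeholders for your actual `filename`, `fileext`, and NFS dataset folder.

```shell
# Sketch: verify that all input files name0.ext .. name<m-1>.ext exist
# for a spout parallelism of m (here m=4; names are placeholders).
PARALLELISM=4
DATASET_DIR=$(mktemp -d)   # stand-in for your NFS dataset folder

# Create dummy files so the check below has something to find
for i in $(seq 0 $((PARALLELISM - 1))); do
  touch "$DATASET_DIR/name$i.ext"
done

# The actual check: every spout must have its own dedicated input file
for i in $(seq 0 $((PARALLELISM - 1))); do
  [ -f "$DATASET_DIR/name$i.ext" ] || echo "missing: name$i.ext"
done
```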
Copy `conf/storm.email.properties` to your home folder (`~/`) and fill in the missing values. At a minimum, the following keys must be set:

- `logspath`: path to the custom logs folder
- `filepath`: path to the folder containing the input files
- `filename`: base name of the input files
- `fileext`: extension of the input files
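A filled-in fragment of `storm.email.properties` might look like the following. All values are hypothetical examples for illustration, not defaults from the repository:

```properties
# Hypothetical example values -- adjust paths to your environment
logspath=/home/storm/benchmark-logs
filepath=/nfs/datasets/enron
filename=name
fileext=.ext
```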
For instance, if you have four spouts, you should have `name0.ext`, `name1.ext`, `name2.ext`, and `name3.ext` on your NFS. `filepath` would point to the folder where these four files are located, `filename` would be `name`, and `fileext` would be `.ext`.
The parallelism of the pipeline can be varied by changing the values of the `*spout` and `*bolt` keys in the configuration file.
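For example, a parallelism configuration could look like the fragment below. The key names here are illustrative only; the real keys (ending in `spout` or `bolt`) are listed in `conf/storm.email.properties`:

```properties
# Illustrative only -- check conf/storm.email.properties for the
# actual key names ending in "spout" or "bolt"
emailspout=4
filterbolt=8
modifybolt=8
```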
- The final metrics are emitted by the Global Metrics Bolt
- To compute them, it needs to know the total number of emails, which it reads from the configuration file (`totalemails`)
- This number must be updated each time the dataset changes: uncomment the `totalemails` line for the corresponding dataset in the configuration file
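In the configuration file this typically appears as several `totalemails` lines, one per dataset, with exactly one left uncommented. The counts below are made up for illustration:

```properties
# Keep exactly one totalemails line uncommented, matching your dataset.
# The counts below are illustrative, not real dataset sizes.
#totalemails=50000
totalemails=500000
```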
The code works with Storm 0.8.2 and 0.9.0.1. The version is selected by setting a) `storm.version` in `pom.xml` and b) `stormversion` in `storm.email.properties`.
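In `pom.xml`, this is a property in the `<properties>` section, sketched below with the surrounding POM content omitted; `stormversion` in `storm.email.properties` must be set to the matching value:

```xml
<!-- Sketch: select the Storm version; keep this in sync with the
     stormversion key in storm.email.properties -->
<properties>
  <storm.version>0.9.0.1</storm.version>
</properties>
```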
To run a topology:

```
storm jar target/storm-email-benchmark-1.0-jar-with-dependencies.jar com.ibm.streamsx.storm.email.benchmark.<topology_name> <local_or_remote> <job_id>
```

In local mode, Storm is executed as a standalone local application. In remote mode, an existing Storm deployment is used.
These topologies use the vanilla shuffle grouping. If you want to use the localOrShuffle grouping instead, use `com.ibm.streamsx.storm.email.benchmark.local.<topology_name>`. For some setups, especially single-process ones, shuffle seems to perform better than localOrShuffle.
The final number of characters, words, and paragraphs, as well as throughput, elapsed time, and number of processed emails, can be retrieved from `<logspath>/<job_id>/GlobalMetricsBolt_Final`. See the Configuration section above for details of `logspath`.
- Interval metrics can be obtained from `<logspath>/<job_id>/GlobalMetricsBolt` and `<logspath>/<job_id>/GlobalMetricsBolt_Throughput`
- To collect CPU time after the job has completed:
  - Run `jps` and note down the PIDs of all Worker processes
  - For each Worker PID, run `ps -e -o pid,cputime | grep <pid>`
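The per-PID lookup can be scripted as a small loop. This is a sketch: the current shell's PID (`$$`) stands in for a Worker PID, since the real PIDs come from `jps`.

```shell
# Sketch: print accumulated CPU time for each worker PID.
# Replace "$$" with the Worker PIDs reported by jps.
for pid in $$; do
  ps -o pid=,cputime= -p "$pid"
done
```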