Benchmark tools for CHiSEL.
This package includes utilities for generating, driving, and plotting benchmark tests. Each module is described below. Every module includes a main routine; to learn about its options, run `python ./chiselbenchmark/<module>.py -h` from the command line, where `<module>` is one of `generator`, `driver`, or `plotter`.
The Python module `chiselbenchmark.generator` may be used to generate a table of data with different characteristics of denormalization. The main routine requires an argument for the number of rows to be generated; all other parameters are optional. By default, it generates a dataset with a variety of all possible characteristics: simple typed columns, unconstrained "term" columns, denormalized "term" list columns, and repeated embedded subconcepts within the main entity. Data are written to `stdout`, so redirect the output to save it to a file. To run the generator programmatically, import the module and invoke `help(entities)` to learn more.
Note that the generator requires an input file, looked for by default at `~/terms.txt`, containing a line-delimited list of "terms". The terms will be sampled randomly for use in generated `term` and `termlist` columns.
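As a quick way to try the generator, a minimal terms file can be created like this (the terms below are placeholders for illustration; a real run would use a meaningful vocabulary):

```shell
# Write a small line-delimited terms file at the default location (~/terms.txt).
# These four terms are placeholders only.
printf 'alpha\nbeta\ngamma\ndelta\n' > ~/terms.txt
```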
To generate a 1000000-row dataset, run this command:
python ./chiselbenchmark/generator.py 1000000 >~/1000000.csv
The Python module `chiselbenchmark.driver` may be used to drive the test cases against the datasets produced by the generator. Options include the set of test cases to be run, the parameters for the test cases, and the conditions to be tested. The required arguments are the number of rounds per test and a list of datasets by "name", where a name can be a filename without its extension when connecting to the default local file catalog. Without further arguments, the script will run all default test cases (see `-h` for defaults), both conditions, and all default parameters of each test case. Output is written to `stdout` and `stderr`, so redirect them to files to save the results.
Note that the driver requires a local file system data catalog, looked for by default at `~/benchmarks`. The generated datasets (i.e., `1000.csv`, `10000.csv`, etc.) should be saved directly underneath this directory. Results of each round of tests will be saved under `~/benchmarks/output` and deleted during teardown. To debug, use the `--disable-teardown` option.
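A minimal catalog setup might look like the following sketch (the directory path follows the default noted above; the dataset copy assumes a `~/1000.csv` file was generated earlier):

```shell
# Create the local file system data catalog at the driver's default location.
mkdir -p ~/benchmarks

# Generated datasets belong directly underneath, e.g.:
#   cp ~/1000.csv ~/benchmarks/1000.csv
```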
To run the script with the defaults, with 10 rounds per test, using datasets `1000`, `10000`, and `100000`, run this command:
python ./chiselbenchmark/driver.py 10 1000 10000 100000 2>~/error.log >~/results.csv
To run the script for one example test case (more than one is allowed), using dataset `1000`, with only 3 rounds per test and a single param `1`, run this command:
python ./chiselbenchmark/driver.py 3 1000 --testcases create_vocabulary_then_align_and_tag --params 1 2>~/error.log >~/results.csv
The Python module `chiselbenchmark.plotter` may be used to plot results from the driver. The main routine requires the filename of the results to be plotted; the format should conform to the output produced by the `driver` module. Either or both of `--show` and `--save` must be specified in order to display the plots and/or save the figures to files. Other options include the output format (any accepted by `matplotlib`), dots per inch (DPI), and the y-axis timescale (`s` or `ms`). When saving, files will be named according to `test_case.format` in the current working directory.
To plot `results.csv`, showing and saving the figures, run this command:
python ./chiselbenchmark/plotter.py ~/results.csv --show --save