Distribution sampling for Zipf and Binomial is too slow #82
I've written a small benchmark (how meta) to highlight the problem. Here are the results:
I therefore propose to stick with continuous distributions and to suitably discretize their inverse CDF values in order to approximate the discrete counterparts. The relationship between these pairs of distributions is explained in [1,2].

[1] Relationship between Binomial and Normal Distributions
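To illustrate the proposal, here is a minimal sketch for the Zipf case. It assumes the continuous counterpart is a power-law (Pareto-like) density proportional to x^(-s), truncated to [1, n+1), whose inverse CDF has a closed form and is then floored to an integer key. The class `ApproxZipf`, the method `sample`, and the parameters `u`, `n` and `s` are hypothetical names chosen for this example; none of them come from Peel or commons-math3, and this is not necessarily the fix that was eventually implemented.

```java
import java.util.SplittableRandom;

// Hypothetical helper for illustration only; not part of Peel or commons-math3.
final class ApproxZipf {

    /**
     * Maps a uniform value u in [0, 1) to a key in [1, n] by inverting the CDF
     * of a continuous density proportional to x^(-s) on [1, n + 1) and flooring
     * the result. Runs in O(1) per sample, with no table or iterative search.
     */
    static int sample(double u, int n, double s) {
        double upper = n + 1.0;
        double x;
        if (Math.abs(s - 1.0) < 1e-9) {
            // Density ~ 1/x: F(x) = ln(x) / ln(upper), so x = upper^u.
            x = Math.pow(upper, u);
        } else {
            // F(x) = (x^e - 1) / (upper^e - 1) with e = 1 - s.
            double e = 1.0 - s;
            x = Math.pow(1.0 + u * (Math.pow(upper, e) - 1.0), 1.0 / e);
        }
        // Discretize and clamp to guard against floating-point rounding.
        int key = (int) Math.floor(x);
        return Math.min(Math.max(key, 1), n);
    }

    public static void main(String[] args) {
        SplittableRandom rng = new SplittableRandom(42L);
        int n = 100000; // key cardinality used in the report
        int[] hits = new int[4];
        for (int i = 0; i < 1_000_000; i++) {
            int key = sample(rng.nextDouble(), n, 1.0);
            if (key <= 3) {
                hits[key]++;
            }
        }
        System.out.printf("key 1: %d, key 2: %d, key 3: %d%n",
                hits[1], hits[2], hits[3]);
    }
}
```

Note that this is only an approximation: the probability mass assigned to the smallest keys differs from that of an exact Zipf distribution with the same exponent, so it suits data generation where the overall skew matters more than exact per-key probabilities.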
The `commons-math3` distributions used in the reference data generator in the archetypes are really slow.

During a local test of an experiment suite on which I am working with @ggevay I am observing the following numbers for generating `dataset.A` with key cardinality 100000:

- `Uniform` key distribution: the job takes ~5 seconds.
- `Binomial` key distribution: the job takes ~25 seconds.
- `Zipfian` key distribution: the datagen job exceeded the allowed limit of 600 seconds.

The fix should be pushed to the peel-wordcount repository (see peelframework/peel-wordcount#1).