Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distribution sampling for Zipf and Binomial is too slow #82

Open
aalexandrov opened this issue Mar 25, 2016 · 1 comment
Open

Distribution sampling for Zipf and Binomial is too slow #82

aalexandrov opened this issue Mar 25, 2016 · 1 comment

Comments

@aalexandrov
Copy link
Member

The commons-math3 distributions used in the reference data generator in the archetypes are really slow.

During a local test of an experiment suite on which I am working with @ggevay I am observing the following numbers for generating dataset.A with key cardinality 100000:

  • with Uniform key distribution, the job takes ~ 5 seconds
  • with Binomial key distribution, the job takes ~ 25 seconds
  • with Zipfian key distribution, the datagen job exceeded the allowed limit of 600 seconds.

The fix should be pushed to the peel-wordcount repository (see peelframework/peel-wordcount#1).

@aalexandrov
Copy link
Member Author

I've written a small benchmark (how meta) to highlight the problem.

Here are the results:

##########################################################################################
# Distributions inverse CDF benchmark
##########################################################################################

Cardinality    Uniform        Binomial       Normal         Zipf           Pareto         
------------------------------------------------------------------------------------------
1000           2009.893ms     49049.008ms    1557.039ms     830303.17ms    14294.921ms    
10000          1293.16ms      46005.583ms    1168.293ms     833500.448ms   14307.119ms    
100000         1299.423ms     46028.881ms    1156.179ms     824108.238ms   14183.577ms    
1000000        1276.722ms     45786.145ms    1127.982ms     831676.369ms   14477.197ms    
10000000       1264.654ms     46654.514ms    1142.256ms     835947.764ms   14302.145ms    
100000000      1320.517ms     47075.169ms    1165.2ms       835370.14ms    14216.341ms    

Uniform and Normal distributions are fastest with roughly the same values, Pareto is ~ factor 10 slower, followed Binomial (~ 40x slower), and Zipf (~ 650x slower).

I therefore propose to stick with the continuous probabilities and suitably discretize the inverse CDF values in order to approximate their discrete counterparts. The relationship between these pairs of distributions is explained in [1,2].

[1] Relationship between Binomial and Normal Distributions
[2] Zipf, Power-laws, and Pareto - a ranking tutorial

@aalexandrov aalexandrov added this to the Mar 2016 milestone Mar 25, 2016
@aalexandrov aalexandrov changed the title commons-math3 distribution sampling too slow Distribution sampling for Zipf and Binomial is too slow Mar 25, 2016
aalexandrov added a commit to TU-Berlin-DIMA/flink-hashagg that referenced this issue Mar 25, 2016
@aalexandrov aalexandrov modified the milestone: Mar 2016 Apr 10, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants