Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Low percentile values do not seem to be calculated correctly #20

Open
homiak opened this issue May 24, 2019 · 1 comment
Open

Low percentile values do not seem to be calculated correctly #20

homiak opened this issue May 24, 2019 · 1 comment

Comments

@homiak
Copy link

homiak commented May 24, 2019

Here are the percentile calculated using this library (first column) and percentiles calculated using github.com/montanaflynn/stats library (second column)

Target percentiles:

calcQuantiles = []float64{0.001, 0.01, 0.03, 0.05, 0.1, 0.2, 0.3, 0.4, 0.50, 0.90}

time="2019-05-24T08:53:00Z" level=info msg="Total: 62748 samples"
time="2019-05-24T08:53:00Z" level=info msg="0.1th: 182, 187"
time="2019-05-24T08:53:00Z" level=info msg="1.0th: 4494, 2826"
time="2019-05-24T08:53:00Z" level=info msg="3.0th: 7982, 8024"
time="2019-05-24T08:53:00Z" level=info msg="5.0th: 11528, 11689"
time="2019-05-24T08:53:00Z" level=info msg="10.0th: 15766, 15519"
time="2019-05-24T08:53:00Z" level=info msg="20.0th: 19606, 19680"
time="2019-05-24T08:53:00Z" level=info msg="30.0th: 20909, 20915"
time="2019-05-24T08:53:00Z" level=info msg="40.0th: 21205, 21206"
time="2019-05-24T08:53:00Z" level=info msg="50.0th: 21347, 21350"
time="2019-05-24T08:53:00Z" level=info msg="90.0th: 21470, 21470"

time="2019-05-24T08:54:00Z" level=info msg="Total: 64905 samples"
time="2019-05-24T08:54:00Z" level=info msg="0.1th: 318, 187"
time="2019-05-24T08:54:00Z" level=info msg="1.0th: 4106, 2838"
time="2019-05-24T08:54:00Z" level=info msg="3.0th: 9910, 8039"
time="2019-05-24T08:54:00Z" level=info msg="5.0th: 11709, 11720"
time="2019-05-24T08:54:00Z" level=info msg="10.0th: 15785, 15536"
time="2019-05-24T08:54:00Z" level=info msg="20.0th: 19635, 19685"
time="2019-05-24T08:54:00Z" level=info msg="30.0th: 20962, 20917"
time="2019-05-24T08:54:00Z" level=info msg="40.0th: 21221, 21207"
time="2019-05-24T08:54:00Z" level=info msg="50.0th: 21351, 21351"
time="2019-05-24T08:54:00Z" level=info msg="90.0th: 21466, 21471"

time="2019-05-24T08:54:52Z" level=info msg="Total: 66775 samples"
time="2019-05-24T08:54:52Z" level=info msg="0.1th: 823, 187"
time="2019-05-24T08:54:52Z" level=info msg="1.0th: 3790, 2844"
time="2019-05-24T08:54:52Z" level=info msg="3.0th: 9398, 8046"
time="2019-05-24T08:54:52Z" level=info msg="5.0th: 11577, 11730"
time="2019-05-24T08:54:52Z" level=info msg="10.0th: 15679, 15538"
time="2019-05-24T08:54:52Z" level=info msg="20.0th: 19635, 19683"
time="2019-05-24T08:54:52Z" level=info msg="30.0th: 20907, 20917"
time="2019-05-24T08:54:52Z" level=info msg="40.0th: 21201, 21208"
time="2019-05-24T08:54:52Z" level=info msg="50.0th: 21346, 21352"
time="2019-05-24T08:54:52Z" level=info msg="90.0th: 21466, 21471"

Method used to log the above

   	log.Printf("Total: %d samples", quantiles.Count())
  	for _, quantile := range calcQuantiles {
		percentile, err := stats.Percentile(samples, quantile*100)
		if err != nil {
			log.Errorf("failed to calc percentile - %s", err)
		} else {
			log.Infof("%.1fth: %.0f", quantile*100, quantiles.Query(quantile), percentile)
		}
	}
   }

Please note a big difference in low percentile numbers between the 3 print outs above. Even though one would expect some differences at say 0.1th percentile because 3 print outs above contained different sample numbers. Still there is unexpectedly large variability with only a limited sample number change. Also notice that percentile calculated with github.com/montanaflynn/stats does not display the same variability, which rules out the possibility that the samples added were somehow very different to the previous lot.

So overall I think this library calculates low percentiles incorrectly or the method is not really suitable for this.

I have the data that was used for calculating the above and happy to share it. Just let me know if you want it. (it's less than 1Mb file)

@frioux
Copy link

frioux commented Nov 18, 2019

The whole magic of this algorithm is that instead of calculating exact percentiles it adds an error (called epsilon) to allow more efficient recording of the summaries. The default in this library is 0.01, so a requested percentile of 0.01 is actually somewhere between zero and 0.02. You can request an epsilon of zero but I suspect the cost for that would be too high eventually.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants