Performance Alert message on latest commits #6730

Open
khsrali opened this issue Jan 24, 2025 · 8 comments

khsrali commented Jan 24, 2025

The performance of this particular test is getting slower and slower:

/tests/benchmark/test_json_contains.py::test_wide_json

first warning:
d0c9572#commitcomment-151682807
second warning:
eba6954#commitcomment-151687906
third warning:
208d6a9#commitcomment-151735963

@rabbull should we ignore this? Any idea why it performed super fast on your PR? 😅

pinging @unkcpz, who noticed this

rabbull commented Jan 24, 2025

Sorry, I didn't expect that the benchmarks would be run on every PR.

The fact is this benchmark is always slow, including in my PR. Its purpose is to compare, at extreme data sizes, the performance of SQLite's implementation of JSON contains against PostgreSQL's built-in implementation; it finds that the former is about 2x slower than the latter, which is acceptable.

Is it possible to exclude this benchmark from being run so frequently? It would also be reasonable to simply remove it, since its mission is already accomplished and it doesn't need to stay in the test suite.

EDIT: Let me just remove this benchmark. It doesn't make much sense to keep it in the test suite.

EDIT2: It still makes some sense. I will look into this problem.

rabbull commented Jan 25, 2025

TL;DR: The test is flaky because the data is randomized and the database is sensitive to fluctuations. Fixing the random seed helps but doesn't fully solve the issue; adopting a more robust metric such as the median instead of OPS is the better option.

I ran the benchmark a few times locally and found that the test is indeed flaky -- the time consumed by a single round can vary significantly, ranging from 5 ms to 37 ms, a ~7x difference. I plotted a histogram of the rounds for this case, and it shows a long-tailed distribution: most rounds are quite fast, but a few outliers are significantly slower. The median for this case is 5 ms and the average is 6 ms, both close to the minimum. Below is the histogram (x-axis: time consumption, y-axis: number of occurrences):

[Histogram of round durations: a long-tailed distribution with most rounds near 5 ms and a few slow outliers]

I identified two possible factors contributing to this issue:

  1. The benchmark relies on randomized data generation, so every round sees different test data. This introduces significant variability in performance across rounds.
  2. The operation involves a heavy database query (finding rows whose JSON column satisfies a containment condition). Queries like this are inherently sensitive to fluctuations, even after warmup, due to factors such as caching and CPU/IO load.

The first problem can be addressed by fixing the seed before generating the randomized data. I have implemented this in a local branch and it reduces the variability; however, the maximum remains high and the stddev is still unacceptable, which leads us to the second issue.
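
For reference, a minimal sketch of what pinning the seed could look like; the fixture name and data shape below are made up for illustration and are not the actual ones in tests/benchmark/test_json_contains.py:

```python
import random

import pytest


@pytest.fixture
def wide_json_rows():
    """Generate the randomized JSON payloads from a fixed seed.

    Pinning the seed means every benchmark run sees identical data,
    removing one source of run-to-run variability.
    """
    rng = random.Random(42)  # dedicated, seeded RNG instead of the global one
    return [
        {f"key_{i}": rng.randint(0, 1_000_000) for i in range(100)}
        for _ in range(1_000)
    ]
```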

The second problem is harder to eliminate entirely. Since database operations are inherently variable, the best we can do is adopt better metrics for monitoring. Promising alternatives include using the median or the average instead of relying on OPS, which is overly sensitive to such fluctuations.
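
To make the metric argument concrete: OPS is derived from the mean round time (roughly its reciprocal in pytest-benchmark, as far as I know), so a handful of slow outliers drags it down, while the median barely moves. A small self-contained illustration with synthetic numbers in the same range as the measurements above:

```python
import statistics

# Synthetic round times (seconds) mimicking the long tail observed above:
# most rounds near 5 ms, a few outliers up to ~37 ms.
typical = [0.005] * 95 + [0.006] * 3
clean_run = typical + [0.007, 0.008]
noisy_run = typical + [0.030, 0.037]

for label, rounds in [("clean run", clean_run), ("run with outliers", noisy_run)]:
    mean = statistics.mean(rounds)
    ops = 1.0 / mean  # OPS treated as the reciprocal of the mean round time
    median_ms = statistics.median(rounds) * 1000
    print(f"{label}: ops={ops:.0f}/s, median={median_ms:.1f} ms")

# The two outliers cut OPS by roughly 10% while the median stays at 5.0 ms,
# which is why alerting on the median gives a more stable signal.
```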

My conclusion is that this warning can be ignored for now, as it does not directly impact the functionality. However, I recommend replacing OPS with a better metric such as median to improve monitoring reliability in the future. In the meantime, I can submit a PR to add a fixed seed for randomized data generation to reduce flakiness.

khsrali commented Jan 29, 2025

Wow, thanks a lot @rabbull for the detailed analysis!

I see the problem now. Since GitHub's compute resources are not that reliable (runs can be slower or faster depending on availability, load, etc.), I suggest

introducing a nightly test that executes this particular test (or the entire test suite?) several times (5-10), takes the smallest number among those runs, and compares that with the benchmark baseline.
How about this?

rabbull commented Jan 30, 2025

By

introducing a nightly test that executes this particular test (or the entire test suite?) several times (5-10), takes the smallest number among those runs, and compares that with the benchmark baseline

did you mean selecting the run with the highest OPS among the 5-10 runs, or selecting the fastest iteration across all iterations of the several full runs?

khsrali commented Jan 30, 2025

Yes, the second!
Selecting the fastest iter/sec across the several runs.

rabbull commented Jan 30, 2025

Yes, the second! Selecting the fastest iter/sec across the several runs.

This looks promising to me. However, do we have proper tooling for such a setup?

khsrali commented Jan 30, 2025

Nope :) we don't.
We have nightly tests in place, but they will require some modifications to support this.
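
In case it helps, here is a rough sketch of what such a "best of several runs" nightly check could look like, assuming pytest-benchmark's --benchmark-json report is available (the run count, file names, baseline, and tolerance below are placeholders):

```python
import json
import subprocess

N_RUNS = 5
BASELINE_SECONDS = 0.005  # placeholder: expected fastest round time
TOLERANCE = 1.5           # placeholder: allowed slowdown factor before alerting

fastest_round = float("inf")
for i in range(N_RUNS):
    report = f"bench_run_{i}.json"
    # Each run writes a pytest-benchmark JSON report that we read back.
    subprocess.run(
        [
            "pytest",
            "tests/benchmark/test_json_contains.py::test_wide_json",
            f"--benchmark-json={report}",
        ],
        check=True,
    )
    with open(report) as handle:
        stats = json.load(handle)["benchmarks"][0]["stats"]
    # Keep the single fastest round observed across all runs.
    fastest_round = min(fastest_round, stats["min"])

if fastest_round > BASELINE_SECONDS * TOLERANCE:
    raise SystemExit(
        f"Performance regression: fastest round {fastest_round * 1000:.1f} ms "
        f"is above the allowed {BASELINE_SECONDS * TOLERANCE * 1000:.1f} ms"
    )
```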

rabbull self-assigned this Jan 30, 2025

khsrali commented Feb 17, 2025

@rabbull the performance message is back:
ce9dcf4

Just checking: did we decide to switch it off in the end? I don't remember, can you remind me? :)
