improving compliance tests #2585
Comments
Reducing the number of rows in the input domain sounds good to me. As well as the part about reducing the number of encoded data sources, as they seem redundant.
👍 I agree that compliance should be limited to checking... well... compliance, and not try to be a performance test at the same time, and the steps you outline should move towards that.
It looks like encoded tests are already disabled for many data sources in this pull. I think we could possibly also remove encoded tests for mongodb. WDYT?
As a case in point, I tried a quick experiment using the following names.txt:
In the official version, the user name is a combination of a random number of random entries from names.txt. I changed that to use exactly one random entry from the file. With these changes I was able to locally reproduce all the unicode errors I've discovered in the long test mentioned in #2522. Compared to that massive test, this one required only 100 users, and I was able to use all data sources. The compliance test finished in 675 seconds, using only 4 CPU cores on my machine. I didn't reproduce the SQL Server mismatches, but I think that could also be done if we generated a small (<= 10) list of random numbers, and then took a random element from that list when generating each numerical entry. Perhaps the same could be done with string values too. Also, as an internal improvement, we could consider using
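For illustration, here is a minimal sketch (in Elixir, since that is the project's language) of the two ideas above: building the user name from exactly one random entry of names.txt, and drawing numeric values from a small pre-generated pool. The module and function names are made up for this example and are not the actual generator code.

```elixir
# Hypothetical sketch, not the real generator: one name per user and a small pool
# of reusable random numbers, so the same values repeat often enough to survive
# anonymization even with ~100 users.
defmodule ComplianceDataSketch do
  # load the (reduced) name list once
  @names "names.txt" |> File.read!() |> String.split("\n", trim: true)

  # exactly one random entry from names.txt, instead of a concatenation of several
  def random_name, do: Enum.random(@names)

  # a pool of <= 10 random numbers, generated once per run
  def number_pool(size \\ 10, max \\ 1_000_000) do
    for _ <- 1..size, do: :rand.uniform(max)
  end

  # every numerical field picks from the same small pool
  def random_number(pool), do: Enum.random(pool)
end
```

With a pool like this, the same number should appear many times across 100 users, which is what would hopefully make value-level mismatches (like the SQL Server ones) reproducible on a small data set.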
🎉 Seems like a good enough reason to go with this to me.
We have a decently working data generator. It is fairly simple. Let us not spend resources on a rewrite with
I like that idea. Even though this isn't really supposed to be a performance test, it still gives an interesting view of how the different data sources compare.
@sasa1977 can this be considered complete now? |
I still have a bit of work I'd like to do here. I'll try to wrap it up this week. |
@sebastian @cristianberneanu @obrok
Based on my experience reported in #2522 about compliance with a large user set, I'd like to open a discussion about how we can improve the detected_problems / running_time ratio of our compliance tests.
Currently, our compliance tests seem to conflate two separate concerns: detecting discrepancies between different data backends (including the emulator layer), and informal performance/stress testing. I believe that we should treat these cases separately, and in particular use compliance tests only for the former purpose (detecting discrepancies). I think that this could allow us to reduce the input set for compliance, and therefore improve both the running time of the compliance tests and the confidence we have in them.
One approach worth considering is to reduce the input domain (.txt files with names and words). Currently, this domain is fairly large, so some problematic inputs (e.g. the name
┬─┬ノ( º _ ºノ) ( ͡° ͜ʖ ͡°)
) are anonymized away in smaller user sets, and essentially not tested. I think we'd be better off if we used only a couple of names, say one with the standard English alphabet and a couple more with unicode characters. If we had fewer than 10 names, I think it's very probable that every input would surface in the output, even on a smaller test set of 100 users. As a result, every error that can be caused by some input would be detected.
I propose we use the same approach for the words used in fields such as note title, content, and note change. In addition, I think we should generate fewer words per field to reduce the pressure on row splitter functions.
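As a rough illustration of what such a reduced domain could look like (a hypothetical sketch, not the actual generator; the concrete names, words, and function names are assumptions):

```elixir
# Hypothetical sketch of a tiny input domain: a handful of names (one plain ASCII,
# a few with unicode, including the problematic one above) and a short word list,
# with at most a few words per text field.
defmodule SmallDomainSketch do
  @names [
    "John Smith",
    "Łukasz Grzegorz",
    "中村 さくら",
    "┬─┬ノ( º _ ºノ) ( ͡° ͜ʖ ͡°)"
  ]

  @words ~w(lorem ipsum dolor sit amet)

  def random_name, do: Enum.random(@names)

  # one to three words per field, instead of long random sentences
  def random_text_field do
    @words
    |> Enum.take_random(Enum.random(1..3))
    |> Enum.join(" ")
  end
end
```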
As an added benefit, a smaller data input should also produce smaller error output, which is hopefully easier to understand. In the current version, some errors produce a huge left-vs-right output which is quite hard to analyze.
Finally, I wonder if we need to test the encoded version for every data source. As far as I understand, in the encoded compliance test we simply fetch all the data from the database and then everything else is emulated. We currently use 9 data sources, with plans to add more. If we only used the encoded version for one data source (say the first one), we might cut the running time in half, if not more, given that queries on an encoded data source are usually slower.
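To make that last point concrete, here is a hypothetical sketch of how the test matrix could be trimmed; the data source names and the shape of the matrix are assumptions, not the actual compliance harness:

```elixir
# Hypothetical sketch: run the regular compliance test against every data source,
# but the (slower) encoded variant against only the first one.
data_sources = ["postgres", "mysql", "mongodb", "sqlserver", "oracle"]

test_matrix =
  Enum.map(data_sources, &{&1, :regular}) ++ [{hd(data_sources), :encoded}]

# test_matrix now contains a single :encoded entry instead of one per data source,
# which should remove most of the encoded-query running time.
```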
In summary, I believe that with the proposed changes we could reduce the compliance running time, improve our confidence in the compliance tests, and make the error output easier to understand. I think that we don't need gazillions of users and a plethora of random inputs to get there.
Thoughts?