Additional asv tests #2185
Conversation
@@ -664,6 +665,322 @@ def clear_symbols_cache(self):
    lib._nvs.version_store._clear_symbol_list_keys()
I have encountered a few confusing things, not related to this PR itself but to the previous one. They are small, but I think it would be good to address them in this PR while we're still at it. Adding them as replies here, as I can't add a comment to an unedited part :/
Here the get_library_names is a bit confusing to use. We have in a few places code like:
self.get_library_names(suffix)[0]
which is confusing: what does the [0] stand for?
What do you think about instead passing the function an argument saying whether you want a modifiable or a persistent library? E.g. I think this would be more readable:
self.get_library_name(suffix, lib_type=LibraryType.PERM)
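A minimal sketch of the enum-based variant (the enum values and the name format are hypothetical, just to show the shape):

```python
from enum import Enum

class LibraryType(Enum):
    PERSISTENT = "PERM"
    MODIFIABLE = "MOD"

def get_library_name(suffix, lib_type=LibraryType.PERSISTENT):
    # Hypothetical name format: one name per (type, suffix) pair,
    # instead of indexing into a returned tuple with [0]
    return f"{lib_type.value}_lib_{suffix}"
```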
I like the idea! Makes lots of sense and will be clear.
I see you've created more functions, get_library_name and get_modifiable_library_name, and kept the old one. I meant just replacing the original get_library_name to take an argument of a new enum type. That should simplify this.
return next

class GeneralSetupOfLibrariesWithSymbols(EnvConfigurationBase):
I personally find the names of all these EnvironmentConfigurations a bit confusing. All of them start with General, which I think just makes the names longer. Also, they live inside environment_setup, so when importing environment_setup.General... it's already clear that it is for a setup, so maybe let's also drop the Setup from the naming.
So what do you think about the following renames:
GeneralSetupLibraryWithSymbols -> SingleLibrary
GeneralSetupSymblsVersionsSnapshots -> LibrariesWithVersionAndSnapshots
GeneralUseCaseNoSetup -> NoSetup
GeneralAppendSetup -> LibraryWithAppendData
GeneralSetupOfLibrariesWithSymbols -> LibrariesPerNumSymbols
The last one to indicate that each library is for a specific number of symbols.
I haven't thought too much about the names, but it would be good for a name to convey e.g. that this class will maintain many libraries and that different libraries will have different numbers of symbols.
(list_rows, list_cols) = self._get_symbol_bounds()

for num_symbols in self._params[self.param_index_num_symbols]:
    lib = self.get_library(num_symbols)
This will raise if the library does not exist? Should we have a try/catch to return False if something in the check raised? I think that holds for all check_ok steps, so maybe the try/catch can live in setup_environment?
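A sketch of what the centralised try/catch could look like (class and hook names here are illustrative, not the actual framework API):

```python
class EnvConfigurationBase:
    """Illustrative base class; not the real framework code."""

    def _check_ok(self):
        # Subclass hook; may raise, e.g. get_library() on a missing library
        raise NotImplementedError

    def check_ok(self):
        # Centralised try/except: any raising check degrades to False
        try:
            return bool(self._check_ok())
        except Exception:
            return False

    def setup_environment(self):
        # Rebuild only when the check failed (or raised)
        if not self.check_ok():
            self._create_environment()

    def _create_environment(self):
        pass  # hypothetical rebuild hook

class MissingLibrarySetup(EnvConfigurationBase):
    def _check_ok(self):
        raise LookupError("library does not exist")
```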
I think you missed this one?
start = time.time()
for sym_num in range(symbols_number):
    for row in list_rows:
        for col in list_cols:
This looks like surprising behavior to me. Won't we generate num_symbols * num_rows * num_cols symbols instead of just num_symbols?
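To illustrate the concern, a tiny model of the nested loops above (the sizes are made up) shows the combinatorial count:

```python
# Made-up sizes; models the sym_num/row/col nesting in the snippet above
symbols_number, list_rows, list_cols = 3, [10, 20], [1, 2]

written = []
for sym_num in range(symbols_number):
    for row in list_rows:
        for col in list_cols:
            written.append((sym_num, row, col))

# 3 * 2 * 2 = 12 symbols written, not 3
assert len(written) == symbols_number * len(list_rows) * len(list_cols)
```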
You missed this?
You had a few missed comments.
A few larger notes:
- As discussed offline with Georgi, we should simplify this framework quite a bit, as it feels overengineered. But we can probably leave that as a separate task so these new benchmarks get merged.
- Could you please test that all of the new benchmarks you created are running in the CI and are working? I think there are some bugs in them and they would just fail.
- I think all of the real benchmarks are copy-pasted from the previous ones. Is the plan to remove the old ones?
""" | ||
Returns a date range selecting last X% of rows of dataframe | ||
pass percents as 0.0-1.0 | ||
""" |
This function is probably useful in general; should we move it to utils.py?
self.df: pd.DataFrame = self.setup_env.generate_dataframe(num_rows, self.setup_env.default_number_cols)

# Construct read request with date_range
self.date_range = self.get_last_x_percent_date_range(num_rows, 0.05)
What is the intention of this date_range? It should be a tuple of (start, end) dates, but I think get_last_x_percent_date_range returns a list of dates with an equal frequency?
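A sketch of the distinction using pandas (the 5% figure mirrors the call above; that the read filter wants a (start, end) pair is my assumption about the intent):

```python
import pandas as pd

num_rows, pct = 100, 0.05
index = pd.date_range("2024-01-01", periods=num_rows, freq="min")

# A list of timestamps: what the helper seems to return...
tail_dates = index[int(num_rows * (1 - pct)):]

# ...versus the (start, end) pair a date_range read filter expects
date_range = (tail_dates[0], tail_dates[-1])
```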
I'm actually surprised this benchmark even works. Have you tried running them?
ideally should use process id as identification for symbol, library etc.

Typical use is to generate 4 or more dataframes. With first you will initiate
write to a symbol and with next you can do appends poping out from list
typo poping
assert len(sequence_df_list) > 0
start = sequence_df_list[0].head(1).index.array[0]
last = sequence_df_list[-1].tail(1).index.array[0]
return (start, last)
Won't df.index[0], df.index[-1] give you these?
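For reference, a small pandas example showing the equivalence:

```python
import pandas as pd

df = pd.DataFrame({"a": range(5)},
                  index=pd.date_range("2024-01-01", periods=5, freq="D"))

# df.index[0] / df.index[-1] avoid building the intermediate
# one-row frames that head(1)/tail(1) create
assert df.index[0] == df.head(1).index.array[0]
assert df.index[-1] == df.tail(1).index.array[0]
```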
self.__init_time_number = TimestampNumber.from_timestamp(initial_timestamp, self._frequency)
return self

def set_frquency(self, freq):
typo frquency
""" | ||
Sets up multiple libraries, each containing specified number of symbols | ||
and each symbols having specified row numbers (col numbers can be defined also) | ||
Accepts ASV params:" |
typo :"
else:
    start = start_timestamp

df = DFGenerator.generate_wide_dataframe(num_rows=number_rows, num_cols=self.__number_cols,
It would be good to make sure these generated a good variety of strings, including unicode strings, which we have to handle very differently.
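One possible way to get that variety (the alphabets and the helper are hypothetical, not part of DFGenerator):

```python
import random

# Hypothetical alphabets covering distinct string-handling paths
ALPHABETS = [
    "abcdefghijklmnopqrstuvwxyz",   # plain ASCII
    "àéîöűçñß",                     # accented Latin
    "αβγδεζмнопр",                  # Greek / Cyrillic
    "日本語中文한국",                 # CJK / Hangul
]

def random_string(rng, alphabet, length=8):
    return "".join(rng.choice(alphabet) for _ in range(length))

rng = random.Random(42)
# Round-robin over alphabets guarantees unicode strings appear
samples = [random_string(rng, ALPHABETS[i % len(ALPHABETS)])
           for i in range(20)]
```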
assert setup.check_ok()

# Changing parameters should trigger not ok for setup
setup.set_with_metadata_for_each_version()
setup.set_with_snapshot_for_each_version()
setup.set_params([2, 3])
assert not setup.check_ok()

@classmethod
def test_setup_multiple_libs_with_symbols0(cls):
More meaningful name than test_setup_multiple_libs_with_symbols0
Just parking this review for now while you and Ivo discuss the design
assert not setup.check_ok()
setup.setup_environment()
#assert setup.check_ok()
commented out code
ln = setup.get_library_name(LibraryType.PERSISTENT, num_syms)
lib = ac.get_library(ln)
symbols_list = lib.list_symbols()
setup.logger().info(symbols_list)
How big are the symbol lists? This log output could be huge
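For example, logging a count plus a small sample instead (the helper name is made up):

```python
import logging

logger = logging.getLogger("setup")

def log_symbols(symbols_list, sample=5):
    # Hypothetical helper: log a count and a small sample,
    # not the full (possibly huge) symbol list
    logger.info("library has %d symbols, e.g. %s",
                len(symbols_list), symbols_list[:sample])
```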
for num_syms in params[0]:  # the symbols list
    ln = setup.get_library_name(LibraryType.PERSISTENT, num_syms)
    lib = ac.get_library(ln)
    symbols_list = lib.list_symbols()
Remember that making this call will compact the symbol list cache, so may affect any benchmarks of symbol list performance
assert setup.has_library(num_syms)
lib = setup.get_library(num_syms)
symbol = setup.get_symbol_name(num_syms - 1, rows, cols)
assert lib.read(symbol).data is not None
This assertion seems quite slow and expensive for what it tells us. If you have to have it, you should at least call head rather than reading the entire symbol.
class AWSQueryBuilderFunctions:
    """
    This is same test as :LocalQueryBuilderFunctions:`LocalQueryBuilderFunctions`
Is there really no way to share the code between them, for example by pulling out a subclass?
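One possible shape, with only the storage-specific bits varying per subclass (the class names mirror the PR's, the bodies are placeholders, not the real benchmarks):

```python
# Illustrative only: shared benchmark bodies in one base class,
# backend choice isolated in thin subclasses
class QueryBuilderFunctionsBase:
    storage = None  # subclasses declare which backend they set up

    def time_filter(self):
        # Shared benchmark body, identical for every backend
        return f"filter on {self.storage}"

class LocalQueryBuilderFunctions(QueryBuilderFunctionsBase):
    storage = "lmdb"

class AWSQueryBuilderFunctions(QueryBuilderFunctionsBase):
    storage = "s3"
```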
AWS30kColsWideDFLargeAppendDataModify.number)

class LMDB30kColsWideDFLargeAppendDataModify(AWSLargeAppendDataModify):
This looks very odd, why is there an LMDB class inheriting from an AWS class?
#### Reference Issues/PRs

The test has a special fixture to set up the needed environment (not in the test itself, as that slows execution extremely under memray): it creates a library with library options predefined by the test, then adds a symbol and creates many versions and a snapshot for each version. The dataframes created grow in size when the library is dynamic.

The test is executed 2 times with different library options for segment size, dynamic type on/off, and encoding version. It covers head and tail executions with different parameters over different types of versions/snapshots of a symbol.

NOTE: utils.py is not part of this PR. It is part of #2185, but is included here to reuse code, as the other PR is not yet merged.

Additional notes:

Why the Linux threshold is so high, see:
- 3.11 - https://github.com/man-group/ArcticDB/actions/runs/13517989453/job/37771476992?pr=2199
- 3.9 - https://github.com/man-group/ArcticDB/actions/runs/13517989453/job/37771214446

What can be done to address massive leaks which should not be considered leaks is filtering, see:
- https://github.com/man-group/ArcticDB/actions/runs/13517989379/job/37770659554?pr=2199

Overall, I raised an issue to address flakiness and to have stress tests run on a debug build, perhaps only a single Python version on only 3 runners with 3 different OSes. That would make it possible to filter out frames we know are OK, as in this example, and reduce time for other functional tests.

#### What does this implement or fix?

## Change Type (Required)
- [x] **Patch** (Bug fix or non-breaking improvement)
- [ ] **Minor** (New feature, but backward compatible)
- [ ] **Major** (Breaking changes)
- [ ] **Cherry pick**

#### Any other comments?

#### Checklist

<details>
<summary>Checklist for code changes...</summary>

- [ ] Have you updated the relevant docstrings, documentation and copyright notice?
- [ ] Is this contribution tested against [all ArcticDB's features](../docs/mkdocs/docs/technical/contributing.md)?
- [ ] Do all exceptions introduced raise appropriate [error messages](https://docs.arcticdb.io/error_messages/)?
- [ ] Are API changes highlighted in the PR description?
- [ ] Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?
</details>

---------

Co-authored-by: Georgi Rusev <Georgi Rusev>
Reference Issues/PRs
More tests migrated to asv with S3 and a new framework that allows LMDB, Amazon S3 and others.
ASV run on S3 without errors here: https://github.com/man-group/ArcticDB/actions/runs/13414624562/job/37472454850
New link after all modifications: https://github.com/man-group/ArcticDB/actions/runs/13703208873
What does this implement or fix?
Change Type (Required)
Any other comments?
Checklist
Checklist for code changes...