filter: Split filtering and subsampling #1432
base: master
Conversation
This ensures that no output files are written on error.
This is more memory-efficient for columns with many duplicate values, which can be expected in most use cases.¹

¹ <https://pandas.pydata.org/pandas-docs/version/1.5/user_guide/scale.html#use-efficient-datatypes>
This simplifies the process and allows for more portable subsampling functions.
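The memory savings from efficient datatypes mentioned above can be illustrated with a small sketch. This is not Augur's code, just a demonstration of the Pandas advice linked in the footnote: a column with many repeated values shrinks substantially when converted to the `category` dtype, which stores each distinct value once plus a small integer code per row.

```python
import pandas as pd

# Hypothetical metadata-like column with heavy duplication, as expected
# for fields such as region or country.
n = 100_000
regions = pd.Series(
    ["Asia", "Europe", "North America", "Africa"] * (n // 4),
    dtype="object",
)
as_category = regions.astype("category")

# Categorical storage: unique values once, plus 1-byte codes per row here
# (only 4 categories), versus a full Python string object per row.
object_bytes = regions.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
assert category_bytes < object_bytes
```

The same conversion is what `astype("category")` does on a metadata DataFrame column; the savings grow with the ratio of rows to distinct values.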
52fbad6 passes all Cram tests but fails on the ncov trial run:
```
File "/nextstrain/augur/augur/io/metadata.py", line 155, in read_metadata
  return pd.read_csv(
…
OSError: [Errno 12] Cannot allocate memory
```
This means that given its current and growing size, the SARS-CoV-2 dataset is impractical to load into memory all at once despite optimizations such as loading a subset of columns and 8d7206c. The only solution for more modular and portable functions in the codebase is an on-disk approach, which Pandas does not support. There is a spectrum of alternatives, which I think can be divided into two categories:
1. A Pandas-like alternative such as Dask. Unsure how portable the existing Pandas logic is to Dask, but ideally this would be close to a library swap with minimal code changes.
2. A database file approach such as SQLite, started in filter: Rewrite using SQLite3 #1242. It's essentially a rewrite of `augur filter`, requiring extensive testing. The other downside is that this will require some form of Pandas round-tripping to continue supporting the widely used Pandas-based queries in `--query`.
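The database idea in option (2) can be sketched with the standard-library `sqlite3` module. This is a minimal illustration of on-disk filtering, not Augur's actual schema or code from #1242; the table and column names are hypothetical.

```python
import sqlite3

# Sketch of the on-disk approach: load metadata rows into a SQLite table,
# then filter with SQL instead of holding a full DataFrame in memory.
# An in-memory database is used here for brevity; a real run would point
# sqlite3.connect() at a file so data never has to fit in RAM at once.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (strain TEXT, region TEXT, date TEXT)")
conn.executemany(
    "INSERT INTO metadata VALUES (?, ?, ?)",
    [
        ("strainA", "Asia", "2021-01-01"),
        ("strainB", "Europe", "2021-02-01"),
        ("strainC", "Asia", "2021-03-01"),
    ],
)

# A filter expressed in SQL rather than as a Pandas query.
kept = [
    row[0]
    for row in conn.execute(
        "SELECT strain FROM metadata WHERE region = ? ORDER BY strain",
        ("Asia",),
    )
]
```

The round-tripping concern mentioned above arises because `--query` expressions are Pandas syntax, so matching rows would need to be pulled back out of SQLite into a DataFrame to evaluate them.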
I think it's reasonable to explore (1) on top of current changes in this PR to see if it's a viable option. Even if the end goal is (2), (1) would provide a good stepping stone in the direction of the database approach.
I agree with this, @victorlin, and I'm sorry I didn't notice this PR direction earlier, since I probably could have saved you some effort here. The original idea that motivated this PR's work depends unrealistically on keeping all data in memory, but `augur filter` couldn't load all available SARS-CoV-2 data in memory as of June 2021. At that time, @tsibley and I discussed some of the options you've mentioned above, including Dask and SQLite databases. Even though we both preferred a database solution, we opted for the current chunked iterator approach because it was simpler to implement under the time pressure to keep the ncov workflow running.

Now that we have a functional (albeit slow) filter command, we have more freedom to push in a more experimental direction like the database approach you've led. If a database-backed `augur filter` is a key step toward supporting flexible subsampling, it makes sense to me to devote more effort as a team toward that work. It also seems like the current `augur subsample` prototype could be useful immediately even if it isn't as fast or otherwise optimal as we'd like. I think this might be @jameshadfield's feeling about the situation, too. I could imagine that the path to better and faster subsampling could be to 1) decide on the minimal viable …
@huddlej no worries at all, thanks for chiming in! I agree that the external interface is more important. I opened this PR because the thought of ditching the chunked approach keeps coming up in my mind while thinking about the internal implementation.

When we get back to the work on speeding up internal subsampling logic, some experimentation with Dask may be worthwhile to check if it's a viable alternative to a complete rewrite with SQLite (or at least used in place of the chunked Pandas querying currently used in #1242).

EDIT: Dumped some Dask experimentation into a commit: 946d93f
Yes -- …
How far could we get with just using tsv-utils?
I just took …

The weighted sampling available in …
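The weighted sampling mentioned above can be sketched in pure Python. This is an illustrative implementation of one standard technique (the Efraimidis–Spirakis key method for weighted sampling without replacement), not the algorithm used by Augur or tsv-utils; the function name and signature are hypothetical.

```python
import random

def weighted_sample(items, weights, k, rng):
    """Weighted sampling without replacement: assign each item the random
    key u**(1/w) for u uniform in (0, 1), then keep the k largest keys
    (Efraimidis & Spirakis, 2006)."""
    keyed = sorted(
        ((rng.random() ** (1.0 / w), item) for item, w in zip(items, weights)),
        reverse=True,
    )
    return [item for _, item in keyed[:k]]

# Hypothetical usage: strain "D" is 10x more likely to be kept than the rest.
picked = weighted_sample(["A", "B", "C", "D"], [1, 1, 1, 10], k=2,
                         rng=random.Random(42))
```

A one-pass variant of the same key trick works as a streaming heap, which is the kind of property that matters when metadata is processed in chunks rather than all at once.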
Description of proposed changes
This is motivated by the idea that a future, more sophisticated subsampling command would benefit from exposing modularized subsampling functions from the existing `augur filter` code. To do that, it is much easier if subsampling is decoupled from other parts of `augur filter`, namely the parts that coordinate strain lists and other variables among different chunks of metadata.

With the recent optimizations to reduce memory usage, it might be feasible to do everything in one pass of the metadata, with all rows loaded into memory at once.
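The contrast between the chunked approach and a single in-memory pass can be sketched with standard-library CSV reading. This is illustrative only, not Augur's code: the point is that per-chunk filtering keeps memory bounded but forces any globally aware step (like group-by subsampling) to coordinate state across chunks, which a single pass avoids.

```python
import csv
import io

# Hypothetical tab-separated metadata standing in for a real file.
TSV = "strain\tregion\nA\tAsia\nB\tEurope\nC\tAsia\n"

def iter_chunks(rows, size):
    """Yield lists of up to `size` rows, mimicking chunked reading."""
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

# Chunked pass: each chunk is filtered independently, and results (here,
# the kept strain list) must be accumulated across chunks.
kept = []
reader = csv.DictReader(io.StringIO(TSV), delimiter="\t")
for chunk in iter_chunks(reader, size=2):
    kept.extend(row["strain"] for row in chunk if row["region"] == "Asia")
```

With all rows in memory at once, the accumulation step disappears and a subsampling function can be a plain function of the whole table, which is what makes the decoupled functions more portable.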
Related issue(s)
Checklist