Use Zarr library directly as a parallel write sink. #60


Open
wants to merge 8 commits into base: master

Conversation

theAeon commented Dec 21, 2024

This eliminates the full readback of the zarr chunks that was previously done to merge the outputs of the parallel _write_single_zarr step, and it is saving me a significant amount of time on large datasets.

Specifically, a 60-cell test dataset that was taking several hours in the zarr merge phase completed in 5 minutes, with no I/O-limited copy step.
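The idea, roughly (a minimal sketch with hypothetical names, shapes, and a stand-in counting step, not ALLCools' actual _write_single_zarr code): pre-create one zarr array with its final shape and per-cell chunks, then have each worker write its cell's row straight into that store, so nothing has to be read back and merged afterwards.

```python
# Hypothetical sketch (names/shapes made up): one shared zarr array as the
# write sink, pre-created with its final shape, filled in place by workers.
from concurrent.futures import ProcessPoolExecutor

import numpy as np
import zarr

N_CELLS, N_REGIONS = 60, 400_000
STORE = "counts.zarr"


def write_one_cell(cell_idx: int) -> None:
    # stand-in for the real per-cell counting work
    counts = np.random.randint(0, 10, size=N_REGIONS, dtype="uint32")
    z = zarr.open(STORE, mode="r+")
    # obs_dim chunks are one cell wide, so this write covers whole chunks
    # and needs no locking or later merge/readback step
    z[cell_idx, :] = counts


if __name__ == "__main__":
    # create the sink once, with its final shape and per-cell chunking
    zarr.open(STORE, mode="w", shape=(N_CELLS, N_REGIONS),
              chunks=(1, N_REGIONS), dtype="uint32")
    with ProcessPoolExecutor() as pool:
        list(pool.map(write_one_cell, range(N_CELLS)))
```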

theAeon commented Feb 6, 2025

Looks like MCDS.open doesn't like the outputs of these (it fails on the obs_dim lookup with a KeyError), so... bear with me, I guess.

theAeon commented Feb 16, 2025

Siloing off the zarr v3 stuff into a PR for a later date; it was more broken than I thought.

theAeon commented Feb 17, 2025

Okay, I can confirm it actually works now.

theAeon commented Feb 17, 2025

Well, it works in the single-chunk, small-dataset case. I don't know whether increasing the chunk size broke it or whether large datasets don't work. Not great either way; testing single-chunk right now.

theAeon commented Feb 17, 2025

Judging by my script actually taking the time to process my data this go-around, I think it's reasonable to say that this PR works with a chunk size of 1 and no other.

That being said, I don't actually know that there's an advantage to multi-cell chunks once you cut the copy step out of the equation, since the cells are fully independent.
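For what it's worth, a hypothetical illustration of that constraint (made-up shapes, not ALLCools code): with multi-cell obs_dim chunks, concurrent writers are only safe when each worker owns a whole, chunk-aligned block of cells, whereas with a chunk size of 1 every per-cell write already owns its chunk.

```python
# Hypothetical illustration of chunk alignment: each block below spans exactly
# one obs_dim chunk, so blocks could be handed to separate workers without two
# workers ever rewriting the same chunk.
import numpy as np
import zarr

N_CELLS, N_REGIONS, CELLS_PER_CHUNK = 64, 1_000, 8
z = zarr.open("multicell.zarr", mode="w", shape=(N_CELLS, N_REGIONS),
              chunks=(CELLS_PER_CHUNK, N_REGIONS), dtype="uint32")

for start in range(0, N_CELLS, CELLS_PER_CHUNK):
    block = np.ones((CELLS_PER_CHUNK, N_REGIONS), dtype="uint32")  # stand-in counts
    # chunk-aligned slice: safe to parallelize; a slice straddling a chunk
    # boundary would force two writers to touch the same chunk
    z[start:start + CELLS_PER_CHUNK, :] = block
```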

lhqing commented Feb 21, 2025

Hi @theAeon, thank you for the PR, and sorry I didn't follow this promptly. I'm not actively working on ALLCools right now, and would only merge bug fixes while trying not to make breaking changes or add new functions.

I'm not exactly sure what caused your use case to be so slow; we have previously generated MCDS for hundreds of thousands of cells with acceptable speed. Do you want to provide some details about your use case?

> Specifically, a 60-cell test dataset that was taking several hours in the zarr merge phase completed in 5 minutes, with no I/O-limited copy step.

theAeon commented Feb 21, 2025

If I'm remembering correctly, the difference in our use case is that our region definition is ~400k tiny sites, which makes the output of write_single_zarr much larger and causes I/O hangs.

As to breaking changes, well, the goal here is not to break anything, which is why I brought up the chunk-size thing! Still working on validation; my most recent attempt didn't actually work, but I also forgot to apply an unrelated patch, so I'll keep you posted on that one.

lhqing commented Feb 24, 2025

I see. Yes, MCDS was not intended to handle hundreds of thousands of small features/mC sites; a much larger feature number may cause performance issues.

We had similar needs previously, so I wrote this BaseDS class, which aims to combine a bunch of ALLC files at single-BP level across the genome. Feel free to check whether it is relevant to you.
https://github.com/lhqing/ALLCools/blob/c9f7be2ffd650c1a5430d2f6fc252c60f0c28e33/ALLCools/count_matrix/base_ds.py#L559C5-L559C21

theAeon commented Feb 24, 2025

I think (although I could be incorrect) that the data we're working with is a combination of single-BP sites and very small regions of several BP, so if I can get MCDS to work, I would like it to.

Current status of my run-through is that it appears to be doing the calculations as expected but is not actually writing the data to disk, so I think I may have broken something else when I added the dictionary of regiongroups. Will keep you posted, as usual.

theAeon commented Feb 28, 2025

A new update on this: it appears that the no-output issue I'm having is not due to chunk size or the dictionary of regiongroups. It seems that above some number of cells, it's not outputting anything at all.

...edit: or not, it just failed on a small set too. Not sure what's happening.

theAeon commented Feb 28, 2025

...oh: malformed allc_table. Bear with me.

theAeon commented Feb 28, 2025

Okay, so it turns out that on small datasets it does in fact work as intended for both single and multiple obs_dim chunks. Will push dataset.py as I have it after confirming that it works at scale.

theAeon commented Mar 7, 2025

Looks like it's working, multi-cell chunks and all. Apologies for the delay; I was using the wrong chromosome set on my end (among other things).

theAeon commented Mar 7, 2025

Semi-aside: I did just get this working with mpi4py's MPIPoolExecutor. Not sure how you'd want me to integrate that.
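In case it helps, a rough sketch of what that looks like (the worker function is a hypothetical stand-in): mpi4py.futures.MPIPoolExecutor largely mirrors concurrent.futures.ProcessPoolExecutor, so the per-cell writes can be farmed out across MPI ranks with the same map-style call.

```python
# Hedged sketch: swap ProcessPoolExecutor for MPIPoolExecutor to spread the
# per-cell zarr writes across MPI ranks/nodes. Typically launched with
# something like: mpiexec -n <ranks> python -m mpi4py.futures this_script.py
from mpi4py.futures import MPIPoolExecutor


def write_one_cell(cell_idx: int) -> int:
    # hypothetical stand-in: compute counts for one cell and write them
    # into the shared zarr store, as in the sketch further up the thread
    return cell_idx


if __name__ == "__main__":
    with MPIPoolExecutor() as pool:
        list(pool.map(write_one_cell, range(60)))
```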
