Use Zarr library directly as a parallel write sink. #60
base: master
Conversation
Looks like MCDS.open doesn't like the outputs of these (it fails on the obs_dim lookup with a KeyError), so... bear with me, I guess.
Force-pushed from 9e99092 to e9fcd17.
Siloing out the zarr v3 stuff into a PR for a later date; it was more broken than I thought.
Force-pushed from b7f69d0 to e178fc3.
Okay, I can confirm it actually works now.
Well, it works in the single-chunk, small-dataset case. I don't know whether increasing the chunk size broke it or whether large datasets don't work. Not great either way; testing single-chunk right now.
Judging by my script actually taking the time to process my data this go-around, I think it's reasonable to say that this PR works at chunk size 1 and no other. That being said, I don't actually know that there's an advantage to multi-cell chunks once you cut the copy step out of the equation, since the per-cell writes are fully independent.
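A minimal, hypothetical illustration of the chunk-size point above (not code from this PR): with an obs_dim chunk size of 1, every per-cell write maps to exactly one Zarr chunk, so parallel writers never share a chunk; with larger chunks, worker slices that are not aligned to the chunk size can land in the same chunk and race. The cell counts and sizes below are made up.

```python
# Hypothetical sketch: check whether per-worker cell slices line up with the
# obs_dim chunking of the output Zarr array. Misaligned slices mean two
# workers may write into the same chunk concurrently.
n_cells = 60
chunk_cells = 4        # obs_dim chunk size of the Zarr array (hypothetical)
cells_per_worker = 5   # cells handled by each parallel worker (hypothetical)

for start in range(0, n_cells, cells_per_worker):
    stop = min(start + cells_per_worker, n_cells)
    aligned = start % chunk_cells == 0 and (stop % chunk_cells == 0 or stop == n_cells)
    if not aligned:
        print(f"cells [{start}, {stop}) straddle a chunk boundary -> possible write race")
```

With chunk_cells = 1 every slice is trivially aligned, which is consistent with chunk size 1 being the safe configuration noted above.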
Hi @theAeon, thank you for the PR, and sorry I didn't follow up on this promptly. I'm not actively working on ALLCools right now, and would only merge bug fixes while trying not to make breaking changes or add new functions. I'm not exactly sure what caused your use case to be so slow; we have previously generated MCDS for 100Ks of cells, which seemed to have acceptable speed. Do you want to provide some details about your use case?
If I'm remembering correctly, the difference in our use case is that our region definition is ~400k tiny sites, which makes the output of write_single_zarr much larger and causes I/O hangs. As for breaking changes, well, the goal here is not to break anything, which is why I brought up the chunk-size issue! Still working on validation; my most recent attempt didn't actually work, but I also forgot to apply an unrelated patch, so I'll keep you posted on that one.
I see. Yes, MCDS was not intended to handle 100Ks of small features/mC sites; a much larger feature number may cause performance issues. We had similar needs previously, so I wrote the BaseDS class, which aims to combine a bunch of ALLC files at single-BP level across the genome. Feel free to check whether that is relevant to you.
I think, although I could be incorrect, that the data we're working with is a combination of single-BP sites and very small regions of several BP, so if I can get MCDS to work I would like to. The current status of my run-through is that it appears to be doing the calculations as expected but is not actually writing the data to disk, so I think I may have broken something else when I added the dictionary of region groups. Will keep you posted, as usual.
A new update on this: it appears that the no-output issue I'm having is not due to chunk size or the dictionary of region groups. It seems that above some number of cells it doesn't output anything at all. ...Edit: or not; it just failed on a small set too. Not sure what's happening.
...Oh. A malformed allc_table. Bear with me.
Okay, so it turns out that on small datasets it does in fact work as intended for both single- and multi-cell obs_dim chunks. Will push dataset.py as I have it after confirming that it works at scale.
Looks like it's working, multi-cell chunks and all. Apologies for the delay; I was using the wrong chromosome set on my end (among other things).
Semi-aside: I did just get this working with mpi4py's MPIPoolExecutor. Not sure how you'd want me to integrate that.
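A sketch of one way MPIPoolExecutor could slot in, assuming the per-chunk work is wrapped in a picklable function; this is not the PR's integration, and `write_one_chunk` is a hypothetical stand-in for the real per-chunk counting and zarr write. mpi4py.futures mirrors the concurrent.futures executor API, so the swap is mostly a matter of which executor gets constructed.

```python
# A sketch, not the PR's integration: mpi4py.futures.MPIPoolExecutor exposes
# the same executor interface as concurrent.futures, so the per-chunk zarr
# writes can be farmed out across MPI ranks instead of local processes.
from mpi4py.futures import MPIPoolExecutor


def write_one_chunk(cell_start):
    # Placeholder (hypothetical) for the real per-chunk counting plus the
    # direct write into the shared Zarr store.
    return cell_start


if __name__ == "__main__":
    starts = range(0, 60, 1)  # one cell per chunk, per the discussion above
    with MPIPoolExecutor() as pool:
        for _ in pool.map(write_one_chunk, starts):
            pass  # iterate to surface any worker exceptions
```

This has to be launched under an MPI runtime, e.g. `mpiexec -n 8 python -m mpi4py.futures write_mcds.py` (script name hypothetical).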
This eliminates the full readback of the zarr chunks that was previously done to merge the outputs of the parallel _write_single_zarr block, and it is saving me a significant amount of time on large datasets. Specifically, a 60-cell test dataset that was spending several hours in the zarr merge phase completed in 5 minutes, with no I/O-limited copy step.
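For context on the pattern, here is a minimal sketch of the direct-write sink described above, assuming zarr v2 semantics (the v3 work was split out); it is not the PR's dataset.py code, and the function names, array shapes, and dtype are hypothetical. Each worker opens the same preallocated store and writes only its own obs_dim slice, so there is nothing to merge and nothing to read back.

```python
# Minimal sketch of writing to a shared Zarr store in parallel, with one cell
# per obs_dim chunk so concurrent workers never touch the same chunk.
from concurrent.futures import ProcessPoolExecutor

import numpy as np
import zarr


def count_one_cell(cell_idx, n_regions):
    # Placeholder for the real per-cell mC/cov counting from an ALLC file.
    rng = np.random.default_rng(cell_idx)
    return rng.integers(0, 100, size=(1, n_regions, 2), dtype="uint32")


def write_one_cell(store_path, cell_idx, n_regions):
    # Each worker opens the same store and writes only its own obs_dim slice;
    # the old read-back-and-merge step disappears entirely.
    z = zarr.open(store_path, mode="r+")
    z[cell_idx:cell_idx + 1, :, :] = count_one_cell(cell_idx, n_regions)
    return cell_idx


def main(store_path="mcds_counts.zarr", n_cells=60, n_regions=400_000):
    # Preallocate the full (cell, region, count_type) array once, chunked by
    # single cells along obs_dim to match the per-worker writes.
    zarr.open(
        store_path,
        mode="w",
        shape=(n_cells, n_regions, 2),
        chunks=(1, n_regions, 2),
        dtype="uint32",
    )
    with ProcessPoolExecutor() as pool:
        futures = [
            pool.submit(write_one_cell, store_path, i, n_regions)
            for i in range(n_cells)
        ]
        for f in futures:
            f.result()  # surface any worker exceptions


if __name__ == "__main__":
    main()
```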