Use Zarr library directly as a parallel write sink. #60


Open
wants to merge 8 commits into base: master

Conversation

theAeon commented Dec 21, 2024

This eliminates the full readback of the zarr chunks that was previously done to merge the outputs of the parallel _write_single_zarr step, and it is saving me a significant amount of time on large datasets.

Specifically, a 60-cell test dataset that was taking several hours in the zarr merge phase completed in 5 minutes, with no I/O-limited copy step.
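The idea, roughly (a minimal sketch with hypothetical names, shapes, and a stand-in counting step, not ALLCools' actual _write_single_zarr code): pre-create one zarr array with its final shape and per-cell chunks, then have each worker write its cell's row straight into that store, so nothing has to be read back and merged afterwards.

```python
# Hypothetical sketch (names/shapes made up): one shared zarr array as the
# write sink, pre-created with its final shape, filled in place by workers.
from concurrent.futures import ProcessPoolExecutor

import numpy as np
import zarr

N_CELLS, N_REGIONS = 60, 400_000
STORE = "counts.zarr"


def write_one_cell(cell_idx: int) -> None:
    # stand-in for the real per-cell counting work
    counts = np.random.randint(0, 10, size=N_REGIONS, dtype="uint32")
    z = zarr.open(STORE, mode="r+")
    # obs_dim chunks are one cell wide, so this write covers whole chunks
    # and needs no locking or later merge/readback step
    z[cell_idx, :] = counts


if __name__ == "__main__":
    # create the sink once, with its final shape and per-cell chunking
    zarr.open(STORE, mode="w", shape=(N_CELLS, N_REGIONS),
              chunks=(1, N_REGIONS), dtype="uint32")
    with ProcessPoolExecutor() as pool:
        list(pool.map(write_one_cell, range(N_CELLS)))
```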

theAeon commented Feb 6, 2025

Looks like MCDS.open doesn't like the outputs of these (it fails on the obs_dim lookup with a KeyError), so... bear with me, I guess.

theAeon commented Feb 16, 2025

Siloing off the zarr v3 stuff into a PR for a later date; it was more broken than I thought.

theAeon commented Feb 17, 2025

Okay, I can confirm it actually works now.

theAeon commented Feb 17, 2025

Well, it works in the single-chunk, small-dataset case. I don't know whether increasing the chunk size broke it or whether large datasets don't work. Not great either way; testing single-chunk right now.

theAeon commented Feb 17, 2025

Judging by my script actually taking the time to process my data this go-around, I think it's reasonable to say that this PR works with a chunk size of 1 and no other.

That being said, I don't actually know that there's an advantage to multi-cell chunks once you cut the copy step out of the equation, since the cells are fully independent.
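For what it's worth, a hypothetical illustration of that constraint (made-up shapes, not ALLCools code): with multi-cell obs_dim chunks, concurrent writers are only safe when each worker owns a whole, chunk-aligned block of cells, whereas with a chunk size of 1 every per-cell write already owns its chunk.

```python
# Hypothetical illustration of chunk alignment: each block below spans exactly
# one obs_dim chunk, so blocks could be handed to separate workers without two
# workers ever rewriting the same chunk.
import numpy as np
import zarr

N_CELLS, N_REGIONS, CELLS_PER_CHUNK = 64, 1_000, 8
z = zarr.open("multicell.zarr", mode="w", shape=(N_CELLS, N_REGIONS),
              chunks=(CELLS_PER_CHUNK, N_REGIONS), dtype="uint32")

for start in range(0, N_CELLS, CELLS_PER_CHUNK):
    block = np.ones((CELLS_PER_CHUNK, N_REGIONS), dtype="uint32")  # stand-in counts
    # chunk-aligned slice: safe to parallelize; a slice straddling a chunk
    # boundary would force two writers to touch the same chunk
    z[start:start + CELLS_PER_CHUNK, :] = block
```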

lhqing commented Feb 21, 2025

Hi @theAeon, thank you for the PR, and sorry I didn't follow this promptly. I'm not actively working on ALLCools right now, and would only merge bug fixes while trying not to make breaking changes or add new functions.

I'm not exactly sure what caused your use case to be so slow; we have previously generated MCDS for hundreds of thousands of cells with acceptable speed. Do you want to provide some details about your use case?

> Specifically, a 60-cell test dataset that was taking several hours in the zarr merge phase completed in 5 minutes, with no I/O-limited copy step.

theAeon commented Feb 21, 2025

If I'm remembering correctly, the difference in our use case is that our region definition is ~400k tiny sites, which makes the output of write_single_zarr much larger and causes I/O hangs.

As to breaking changes, well, the goal here is not to break anything, which is why I brought up the chunk-size thing! Still working on validation; my most recent attempt didn't actually work, but I also forgot to apply an unrelated patch, so I'll keep you posted on that one.

lhqing commented Feb 24, 2025

I see. Yes, MCDS was not intended to handle hundreds of thousands of small features/mC sites; a much larger feature number may cause performance issues.

We had similar needs previously, so I wrote this BaseDS class, which aims to combine a bunch of ALLC files at single-BP level across the genome. Feel free to check whether it is relevant to you.
https://github.com/lhqing/ALLCools/blob/c9f7be2ffd650c1a5430d2f6fc252c60f0c28e33/ALLCools/count_matrix/base_ds.py#L559C5-L559C21

theAeon commented Feb 24, 2025

I think (although I could be incorrect) that the data we're working with is a combination of single-BP sites and very small regions of several BP, so if I can get MCDS to work, I would like it to.

Current status of my run-through is that it appears to be doing the calculations as expected but is not actually writing the data to disk, so I think I may have broken something else when I added the dictionary of regiongroups. Will keep you posted, as usual.

theAeon commented Feb 28, 2025

A new update on this: it appears that the no-output issue I'm having is not due to chunk size or the dictionary of regiongroups. It seems that above some number of cells, it's not outputting anything at all.

...edit: or not, it just failed on a small set too. Not sure what's happening.

theAeon commented Feb 28, 2025

...oh: malformed allc_table. Bear with me.

theAeon commented Feb 28, 2025

Okay, so it turns out that on small datasets it does in fact work as intended for both single and multiple obs_dim chunks. Will push dataset.py as I have it after confirming that it works at scale.

theAeon commented Mar 7, 2025

Looks like it's working, multi-cell chunks and all. Apologies for the delay; I was using the wrong chromosome set on my end (among other things).

theAeon commented Mar 7, 2025

Semi-aside: I did just get this working with mpi4py's MPIPoolExecutor. Not sure how you'd want me to integrate that.
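In case it helps, a rough sketch of what that looks like (the worker function is a hypothetical stand-in): mpi4py.futures.MPIPoolExecutor largely mirrors concurrent.futures.ProcessPoolExecutor, so the per-cell writes can be farmed out across MPI ranks with the same map-style call.

```python
# Hedged sketch: swap ProcessPoolExecutor for MPIPoolExecutor to spread the
# per-cell zarr writes across MPI ranks/nodes. Typically launched with
# something like: mpiexec -n <ranks> python -m mpi4py.futures this_script.py
from mpi4py.futures import MPIPoolExecutor


def write_one_cell(cell_idx: int) -> int:
    # hypothetical stand-in: compute counts for one cell and write them
    # into the shared zarr store, as in the sketch further up the thread
    return cell_idx


if __name__ == "__main__":
    with MPIPoolExecutor() as pool:
        list(pool.map(write_one_cell, range(60)))
```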
