Copy whole shards when writing v3, instead of individual chunks #6
See glencoesoftware#3. This dramatically reduces conversion time when sharding is used.
Having re-used the same conversion sequence as in #2 (comment), i.e.
time ~/zarr2zarr-0.0.1-SNAPSHOT/bin/zarr2zarr /data/idr0048/9846151.zarr/ /data/idr0048/zarr2zarr/9846151_zstd.zarr --compression zstd
time ~/zarr2zarr-0.0.1-SNAPSHOT/bin/zarr2zarr /data/idr0048/9846151.zarr/ /data/idr0048/zarr2zarr/9846151_4096x4096shards_zstd.zarr --compression zstd --shard=1,1,1,4096,4096
time ~/zarr2zarr-0.0.1-SNAPSHOT/bin/zarr2zarr /data/idr0048/9846151.zarr/ /data/idr0048/zarr2zarr/9846151_4x4096x4096shards_zstd.zarr --compression zstd --shard=1,1,4,4096,4096
time ~/zarr2zarr-0.0.1-SNAPSHOT/bin/zarr2zarr /data/idr0048/9846151.zarr/ /data/idr0048/zarr2zarr/9846151_2x2x2048x2048_zstd.zarr --compression zstd --shard=1,2,2,2048,2048
time ~/zarr2zarr-0.0.1-SNAPSHOT/bin/zarr2zarr /data/idr0048//9846151.zarr/ /data/idr0048/zarr2zarr/9846151_superchunk_zstd.zarr --compression zstd --shard=SUPERCHUNK
I had the following results with this PR included:
21:51:12.050 [main] INFO com.glencoesoftware.zarr.Convert -- opened /data/idr0048/9846151.zarr/0
21:51:12.084 [main] INFO com.glencoesoftware.zarr.Convert -- got 2 series attributes
21:51:14.512 [main] INFO com.glencoesoftware.zarr.Convert -- found 6 resolutions
21:51:14.852 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/0
00:31:36.484 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/1
01:15:26.822 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/2
01:28:16.501 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/3
01:32:29.707 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/4
01:34:52.658 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/5
real 225m36.998s
user 77m43.101s
sys 3m21.589s
01:36:47.146 [main] INFO com.glencoesoftware.zarr.Convert -- opened /data/idr0048/9846151.zarr/0
01:36:47.188 [main] INFO com.glencoesoftware.zarr.Convert -- got 2 series attributes
01:36:49.328 [main] INFO com.glencoesoftware.zarr.Convert -- found 6 resolutions
01:36:50.041 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/0
04:15:00.500 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/1
05:00:20.404 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/2
05:00:20.405 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
05:12:46.041 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/3
05:12:46.042 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
05:16:29.471 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/4
05:16:29.472 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
05:18:45.525 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/5
05:18:45.526 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
real 223m26.852s
user 130m20.766s
sys 3m12.832s
05:20:18.938 [main] INFO com.glencoesoftware.zarr.Convert -- opened /data/idr0048/9846151.zarr/0
05:20:18.959 [main] INFO com.glencoesoftware.zarr.Convert -- got 2 series attributes
05:20:21.360 [main] INFO com.glencoesoftware.zarr.Convert -- found 6 resolutions
05:20:21.518 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/0
07:59:17.487 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/1
08:42:50.813 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/2
08:42:50.813 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
08:54:48.696 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/3
08:54:48.696 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
08:58:23.340 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/4
08:58:23.341 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
09:00:49.126 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/5
09:00:49.126 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
real 222m6.929s
user 122m46.826s
sys 3m3.236s
09:02:21.516 [main] INFO com.glencoesoftware.zarr.Convert -- opened /data/idr0048/9846151.zarr/0
09:02:21.626 [main] INFO com.glencoesoftware.zarr.Convert -- got 2 series attributes
09:02:24.026 [main] INFO com.glencoesoftware.zarr.Convert -- found 6 resolutions
09:02:24.111 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/0
11:48:24.954 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/1
12:32:13.848 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/2
12:32:13.855 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
12:43:10.006 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/3
12:43:10.008 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
12:46:33.397 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/4
12:46:33.398 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
12:48:17.608 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/5
12:48:17.621 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
real 227m5.570s
user 123m44.328s
sys 3m29.419s
12:49:25.350 [main] INFO com.glencoesoftware.zarr.Convert -- opened /data/idr0048/9846151.zarr/0
12:49:25.372 [main] INFO com.glencoesoftware.zarr.Convert -- got 2 series attributes
12:49:27.539 [main] INFO com.glencoesoftware.zarr.Convert -- found 6 resolutions
12:49:27.591 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/0
16:04:14.599 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/1
17:05:34.875 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/2
17:23:01.680 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/3
17:28:19.165 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/4
17:30:55.622 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/5
real 282m41.054s
user 138m37.920s
sys 3m35.937s
So conversion times when specifying custom shard sizes are now directly comparable to those when using simple chunks (roughly 222 to 227 minutes, versus about 226 minutes without sharding). Only the SUPERCHUNK configuration has an increased conversion time (about 283 minutes), although I might try to reproduce this overnight.
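To make the comparison concrete, here is a minimal Python sketch that converts the `real` values reported by `time` above into seconds and expresses each run relative to the unsharded one. The mapping of timings to configurations assumes the runs were logged in the same order as the commands were listed; the function names are illustrative only.

```python
import re


def parse_real(value):
    """Convert a `time`-style duration such as '225m36.998s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value)
    if match is None:
        raise ValueError(f"unrecognised duration: {value!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# `real` timings copied from the runs above.
# Assumption: the runs appear in the same order as the commands.
RUNS = {
    "no sharding": "225m36.998s",
    "--shard=1,1,1,4096,4096": "223m26.852s",
    "--shard=1,1,4,4096,4096": "222m6.929s",
    "--shard=1,2,2,2048,2048": "227m5.570s",
    "--shard=SUPERCHUNK": "282m41.054s",
}


def relative_to_baseline(runs, baseline="no sharding"):
    """Return each run's wall-clock time as a ratio of the baseline run."""
    base = parse_real(runs[baseline])
    return {label: parse_real(value) / base for label, value in runs.items()}


if __name__ == "__main__":
    for label, ratio in relative_to_baseline(RUNS).items():
        print(f"{label:28s} {ratio:.3f}x")
```

Under that ordering assumption, the three custom shard configurations land within about 2% of the unsharded run, while SUPERCHUNK is roughly 25% slower.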
This means that shard sizes must be small enough to fit in memory, which I expect is OK for our testing at least for the moment.
At least for the above, shard sizes will be up to 64M in memory (in reality 49M with compression), which is definitely acceptable given the memory requirement. I will try to increase the number of Z in a shard to confirm this does not cause any issues.
If anything, I would assume this might start to be an issue for really large shard sizes and/or for concurrent shard writing, as multiple shards will need to be in memory at a given time.
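As a rough illustration of the memory argument, the sketch below estimates the uncompressed size of one shard from its shape, and the peak memory if several shards were buffered at once. The pixel type is not stated in this thread; 1 byte per pixel is an assumption, chosen because it reproduces the 64M figure for the 1,1,4,4096,4096 shard. The helper names and the `concurrent_shards` parameter are hypothetical and not part of zarr2zarr.

```python
from math import prod

MIB = 1024 * 1024


def shard_bytes(shard_shape, bytes_per_pixel=1):
    """Uncompressed size in bytes of one shard with the given shape.

    bytes_per_pixel=1 is an assumption; the real pixel type is not
    stated in this conversation.
    """
    return prod(shard_shape) * bytes_per_pixel


def peak_shard_memory(shard_shape, bytes_per_pixel=1, concurrent_shards=1):
    """Rough upper bound if `concurrent_shards` shards are buffered at once."""
    return shard_bytes(shard_shape, bytes_per_pixel) * concurrent_shards


if __name__ == "__main__":
    # Shard shapes taken from the --shard options used above.
    for shape in [(1, 1, 1, 4096, 4096), (1, 1, 4, 4096, 4096), (1, 2, 2, 2048, 2048)]:
        print(shape, shard_bytes(shape) // MIB, "MiB")
    # Hypothetical case: four writer threads, each holding one shard.
    print(peak_shard_memory((1, 1, 4, 4096, 4096), concurrent_shards=4) // MIB, "MiB")
```

The estimate scales linearly with both the number of planes in a shard and the number of shards held at once, which is why very large shards or concurrent writing are the cases to watch.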
Retested with the same input data and conditions as above, adding 1,1,8,4096,4096; 1,1,16,4096,4096; and 1,3,4,4096,4096 to the list of custom shard sizes, using a testing environment with an SSD volume. Conversion times are as follows:
08:12:02.409 [main] INFO com.glencoesoftware.zarr.Convert -- opened /data/idr0048/9846151.zarr/0
08:12:02.417 [main] INFO com.glencoesoftware.zarr.Convert -- got 2 series attributes
08:12:05.084 [main] INFO com.glencoesoftware.zarr.Convert -- found 6 resolutions
08:12:05.115 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/0
09:09:49.355 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/1
09:25:51.279 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/2
09:29:47.043 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/3
09:30:43.365 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/4
09:31:03.529 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/5
real 79m23.618s
user 71m20.371s
sys 3m0.407s
09:31:26.961 [main] INFO com.glencoesoftware.zarr.Convert -- opened /data/idr0048/9846151.zarr/0
09:31:26.973 [main] INFO com.glencoesoftware.zarr.Convert -- got 2 series attributes
09:31:29.415 [main] INFO com.glencoesoftware.zarr.Convert -- found 6 resolutions
09:31:29.442 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/0
10:36:14.852 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/1
10:57:02.235 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/2
10:57:02.235 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
11:00:47.651 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/3
11:00:47.652 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
11:01:44.050 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/4
11:01:44.050 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
11:02:08.027 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/5
11:02:08.027 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
real 91m3.969s
user 121m34.048s
sys 3m2.674s
11:02:28.502 [main] INFO com.glencoesoftware.zarr.Convert -- opened /data/idr0048/9846151.zarr/0
11:02:28.510 [main] INFO com.glencoesoftware.zarr.Convert -- got 2 series attributes
11:02:30.827 [main] INFO com.glencoesoftware.zarr.Convert -- found 6 resolutions
11:02:30.851 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/0
12:08:10.944 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/1
12:26:40.851 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/2
12:26:40.851 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
12:30:34.059 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/3
12:30:34.059 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
12:31:28.594 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/4
12:31:28.594 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
12:31:49.989 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/5
12:31:49.990 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
real 89m42.213s
user 112m11.676s
sys 3m4.108s
12:32:12.108 [main] INFO com.glencoesoftware.zarr.Convert -- opened /data/idr0048/9846151.zarr/0
12:32:12.115 [main] INFO com.glencoesoftware.zarr.Convert -- got 2 series attributes
12:32:14.833 [main] INFO com.glencoesoftware.zarr.Convert -- found 6 resolutions
12:32:14.855 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/0
13:40:45.401 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/1
14:02:33.175 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/2
14:02:33.175 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
14:06:36.353 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/3
14:06:36.354 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
14:07:31.194 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/4
14:07:31.194 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
14:07:51.936 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/5
14:07:51.936 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
real 96m2.120s
user 116m41.365s
sys 3m9.455s
14:08:14.287 [main] INFO com.glencoesoftware.zarr.Convert -- opened /data/idr0048/9846151.zarr/0
14:08:14.356 [main] INFO com.glencoesoftware.zarr.Convert -- got 2 series attributes
14:08:16.874 [main] INFO com.glencoesoftware.zarr.Convert -- found 6 resolutions
14:08:16.902 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/0
15:14:38.823 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/1
15:33:31.885 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/2
15:33:31.886 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
15:37:31.094 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/3
15:37:31.095 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
15:38:27.262 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/4
15:38:27.262 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
15:38:47.028 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/5
15:38:47.028 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
real 90m58.320s
user 111m13.582s
sys 2m59.239s
15:39:12.736 [main] INFO com.glencoesoftware.zarr.Convert -- opened /data/idr0048/9846151.zarr/0
15:39:12.749 [main] INFO com.glencoesoftware.zarr.Convert -- got 2 series attributes
15:39:15.349 [main] INFO com.glencoesoftware.zarr.Convert -- found 6 resolutions
15:39:15.369 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/0
16:49:31.495 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/1
17:09:15.375 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/2
17:09:15.375 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
17:13:18.603 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/3
17:13:18.603 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
17:14:18.800 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/4
17:14:18.800 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
17:14:39.693 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/5
17:14:39.693 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
real 95m51.098s
user 132m5.595s
sys 3m42.088s
17:15:05.710 [main] INFO com.glencoesoftware.zarr.Convert -- opened /data/idr0048/9846151.zarr/0
17:15:05.751 [main] INFO com.glencoesoftware.zarr.Convert -- got 2 series attributes
17:15:08.382 [main] INFO com.glencoesoftware.zarr.Convert -- found 6 resolutions
17:15:08.411 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/0
18:24:53.373 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/1
18:44:25.192 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/2
18:44:25.192 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
18:48:25.141 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/3
18:48:25.141 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
18:49:21.682 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/4
18:49:21.683 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
18:49:42.131 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/5
18:49:42.131 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
real 95m1.984s
user 123m44.123s
sys 3m29.609s
18:50:05.358 [main] INFO com.glencoesoftware.zarr.Convert -- opened /data/idr0048/9846151.zarr/0
18:50:05.374 [main] INFO com.glencoesoftware.zarr.Convert -- got 2 series attributes
18:50:08.258 [main] INFO com.glencoesoftware.zarr.Convert -- found 6 resolutions
18:50:08.278 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/0
19:59:32.100 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/1
20:21:27.305 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/2
20:27:18.252 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/3
20:28:55.101 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/4
20:29:26.040 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/5
real 99m45.236s
user 133m23.738s
sys 3m28.476s
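The time spent on each resolution can be derived from the timestamps of the "opened array" lines in any of the logs above. The following is a minimal Python sketch, using the first run of this retest as input. It assumes that each resolution is processed until the next "opened array" line appears, so the last resolution cannot be timed from the log alone; the function name is illustrative.

```python
import re
from datetime import datetime, timedelta

# "opened array" lines copied from the first run of the retest above.
LOG = """\
08:12:05.115 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/0
09:09:49.355 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/1
09:25:51.279 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/2
09:29:47.043 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/3
09:30:43.365 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/4
09:31:03.529 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0048/9846151.zarr/0/5
"""

OPENED = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3}) .* opened array (\S+)$")


def resolution_durations(log_text):
    """Seconds between consecutive 'opened array' lines, keyed by array path.

    The last array has no following line, so it is not included.
    """
    events = []
    for line in log_text.splitlines():
        match = OPENED.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%H:%M:%S.%f")
            events.append((stamp, match.group(2)))
    durations = {}
    for (start, path), (end, _) in zip(events, events[1:]):
        delta = end - start
        if delta < timedelta(0):
            # The log carries no date, so assume a single midnight rollover.
            delta += timedelta(days=1)
        durations[path] = delta.total_seconds()
    return durations


if __name__ == "__main__":
    for path, seconds in resolution_durations(LOG).items():
        print(f"{path}: {seconds / 60:.1f} min")
```

For this run, the full-resolution array accounts for most of the wall-clock time, with each subsequent resolution taking progressively less.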
Overall, this puts us in a much better situation than before when adding sharding to the codecs. As indicated previously, we can review the assumption that all codecs should be in-memory as we work on multi-threading, but for now it makes sense to merge this.