
Decompress stream #223

Open: wants to merge 8 commits into master
Conversation

joverlee521 (Contributor):

Addresses the disk space limits that we are hitting in the GISAID ingest and the full Nextclade runs by keeping the NDJSON and FASTA files compressed. The `.xz` files are opened with the `lzma` module so the data are decompressed as they stream, rather than being written to disk uncompressed. See commits for more details.

The current uncompressed gisaid.ndjson file is already more than 120GB
and will only continue to grow. Compress the downloaded data with `xz`
to save disk space. The subsequent commits will reflect changes
necessary to run `ingest-gisaid` using the compressed file.
The GISAID source file will be a compressed `.xz` file that needs to be
decompressed before counting the number of records. The GenBank source
file will remain as the raw file that does not need to be decompressed,
so keep support for both.
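To make the "keep support for both" idea concrete, here is a minimal sketch (not the PR's actual script; the source path is hypothetical) of counting records whether the source is a raw NDJSON file or an xz-compressed one, streaming the decompression instead of writing an uncompressed copy to disk:

```bash
src="data/gisaid/gisaid.ndjson.xz"   # hypothetical source path

# Count records without writing an uncompressed copy to disk:
# -T0 uses all cores, -d decompresses, -c writes to stdout, -q quiets warnings.
if [[ "$src" == *.xz ]]; then
    record_count="$(xz -T0 -dcq "$src" | wc -l)"
else
    record_count="$(wc -l < "$src")"
fi
echo "records: $record_count"
```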
@ivan-aksamentov (Member) left a comment:

Engineering-wise compelling, but I don't think we need this. Infrastructure should be adjusted instead.

Why would a $50,000 96-core 4-socket server be attached to a $200 256 GB SSD? You probably have more storage on your laptop than that.

Storage is dirt cheap and almost infinite. If I can have a NAS at home with 12 TB of NVMe SSDs + 100 TB of HDDs, I am sure that one of the world's most important pandemic projects can figure something out.

How much can GISAID realistically grow? 1 TB next year? That still fits on my laptop. I would not spend a second of my life optimizing for that.

We should make it all fast instead, so that scientists don't waste time waiting: uncompress everything and remove useless computations, not introduce more.

That's my hot take 😀🔥

@tsibley (Member) left a comment:

This seems like a worthwhile change to me. Thanks for addressing looming space issues, @joverlee521. A few notes below, but nothing blocking.

Comment on lines 20 to 23:

```diff
 if [[ "$src" == *.gz ]]; then
     gunzip -cfq
-elif [[ "$src" == *.xz ]]; then
+elif [[ "$src" == *.xz && "$dst" != *.xz ]]; then
     xz -T0 -dcq
```

For consistency, let's add the same check of $dst for *.gz too?
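For illustration, a sketch of what that consistency change might look like (not the actual file contents; the surrounding structure and fallback are assumed from the hunk above):

```bash
# Sketch only: assumed shape of the dispatch, mirroring the .xz branch, so
# that decompression is skipped whenever the destination stays compressed.
if [[ "$src" == *.gz && "$dst" != *.gz ]]; then
    gunzip -cfq
elif [[ "$src" == *.xz && "$dst" != *.xz ]]; then
    xz -T0 -dcq
else
    cat   # assumed fallback: pass the stream through untouched
fi
```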

bin/upload-to-s3 (outdated)
Comment on lines 35 to 38:

```diff
 if [[ "$dst" == *.gz ]]; then
     gzip -c "$src"
-elif [[ "$dst" == *.xz ]]; then
+elif [[ "$src" != *.xz && "$dst" == *.xz ]]; then
     xz -2 -T0 -c "$src"
```

Ditto re: including the corresponding check of $src for *.gz.
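Likewise, a hypothetical sketch of the upload side with the same guard on both branches (structure and fallback assumed from the hunk above, not the actual script):

```bash
# Sketch only: skip recompression when the source already matches the
# compressed format the destination expects; otherwise compress on upload.
if [[ "$src" != *.gz && "$dst" == *.gz ]]; then
    gzip -c "$src"
elif [[ "$src" != *.xz && "$dst" == *.xz ]]; then
    xz -2 -T0 -c "$src"
else
    cat "$src"   # assumed fallback: send the file as-is
fi
```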

```diff
@@ -81,7 +82,7 @@ if __name__ == '__main__':
         "e.g.\n\t"
         "Europe/Spain/Catalunya/Mataró\tEurope/Spain/Catalunya/Mataro\n\t")
     parser.add_argument("--output-metadata",
-        default=str( base / "data/gisaid/metadata.tsv" ),
+        default=str( base / "data/gisaid/metadata.tsv" ),
```

There are several whitespace-only changes in the diff for this file which would be better split out into a separate commit. (I assume your editor made them automatically upon save.)

@tsibley (Member) commented on Oct 20, 2021:

@ivan-aksamentov I'm sure your comment was written in good faith, but it feels to me a bit too dismissive and flippant for a CR discussion. IMO, it's ok to say "Let's not do this at all" in a CR, but I think it's important to err on the side of treading more carefully with the way in which it's said. (Or said another way, fewer "hot takes" and more "considered takes", please.)

> We should make it all fast instead, so that scientists don't waste time waiting: uncompress everything and remove useless computations, not introduce more.

I broadly agree with the sentiment that "faster is better", but I'll also note that by my count, while it shuffles some (de)compression steps around, on net this PR introduces no new compressions and only one new decompression: the reading in of gisaid.ndjson.xz with lzma.open() in transform-gisaid.

It's not clear to me whether this ends up adding to overall runtime or not. I think it'll largely depend on whether the processing speed of transform-gisaid's manipulation is faster or slower than the decompression rate of lzma.open(). We'd have to profile to see for sure.
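One quick-and-dirty way to get a feel for those two rates (a sketch with hypothetical paths and a hypothetical invocation, not a benchmark anyone has actually run here):

```bash
# Rough comparison, hypothetical paths and invocation:
# 1) pure decompression rate of the compressed NDJSON
time xz -T0 -dc data/gisaid/gisaid.ndjson.xz > /dev/null

# 2) pure processing rate of the transform over an uncompressed copy;
#    if (1) is faster than (2), streaming the .xz should not slow the pipeline.
time ./bin/transform-gisaid data/gisaid/gisaid.ndjson > /dev/null
```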

> Infrastructure should be adjusted instead.

I agree infra changes play an important role here, but I think there's room for both infra to be adjusted and for code to be made more space-conscious.

In terms of storage infra, I think the highest impact change to our Batch nodes might be making use of instance-local NVMe storage (via LVM pools) instead of further-enlarged EBS volumes for the container workloads. This is something I've been meaning to do for ages, and your comments spurred me to look into it again today and get a proof-of-concept working. I'll try to get that documented and deployed later this week or next.
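For context, a bare-bones sketch of what pooling instance-local NVMe devices with LVM can look like (device names, volume names, and the mount point are all hypothetical; the real Batch node setup would differ):

```bash
# Sketch only: pool two hypothetical instance-store NVMe devices into one
# volume group, carve out a logical volume, and mount it for container scratch.
pvcreate /dev/nvme1n1 /dev/nvme2n1
vgcreate instancestore /dev/nvme1n1 /dev/nvme2n1
lvcreate --name scratch --extents 100%FREE instancestore
mkfs.ext4 /dev/instancestore/scratch
mount /dev/instancestore/scratch /var/lib/docker
```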

@ivan-aksamentov (Member) commented on Oct 20, 2021:

Yeah, my comment ended up being super rude. I'm not sure how that happened or what triggered it. I did not realize it when I wrote it, but I feel really bad re-reading it now and being called out. I agree, this is too hot compared to what we usually do here. I am sorry to all readers, especially @tsibley and @joverlee521.

@tsibley (Member) commented on Oct 21, 2021:

Thanks, Ivan. 🙏 Communication is always a hard thing, and none of us always get it right.

FWIW, I do think there's plenty of room to disagree on approach here.

Skip the compression with `gzip` or `xz` if the source file is already a
compressed `.gz` or `.xz` file.
Skip the decompression with `gzip` or `xz` if the destination is expected to be a compressed `.gz` or `.xz` file.
Use the `lzma` module to read the compressed `gisaid.ndjson.xz` file so
that we do not have to keep the uncompressed file on disk.
Previous commits updated the various scripts called within
`ingest-gisaid` to support the compressed `gisaid.ndjson.xz` file.
Update `ingest-gisaid` to use `gisaid.ndjson.xz` instead of the
uncompressed `gisaid.ndjson` file.
Use the `lzma` module to read the compressed input FASTA so that we do
not have to take up additional disk space with the uncompressed FASTA.
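The same streaming idea in shell form, for illustration (hypothetical file name): the uncompressed FASTA only ever exists in the pipe, never on disk.

```bash
# Count sequence records in a compressed FASTA without an uncompressed copy.
xz -T0 -dcq data/sequences.fasta.xz | grep -c "^>"
```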
Labels: none yet
Projects: none (Status: In Review)
Participants: 3