
Decompress stream #223

Open: wants to merge 8 commits into master
Conversation

joverlee521 (Contributor):

Addresses the disk space limits that we are hitting in the GISAID ingest and the full Nextclade runs by keeping the NDJSON and FASTA files compressed. The `.xz` files are opened with the `lzma` module so the data are decompressed as they stream, rather than being written to disk uncompressed. See commits for more details.

The current uncompressed gisaid.ndjson file is already more than 120GB
and will only continue to grow. Compress the downloaded data with `xz`
to save disk space. The subsequent commits will reflect changes
necessary to run `ingest-gisaid` using the compressed file.
The GISAID source file will be a compressed `.xz` file that needs to be
decompressed before counting the number of records. The GenBank source
file will remain as the raw file that does not need to be decompressed,
so keep support for both.
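To make the "keep support for both" idea concrete, here is a minimal sketch (not the PR's actual script; the source path is hypothetical) of counting records whether the source is a raw NDJSON file or an xz-compressed one, streaming the decompression instead of writing an uncompressed copy to disk:

```bash
src="data/gisaid/gisaid.ndjson.xz"   # hypothetical source path

# Count records without writing an uncompressed copy to disk:
# -T0 uses all cores, -d decompresses, -c writes to stdout, -q quiets warnings.
if [[ "$src" == *.xz ]]; then
    record_count="$(xz -T0 -dcq "$src" | wc -l)"
else
    record_count="$(wc -l < "$src")"
fi
echo "records: $record_count"
```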
@ivan-aksamentov (Member) left a comment:

Engineering-wise compelling, but I don't think we need this. Infrastructure should be adjusted instead.

Why would a $50,000 96-core 4-socket server be attached to a $200 256 GB SSD? You probably have more storage on your laptop than that.

Storage is dirt cheap and almost infinite. If I can have a NAS at home with 12 TB of NVMe SSDs + 100 TB of HDDs, I am sure that one of the world's most important pandemic projects can figure something out.

How much can GISAID realistically grow? 1 TB next year? That still fits on my laptop. I would not spend a second of my life optimizing for that.

We should make it all fast instead, so that scientists don't waste time waiting: uncompress everything and remove useless computations, not introduce more.

That's my hot take 😀🔥

@tsibley (Member) left a comment:

This seems like a worthwhile change to me. Thanks for addressing looming space issues, @joverlee521. A few notes below, but nothing blocking.

Comment on lines 20 to 23:

```diff
 if [[ "$src" == *.gz ]]; then
     gunzip -cfq
-elif [[ "$src" == *.xz ]]; then
+elif [[ "$src" == *.xz && "$dst" != *.xz ]]; then
     xz -T0 -dcq
```

For consistency, let's add the same check of $dst for *.gz too?
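For illustration, a sketch of what that consistency change might look like (not the actual file contents; the surrounding structure and fallback are assumed from the hunk above):

```bash
# Sketch only: assumed shape of the dispatch, mirroring the .xz branch, so
# that decompression is skipped whenever the destination stays compressed.
if [[ "$src" == *.gz && "$dst" != *.gz ]]; then
    gunzip -cfq
elif [[ "$src" == *.xz && "$dst" != *.xz ]]; then
    xz -T0 -dcq
else
    cat   # assumed fallback: pass the stream through untouched
fi
```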

bin/upload-to-s3 (outdated)
Comment on lines 35 to 38:

```diff
 if [[ "$dst" == *.gz ]]; then
     gzip -c "$src"
-elif [[ "$dst" == *.xz ]]; then
+elif [[ "$src" != *.xz && "$dst" == *.xz ]]; then
     xz -2 -T0 -c "$src"
```

Ditto re: including the corresponding check of $src for *.gz.
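Likewise, a hypothetical sketch of the upload side with the same guard on both branches (structure and fallback assumed from the hunk above, not the actual script):

```bash
# Sketch only: skip recompression when the source already matches the
# compressed format the destination expects; otherwise compress on upload.
if [[ "$src" != *.gz && "$dst" == *.gz ]]; then
    gzip -c "$src"
elif [[ "$src" != *.xz && "$dst" == *.xz ]]; then
    xz -2 -T0 -c "$src"
else
    cat "$src"   # assumed fallback: send the file as-is
fi
```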

```diff
@@ -81,7 +82,7 @@ if __name__ == '__main__':
         "e.g.\n\t"
         "Europe/Spain/Catalunya/Mataró\tEurope/Spain/Catalunya/Mataro\n\t")
     parser.add_argument("--output-metadata",
-        default=str( base / "data/gisaid/metadata.tsv" ),
+        default=str( base / "data/gisaid/metadata.tsv" ),
```

There are several whitespace-only changes in the diff for this file which would be better split out into a separate commit. (I assume your editor made them automatically upon save.)

@tsibley (Member) commented on Oct 20, 2021:

@ivan-aksamentov I'm sure your comment was written in good faith, but it feels to me a bit too dismissive and flippant for a CR discussion. IMO, it's ok to say "Let's not do this at all" in a CR, but I think it's important to err on the side of treading more carefully with the way in which it's said. (Or said another way, fewer "hot takes" and more "considered takes", please.)

> We should make it all fast instead, so that scientists don't waste time waiting: uncompress everything and remove useless computations, not introduce more.

I broadly agree with the sentiment that "faster is better", but I'll also note that by my count, while it shuffles some (de)compression steps around, on net this PR introduces no new compressions and only one new decompression: the reading in of gisaid.ndjson.xz with lzma.open() in transform-gisaid.

It's not clear to me whether this ends up adding to overall runtime or not. I think it'll largely depend on whether the processing speed of transform-gisaid's manipulation is faster or slower than the decompression rate of lzma.open(). We'd have to profile to see for sure.
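One quick-and-dirty way to get a feel for those two rates (a sketch with hypothetical paths and a hypothetical invocation, not a benchmark anyone has actually run here):

```bash
# Rough comparison, hypothetical paths and invocation:
# 1) pure decompression rate of the compressed NDJSON
time xz -T0 -dc data/gisaid/gisaid.ndjson.xz > /dev/null

# 2) pure processing rate of the transform over an uncompressed copy;
#    if (1) is faster than (2), streaming the .xz should not slow the pipeline.
time ./bin/transform-gisaid data/gisaid/gisaid.ndjson > /dev/null
```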

> Infrastructure should be adjusted instead.

I agree infra changes play an important role here, but I think there's room for both infra to be adjusted and for code to be made more space-conscious.

In terms of storage infra, I think the highest impact change to our Batch nodes might be making use of instance-local NVMe storage (via LVM pools) instead of further-enlarged EBS volumes for the container workloads. This is something I've been meaning to do for ages, and your comments spurred me to look into it again today and get a proof-of-concept working. I'll try to get that documented and deployed later this week or next.
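For context, a bare-bones sketch of what pooling instance-local NVMe devices with LVM can look like (device names, volume names, and the mount point are all hypothetical; the real Batch node setup would differ):

```bash
# Sketch only: pool two hypothetical instance-store NVMe devices into one
# volume group, carve out a logical volume, and mount it for container scratch.
pvcreate /dev/nvme1n1 /dev/nvme2n1
vgcreate instancestore /dev/nvme1n1 /dev/nvme2n1
lvcreate --name scratch --extents 100%FREE instancestore
mkfs.ext4 /dev/instancestore/scratch
mount /dev/instancestore/scratch /var/lib/docker
```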

@ivan-aksamentov (Member) commented on Oct 20, 2021:

Yeah, my comment ended up being super rude. I'm not sure how that happened or what triggered it. I did not realize it when I wrote it, but I feel really bad re-reading it now and being called out. I agree, this is too hot compared to what we usually do here. I am sorry to all readers, especially @tsibley and @joverlee521.

@tsibley (Member) commented on Oct 21, 2021:

Thanks, Ivan. 🙏 Communication is always a hard thing, and none of us always get it right.

FWIW, I do think there's plenty of room to disagree on approach here.

Skip the compression with `gzip` or `xz` if the source file is already a
compressed `.gz` or `.xz` file.
Skip the decompression with `gzip` or `xz` if the destination is expected to be a compressed `.gz` or `.xz` file.
Use the `lzma` module to read the compressed `gisaid.ndjson.xz` file so
that we do not have to keep the uncompressed file on disk.
Previous commits updated the various scripts called within
`ingest-gisaid` to support the compressed `gisaid.ndjson.xz` file.
Update `ingest-gisaid` to use `gisaid.ndjson.xz` instead of the
uncompressed `gisaid.ndjson` file.
Use the `lzma` module to read the compressed input FASTA so that we do
not have to take up additional disk space with the uncompressed FASTA.
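The same streaming idea in shell form, for illustration (hypothetical file name): the uncompressed FASTA only ever exists in the pipe, never on disk.

```bash
# Count sequence records in a compressed FASTA without an uncompressed copy.
xz -T0 -dcq data/sequences.fasta.xz | grep -c "^>"
```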
Labels: none yet
Projects: none (Status: In Review)
Participants: 3