Decompress stream #223
base: master
Conversation
The current uncompressed gisaid.ndjson file is already more than 120GB and will only continue to grow. Compress the downloaded data with `xz` to save disk space. The subsequent commits will reflect changes necessary to run `ingest-gisaid` using the compressed file.
The GISAID source file will be a compressed `.xz` file that needs to be decompressed before counting the number of records. The GenBank source file will remain as the raw file that does not need to be decompressed, so keep support for both.
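The branching described above can be sketched as a small helper (the function name and structure are ours, not the repo's actual code): records in an NDJSON file are one per line, so counting them only requires streaming the lines, decompressing on the fly when the source is an `.xz` file.

```shell
# Hypothetical sketch: count NDJSON records, stream-decompressing only
# when the source is a compressed .xz file (GISAID); GenBank's raw
# NDJSON is read directly.
count_records() {
    local src="$1"
    if [[ "$src" == *.xz ]]; then
        # Decompress to stdout; nothing uncompressed touches disk.
        xz -T0 -dcq "$src" | wc -l
    else
        wc -l < "$src"
    fi
}
```

Because `xz -dc` writes to a pipe, the 120+ GB uncompressed file never needs to exist on disk just to get a record count.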
Engineering-wise compelling, but I don't think we need this. Infrastructure should be adjusted instead.

Why would a $50,000 96-core 4-socket server be attached to a $200 256 GB SSD? You probably have more storage on your laptop than that.

Storage is dirt cheap and almost infinite. If I can have a NAS at home with 12 TB of NVMe SSDs + 100 TB of HDDs, I am sure that one of the world's most important pandemic projects can figure something out.

How much can GISAID realistically grow? 1 TB next year? That still fits on my laptop. I would not spend a second of my life optimizing for that.

We should make it all fast instead, so that scientists don't waste time waiting: uncompress everything and remove useless computations, don't introduce more.

That's my hot take 😀🔥
This seems like a worthwhile change to me. Thanks for addressing looming space issues, @joverlee521. A few notes below, but nothing blocking.
bin/download-from-s3 (outdated)

```diff
 if [[ "$src" == *.gz ]]; then
     gunzip -cfq
-elif [[ "$src" == *.xz ]]; then
+elif [[ "$src" == *.xz && "$dst" != *.xz ]]; then
     xz -T0 -dcq
```
For consistency, let's add the same check of `$dst` for `*.gz` too?
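The suggestion above, applied to both formats, might look like this standalone sketch (the function name and `cat` fall-through are our assumptions, not the script's actual structure): decompress stdin only when the source is compressed but the destination does not itself expect the compressed file.

```shell
# Hypothetical sketch of the download-side dispatch: reads from stdin,
# writes to stdout, and skips decompression whenever the destination
# filename says it wants the compressed bytes as-is.
decompress_if_needed() {
    local src="$1" dst="$2"
    if [[ "$src" == *.gz && "$dst" != *.gz ]]; then
        gunzip -cfq
    elif [[ "$src" == *.xz && "$dst" != *.xz ]]; then
        xz -T0 -dcq
    else
        cat  # destination expects the same (possibly compressed) bytes
    fi
}
```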
bin/upload-to-s3 (outdated)

```diff
 if [[ "$dst" == *.gz ]]; then
     gzip -c "$src"
-elif [[ "$dst" == *.xz ]]; then
+elif [[ "$src" != *.xz && "$dst" == *.xz ]]; then
     xz -2 -T0 -c "$src"
```
Ditto re: including the corresponding check of `$src` for `*.gz`.
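Mirroring the download side, the upload-side dispatch with both suggested checks could be sketched as follows (again, the function name and `cat` fall-through are hypothetical): compress only when the source isn't already in the format the destination expects.

```shell
# Hypothetical sketch of the upload-side dispatch: emits on stdout the
# bytes to upload for $dst, skipping compression when $src is already
# compressed in the destination's format.
compress_if_needed() {
    local src="$1" dst="$2"
    if [[ "$src" != *.gz && "$dst" == *.gz ]]; then
        gzip -c "$src"
    elif [[ "$src" != *.xz && "$dst" == *.xz ]]; then
        xz -2 -T0 -c "$src"
    else
        cat "$src"  # already in the form the destination expects
    fi
}
```

The `-2` preset trades compression ratio for speed, which suits large, frequently re-uploaded files.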
```diff
@@ -81,7 +82,7 @@ if __name__ == '__main__':
         "e.g.\n\t"
         "Europe/Spain/Catalunya/Mataró\tEurope/Spain/Catalunya/Mataro\n\t")
     parser.add_argument("--output-metadata",
-        default=str( base / "data/gisaid/metadata.tsv" ),
+        default=str( base / "data/gisaid/metadata.tsv" ),
```
There are several whitespace-only changes in the diff for this file which would be better split out into a separate commit. (I assume your editor made them automatically upon save.)
@ivan-aksamentov I'm sure your comment was written in good faith, but it feels to me a bit too dismissive and flippant for a CR discussion. IMO, it's ok to say "Let's not do this at all" in a CR, but I think it's important to err on the side of treading more carefully with the way in which it's said. (Or said another way, fewer "hot takes" and more "considered takes", please.)
I broadly agree with the sentiment that "faster is better", but I'll also note that by my count, while it shuffles some (de)compression steps around, on net this PR introduces no new compressions and only one new decompression: the streamed reading of the compressed NDJSON. It's not clear to me if this ends up adding to overall runtime or not; I think it'll largely depend on whether the downstream processing is faster than the decompression.
I agree infra changes play an important role here, but I think there's room for both infra to be adjusted and for code to be made more space-conscious. In terms of storage infra, I think the highest impact change to our Batch nodes might be making use of instance-local NVMe storage (via LVM pools) instead of further-enlarged EBS volumes for the container workloads. This is something I've been meaning to do for ages, and your comments spurred me to look into it again today and get a proof-of-concept working. I'll try to get that documented and deployed later this week or next.

Yeah, my comment ended up being super rude. Not sure how it happened and what triggered it. I did not realize when I wrote it, but I feel really bad after re-reading it now and being called out. I agree, this is too hot compared to what we usually do here. I am sorry to all readers, especially @tsibley @joverlee521.

Thanks, Ivan. 🙏 Communication is always a hard thing, and none of us always get it right. FWIW, I do think there's plenty of room to disagree on approach here.
Skip the compression with `gzip` or `xz` if the source file is already a compressed `.gz` or `.xz` file.
Skip the decompression with `gzip` or `xz` if the destination file is expecting the compressed `.gz` or `.xz` file.
Use the `lzma` module to read the compressed `gisaid.ndjson.xz` file so that we do not have to keep the uncompressed file on disk.
Previous commits updated the various scripts called within `ingest-gisaid` to support the compressed `gisaid.ndjson.xz` file. Update `ingest-gisaid` to use `gisaid.ndjson.xz` instead of the uncompressed `gisaid.ndjson` file.
Use the `lzma` module to read the compressed input FASTA so that we do not have to take up additional disk space with the uncompressed FASTA.
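The streamed reading described in these commits could be sketched like this (a minimal reconstruction, not the PR's actual code; the helper name is ours): `lzma.open` decompresses lazily as the file object is iterated, so the uncompressed NDJSON never has to exist on disk.

```python
import json
import lzma

def read_ndjson(path):
    """Yield one parsed record per NDJSON line (hypothetical helper)."""
    # lzma.open in text mode ("rt") decompresses incrementally as lines
    # are read; a plain file is opened directly with open().
    opener = lzma.open if path.endswith(".xz") else open
    with opener(path, "rt") as fh:
        for line in fh:
            yield json.loads(line)
```

The same pattern applies to the compressed input FASTA: wrap the `.xz` file in `lzma.open` and hand the resulting text stream to the existing line-oriented parser.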
Force-pushed from 934a2b4 to 7e4ff36
Addresses the disk space limits that we are hitting in the GISAID ingest and the full Nextclade runs by keeping the NDJSON and FASTA files compressed. The `.xz` files are opened with the `lzma` module to decompress the data as they are streamed. See commits for more details.