Tweak extract values based on Echo Runs [VS-1432] #8979
Conversation
@@ -172,6 +172,7 @@ You can take advantage of our existing sub-cohort WDL, `GvsExtractCohortFromSamp
- Specify the same `call_set_identifier`, `dataset_name`, `project_id`, `extract_table_prefix`, and `interval_list` that were used in the `GvsPrepareRangesCallset` run documented above.
- Specify the `interval_weights_bed` appropriate for the PGEN extraction run you are performing. `gs://gvs_quickstart_storage/weights/gvs_full_vet_weights_1kb_padded_orig.bed` is the interval weights BED used for Quickstart.
- Select the workflow option "Retry with more memory" and choose a "Memory retry factor" of 1.5
- Set the `extract_maxretries_override` input to 5, `split_intervals_disk_size_override` to 1000, `scatter_count` to 25000, and `y_bed_weight_scaling` to 8 to start; you will likely have to adjust one or more of these values in subsequent attempts.
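As a rough illustration of how those starting values might be supplied together, here is a minimal Python sketch that writes a Cromwell/Terra-style inputs JSON. The workflow name prefix `GvsExtractCallsetPgen` and the output filename are assumptions made for the example, not something taken from this PR; substitute the name of the PGEN extract workflow you are actually submitting.

```python
import json

# Hypothetical workflow name prefix; replace with the actual PGEN extract workflow name.
WORKFLOW = "GvsExtractCallsetPgen"

# Starting-point overrides suggested above; expect to tune these on later attempts.
overrides = {
    f"{WORKFLOW}.extract_maxretries_override": 5,
    f"{WORKFLOW}.split_intervals_disk_size_override": 1000,
    f"{WORKFLOW}.scatter_count": 25000,
    f"{WORKFLOW}.y_bed_weight_scaling": 8,
    # Quickstart interval weights BED referenced above.
    f"{WORKFLOW}.interval_weights_bed": (
        "gs://gvs_quickstart_storage/weights/gvs_full_vet_weights_1kb_padded_orig.bed"
    ),
}

with open("pgen_extract_overrides.inputs.json", "w") as fh:
    json.dump(overrides, fh, indent=2)
```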
This all looks good, especially since you're drawing on real-world experience from Echo.
Do we have any record of the thought process behind these values? I know at one point there was a doc?
Anyway, LGTM, but why these particular numbers are ideal still eludes me.
Off the top of my head:
- `split_intervals_disk_size_override` to 1000 because otherwise we run out of disk space 😭
- `scatter_count` - this one I think may vary depending on the interval list? Pretty sure the default 34K is almost guaranteed to give us Cromwell problems (and maybe that should be changed)
- `y_bed_weight_scaling` - we were doing both X and Y at a scale factor of 4 and Y shards were still the laggards. We might even go higher than 8, but we didn't actually try that during Echo.
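To make the `y_bed_weight_scaling` point concrete, here is a hypothetical sketch of the effect being described: multiplying the per-interval weights for chrY in the weights BED so the scatter logic cuts chrY into more, smaller shards. The four-column BED layout and the helper function are illustrative assumptions, not the actual GVS implementation.

```python
# Illustrative only: assumes the weights BED has columns chrom, start, end, weight,
# and that y_bed_weight_scaling simply multiplies the weight of chrY intervals so the
# interval scheduler splits chrY into more, smaller shards.
def scale_y_weights(bed_lines, scaling=8):
    scaled = []
    for line in bed_lines:
        chrom, start, end, weight = line.rstrip("\n").split("\t")
        if chrom in ("chrY", "Y"):
            weight = str(int(float(weight) * scaling))
        scaled.append("\t".join((chrom, start, end, weight)))
    return scaled

# Example: an 8x scaling turns a chrY interval weighted 1200 into 9600.
print(scale_y_weights(["chrY\t0\t1000\t1200"]))  # ['chrY\t0\t1000\t9600']
```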
LGTM. Seems superficially odd that we ultimately want to reduce the disk size for pgen on larger callsets, but I trust the results of your analysis
Disk size is less of a concern because (at least with GCP, historically) going too low on disk size risks much slower I/O without commensurate cost savings.
Yes, that's what I thought as well. That's why it seemed a little odd that this PR includes a change that lowers the default disk size on the VMs, when not otherwise specified, from 500 to 200. I always thought going too low was the worry, and that disk size in general isn't much of a concern, but this PR appears to lower the default disk size by 60%, unless I'm reading it completely wrong.
The logs showed a max of 7% disk usage, so it seemed fair to reduce it somewhat. But your comment reminded me that I also meant to adjust the
No VCF extract changes?
It looked like running it with the 1.5 "Retry with more memory" factor did the trick, unless you have other info?