Rewrite workflow outputs using gcloud #31
Pull request overview
This PR refactors the workflow to use gcloud for all file outputs instead of Hail's write operations, and disables gcloud's multipart uploads. This allows the workflow to run with standard service account permissions instead of requiring full permissions, enabling safer automated cron job execution.
Key Changes
- Introduced a new utility function `make_me_a_job` that standardizes job creation and disables parallel composite uploads for gcloud
- Replaced all `batch_instance.write_output()` calls with direct `gcloud storage cp` commands in job definitions
- Updated bcftools from version 1.21 to 1.22
- Version bumped to 2.2.6 across all configuration files
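As a rough illustration of the first two changes, the sketch below builds the kind of shell snippet a job might run: it first turns off parallel composite uploads, then copies outputs with `gcloud storage cp`. The function name `gcloud_copy_command` and the bucket path in the usage note are hypothetical and are not taken from this PR; the actual `make_me_a_job` helper may configure gcloud differently. The only gcloud features assumed are the `storage/parallel_composite_upload_enabled` property and the `gcloud storage cp` command.

```python
import shlex

# gcloud property controlling parallel composite (multipart) uploads.
DISABLE_COMPOSITE_UPLOADS = 'gcloud config set storage/parallel_composite_upload_enabled False'


def gcloud_copy_command(sources: list[str], destination: str) -> str:
    """Build a two-line shell snippet: disable composite uploads, then copy.

    Hypothetical helper for illustration only; it is not the PR's make_me_a_job.
    """
    if not sources:
        raise ValueError('at least one source file is required')
    # Quote each path so spaces or shell metacharacters are passed literally.
    quoted_sources = ' '.join(shlex.quote(src) for src in sources)
    copy_line = f'gcloud storage cp {quoted_sources} {shlex.quote(destination)}'
    return f'{DISABLE_COMPOSITE_UPLOADS}\n{copy_line}'
```

For example, `gcloud_copy_command(['clinvar_decisions.tsv'], 'gs://example-bucket/out/')` returns a string that could be passed to a job's command, with the config line placed before the copy so the setting is already in effect when the upload starts.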
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| src/clinvarbitration/cpg_internal/utils.py | New utility function to create jobs with gcloud configuration |
| src/clinvarbitration/jobs/GenerateNewClinvarSummary.py | Refactored to use gcloud for output copying instead of batch write_output |
| src/clinvarbitration/jobs/Pm5TableGeneration.py | Changed to manually copy outputs via gcloud, removed write_output |
| src/clinvarbitration/jobs/AnnotateClinvarSnvsWithBcftools.py | Updated to use gcloud for output, added chromosome name unification option |
| src/clinvarbitration/jobs/CopyLatestClinvarFiles.py | Streamlined to pipe wget output directly to gcloud |
| src/clinvarbitration/jobs/PackageForRelease.py | Switched from write_output to manual gcloud copy |
| src/clinvarbitration/stages.py | Updated to handle new output structure and argument passing |
| src/clinvarbitration/scripts/resummarise_clinvar.py | Import order reorganization per new isort config |
| src/clinvarbitration/scripts/clinvar_by_codon.py | Import order reorganization and minor formatting cleanup |
| pyproject.toml | Version bump, pinned cpg-flow version, added custom isort sections |
| Dockerfile and src/clinvarbitration/cpg_internal/Dockerfile | Upgraded bcftools to 1.22, cleaned up duplicate file copy |
| .github/workflows/docker.yaml | Version number update |
| src/clinvarbitration/__init__.py | Version number update |
    gcloud storage cp \\
        clinvar_decisions.ht.tar \\
        "${{BATCH_TMPDIR}}/clinvar_decisions.vcf.bgz*" \\
        ${{BATCH_TMPDIR}}/clinvar_decisions.tsv \\
Copilot AI, Dec 3, 2025
The gcloud storage cp command with wildcards may fail silently if no files match the pattern. If clinvar_decisions.vcf.bgz* doesn't match any files (e.g., if the VCF and index weren't created), the command will succeed without copying those files. Consider adding error checking or using explicit file paths for both clinvar_decisions.vcf.bgz and clinvar_decisions.vcf.bgz.tbi.
Suggested change:

    # Check for existence of VCF and index files before copying
    if [ ! -f "${{BATCH_TMPDIR}}/clinvar_decisions.vcf.bgz" ]; then
        echo "Error: VCF file not found: ${{BATCH_TMPDIR}}/clinvar_decisions.vcf.bgz" >&2
        exit 1
    fi
    if [ ! -f "${{BATCH_TMPDIR}}/clinvar_decisions.vcf.bgz.tbi" ]; then
        echo "Error: VCF index file not found: ${{BATCH_TMPDIR}}/clinvar_decisions.vcf.bgz.tbi" >&2
        exit 1
    fi
    gcloud storage cp \\
        clinvar_decisions.ht.tar \\
        "${{BATCH_TMPDIR}}/clinvar_decisions.vcf.bgz" \\
        "${{BATCH_TMPDIR}}/clinvar_decisions.vcf.bgz.tbi" \\
        "${{BATCH_TMPDIR}}/clinvar_decisions.tsv" \\
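The same existence check could also be expressed in Python, for instance in a script that validates outputs before any upload is attempted. The sketch below is only an illustration of the reviewer's idea under that assumption: `missing_outputs` is a hypothetical helper, not part of this PR, and the temporary directory stands in for `BATCH_TMPDIR`.

```python
import tempfile
from pathlib import Path


def missing_outputs(directory: Path, expected_names: list[str]) -> list[str]:
    """Return the expected file names that are not regular files in `directory`."""
    return [name for name in expected_names if not (directory / name).is_file()]


# Demo: a stand-in for BATCH_TMPDIR where the VCF exists but its index does not.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / 'clinvar_decisions.vcf.bgz').touch()
    absent = missing_outputs(
        tmp_dir,
        ['clinvar_decisions.vcf.bgz', 'clinvar_decisions.vcf.bgz.tbi'],
    )
    print(absent)  # ['clinvar_decisions.vcf.bgz.tbi']
```

Listing every expected file explicitly, rather than relying on a wildcard, means a missing index is reported by name before the copy step runs.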
EddieLF left a comment
LGTM
Purpose
full permissions
Vindication! Kinda, I ran this workflow in a few pieces due to a config issue: https://batch.hail.populationgenomics.org.au/batches/1122872
Checklist
bump2version