Skip to content

Conversation

@MattWellie
Copy link
Collaborator

@MattWellie MattWellie commented Dec 3, 2025

Purpose

  • The final form of Avoid hail multipart copy #30 - all Hail writes are manually removed, and gcloud's multipart uploads are disabled. This should mean the whole workflow is safe to run with a standard SA.
  • Kinda important for this workflow, as I'd like to set it off automatically as a cron job, and don't want that to be using full permissions

Vindication! Kinda, I ran this workflow in a few pieces due to a config issue: https://batch.hail.populationgenomics.org.au/batches/1122872

Checklist

  • Related GitHub Issue created
  • Linting checks pass
  • Bump the version number with bump2version

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the workflow to use gcloud for all file outputs instead of Hail's write operations, and disables gcloud's multipart uploads. This allows the workflow to run with standard service account permissions instead of requiring full permissions, enabling safer automated cron job execution.

Key Changes

  • Introduced a new utility function make_me_a_job that standardizes job creation and disables parallel composite uploads for gcloud
  • Replaced all batch_instance.write_output() calls with direct gcloud storage cp commands in job definitions
  • Updated bcftools from version 1.21 to 1.22
  • Version bumped to 2.2.6 across all configuration files

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
src/clinvarbitration/cpg_internal/utils.py New utility function to create jobs with gcloud configuration
src/clinvarbitration/jobs/GenerateNewClinvarSummary.py Refactored to use gcloud for output copying instead of batch write_output
src/clinvarbitration/jobs/Pm5TableGeneration.py Changed to manually copy outputs via gcloud, removed write_output
src/clinvarbitration/jobs/AnnotateClinvarSnvsWithBcftools.py Updated to use gcloud for output, added chromosome name unification option
src/clinvarbitration/jobs/CopyLatestClinvarFiles.py Streamlined to pipe wget output directly to gcloud
src/clinvarbitration/jobs/PackageForRelease.py Switched from write_output to manual gcloud copy
src/clinvarbitration/stages.py Updated to handle new output structure and argument passing
src/clinvarbitration/scripts/resummarise_clinvar.py Import order reorganization per new isort config
src/clinvarbitration/scripts/clinvar_by_codon.py Import order reorganization and minor formatting cleanup
pyproject.toml Version bump, pinned cpg-flow version, added custom isort sections
Dockerfile and src/clinvarbitration/cpg_internal/Dockerfile Upgraded bcftools to 1.22, cleaned up duplicate file copy
.github/workflows/docker.yaml Version number update
src/clinvarbitration/__init__.py Version number update

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +41 to +44
gcloud storage cp \\
clinvar_decisions.ht.tar \\
"${{BATCH_TMPDIR}}/clinvar_decisions.vcf.bgz*" \\
${{BATCH_TMPDIR}}/clinvar_decisions.tsv \\
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The gcloud storage cp command with wildcards may fail silently if no files match the pattern. If clinvar_decisions.vcf.bgz* doesn't match any files (e.g., if the VCF and index weren't created), the command will succeed without copying those files. Consider adding error checking or using explicit file paths for both clinvar_decisions.vcf.bgz and clinvar_decisions.vcf.bgz.tbi.

Suggested change
gcloud storage cp \\
clinvar_decisions.ht.tar \\
"${{BATCH_TMPDIR}}/clinvar_decisions.vcf.bgz*" \\
${{BATCH_TMPDIR}}/clinvar_decisions.tsv \\
# Check for existence of VCF and index files before copying
if [ ! -f "${{BATCH_TMPDIR}}/clinvar_decisions.vcf.bgz" ]; then
echo "Error: VCF file not found: ${{BATCH_TMPDIR}}/clinvar_decisions.vcf.bgz" >&2
exit 1
fi
if [ ! -f "${{BATCH_TMPDIR}}/clinvar_decisions.vcf.bgz.tbi" ]; then
echo "Error: VCF index file not found: ${{BATCH_TMPDIR}}/clinvar_decisions.vcf.bgz.tbi" >&2
exit 1
fi
gcloud storage cp \\
clinvar_decisions.ht.tar \\
"${{BATCH_TMPDIR}}/clinvar_decisions.vcf.bgz" \\
"${{BATCH_TMPDIR}}/clinvar_decisions.vcf.bgz.tbi" \\
"${{BATCH_TMPDIR}}/clinvar_decisions.tsv" \\

Copilot uses AI. Check for mistakes.
Copy link

@EddieLF EddieLF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@MattWellie MattWellie merged commit b73d9ba into main Dec 3, 2025
5 checks passed
@MattWellie MattWellie deleted the rewrite_cpg_outputs_using_gcloud branch December 3, 2025 01:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants