Add relatedness test workflow #29
base: main
New file (+45 lines):

```yaml
# REQUIRED: reference + resources
ref_fasta: /path/to/GRCh38.fa

somalier:
  sites_vcf: /path/to/somalier-sites-hg38.vcf.bgz  # common 0.5-MAF SNP panel (bgzipped + .tbi)
  # If you only have BAM/CRAMs, Somalier will extract from BAM with --fasta {ref_fasta}.
  # If you have VCFs with germline genotypes, Somalier will extract from the VCF.
  genome_build: GRCh38

picard:
  # HAPLOTYPE_MAP from the GATK resource bundle (haplotype map for fingerprinting)
  haplotype_map: /path/to/haplotype_map_hg38.vcf.gz

conpair:
  # Conpair site lists (use the hg38 bundle)
  snp_positions_bed: /path/to/Conpair/snp_positions_hg38.bed
  common_sites_vcf: /path/to/Conpair/common_snps_hg38.vcf.gz
  min_baseq: 20
  min_mapq: 10

# INPUTS (either BAM/CRAM or VCF per sample; you can mix)
samples:
  # sample_id: {bam: "..."} OR {vcf: "..."}
  HG001_N1: {bam: /data/HG001/normal1.bam}
  HG001_N2: {bam: /data/HG001/normal2.bam}
  PT1_T: {bam: /data/PT1/tumor.bam}
  PT1_N: {bam: /data/PT1/normal.bam}
  FATHER: {vcf: /data/family/joint.vcf.gz}  # joint or singleton VCF allowed
  MOTHER: {vcf: /data/family/joint.vcf.gz}
  CHILD: {vcf: /data/family/joint.vcf.gz}

# EXPECTED RELATIONSHIPS (used by the final report for assertions)
# relationship ∈ {identical, duplicate, tumor_normal, parent_child, siblings, unrelated}
expected:
  - {relationship: identical, samples: [HG001_N1, HG001_N2]}
  - {relationship: tumor_normal, samples: [PT1_T, PT1_N]}
  - {relationship: parent_child, samples: [FATHER, CHILD]}
  - {relationship: parent_child, samples: [MOTHER, CHILD]}
  - {relationship: siblings, samples: [FATHER, MOTHER], note: "should NOT be siblings -> expect fail"}  # example negative check

# OPTIONAL: run peddy on a joint VCF (pedigree checks)
peddy:
  enabled: false
  joint_vcf: /data/family/joint.vcf.gz
  ped: /data/family/family.ped
```
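Since each sample may supply either a BAM/CRAM or a VCF, a quick sanity check at load time avoids confusing downstream failures. A minimal sketch, assuming the config is saved as `config.yaml` (the filename and the check itself are illustrative, not part of this PR):

```python
import yaml

with open("config.yaml") as fh:
    cfg = yaml.safe_load(fh)

# Every sample must provide either a "bam" or a "vcf" path (mixing is allowed).
for sample, entry in cfg["samples"].items():
    if not (entry.get("bam") or entry.get("vcf")):
        raise ValueError(f"sample {sample!r} needs a 'bam' or 'vcf' path")

# Every expected-relationship entry must reference declared samples.
for exp in cfg.get("expected", []):
    for s in exp["samples"]:
        if s not in cfg["samples"]:
            raise ValueError(f"expected relationship references unknown sample {s!r}")
```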
New file (+11 lines):

```yaml
name: conpair
channels: [conda-forge, bioconda]
dependencies:
  - python=3.9
  - samtools
  - pysam
  - pandas
  - numpy
  - pip
  - pip:
      - git+https://github.com/nygenome/Conpair.git
```
New file (+7 lines):

```yaml
name: peddy
channels: [conda-forge, bioconda]
dependencies:
  - peddy
  - cython
  - pandas
  - python>=3.9
```
New file (+5 lines):

```yaml
name: picard
channels: [conda-forge, bioconda]
dependencies:
  - picard=3.*
  - openjdk=17
```
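This environment backs the fingerprinting step whose outputs the report script parses. For orientation, a minimal sketch of a `CrosscheckFingerprints` invocation consistent with the config above (the output filenames are illustrative, not taken from this PR):

```python
import subprocess

# Compare fingerprints across all input BAMs against the haplotype map.
cmd = [
    "picard", "CrosscheckFingerprints",
    "HAPLOTYPE_MAP=/path/to/haplotype_map_hg38.vcf.gz",
    "OUTPUT=crosscheck.metrics.txt",            # consumed as picard_metrics below
    "MATRIX_OUTPUT=crosscheck.lod_matrix.txt",  # consumed as picard_matrix below
    "CROSSCHECK_BY=FILE",
]
for bam in ["/data/HG001/normal1.bam", "/data/HG001/normal2.bam"]:
    cmd.append(f"INPUT={bam}")
subprocess.run(cmd, check=True)
```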
New file (+8 lines):

```yaml
name: relatedness-report
channels: [conda-forge, bioconda]
dependencies:
  - python>=3.9
  - pandas
  - numpy
  - jinja2
  - pyyaml
```
New file (+8 lines):

```yaml
name: somalier
channels: [conda-forge, bioconda]
dependencies:
  - somalier=0.2*
  - htslib
  - bcftools
  - samtools
  - python>=3.9
```
New file (+203 lines):

```python
import os
import sys

import numpy as np
import pandas as pd
import yaml
from jinja2 import Template

# Snakemake script: aggregate Somalier, Picard Crosscheck, and Conpair
# outputs into a combined TSV + HTML relatedness report.
som_pairs = snakemake.input.som_pairs
som_groups = snakemake.input.som_groups
picard_metrics = snakemake.input.picard_metrics
picard_matrix = snakemake.input.picard_matrix
conpair_files = snakemake.input.get("conpair", [])
out_tsv = snakemake.output.tsv
out_html = snakemake.output.html
cfg_path = snakemake.params.cfg

with open(cfg_path) as fh:
    CFG = yaml.safe_load(fh)

# ---------------- Somalier ----------------
sp = pd.read_csv(som_pairs, sep="\t")
# Normalize column lookup; expected columns include: sample_a, sample_b,
# relatedness, ibs0, ibs2, hom_concordance, het_concordance, etc.
sp_cols = {c.lower(): c for c in sp.columns}

def col(name):
    return sp_cols.get(name, name)

# Fast lookup for any pair (unordered)
def key(a, b):
    return tuple(sorted([a, b]))

sp["pair_key"] = [key(a, b) for a, b in zip(sp[col("sample_a")], sp[col("sample_b")])]
sp_pairs = {k: row for k, row in sp.set_index("pair_key").iterrows()}

# ---------------- Picard Crosscheck ----------------
# metrics.txt has columns like: LEFT_FILE, RIGHT_FILE, RESULT, LOD_SCORE, etc.
pm = pd.read_csv(picard_metrics, sep="\t", comment="#")
# Map file path -> sample name from the config.
path2sample = {}
for s, ent in CFG["samples"].items():
    p = ent.get("bam") or ent.get("vcf")
    if p:
        path2sample[os.path.abspath(p)] = s

def file2sample(p):
    a = os.path.abspath(p)
    # Picard sometimes normalizes paths; try a basename fallback.
    if a in path2sample:
        return path2sample[a]
    base = os.path.basename(a)
    for k, v in path2sample.items():
        if os.path.basename(k) == base:
            return v
    return base

pm["left_samp"] = pm["LEFT_FILE"].map(file2sample)
pm["right_samp"] = pm["RIGHT_FILE"].map(file2sample)
pm["pair_key"] = [key(a, b) for a, b in zip(pm["left_samp"], pm["right_samp"])]

# Collapse to the best row (maximum absolute LOD) per pair.
pm_best = pm.loc[pm.groupby("pair_key")["LOD_SCORE"].apply(lambda s: s.abs().idxmax())]

picard_pairs = {k: row for k, row in pm_best.set_index("pair_key").iterrows()}

# ---------------- Conpair (tumor/normal) ----------------
con_df = []
for f in conpair_files:
    if not os.path.exists(f):
        continue
    try:
        df = pd.read_csv(f, sep="\t")
    except Exception:
        # Some Conpair versions write CSV-like output; be lenient.
        df = pd.read_csv(f)
    df.columns = [c.lower() for c in df.columns]
```

(The remaining lines of the 203-line script are truncated in this diff view.)
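One detail worth noting in the script above: `key()` sorts each pair, so both metric tables are indexed by an unordered pair key. A small usage sketch, reusing sample names from the example config:

```python
# Lookup order does not matter: both spellings hit the same row.
k1 = key("PT1_T", "PT1_N")
k2 = key("PT1_N", "PT1_T")
assert k1 == k2 == ("PT1_N", "PT1_T")

somalier_row = sp_pairs.get(k1)    # Somalier metrics for the pair, or None
picard_row = picard_pairs.get(k1)  # best-|LOD| Picard row for the pair, or None
```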
Copilot (AI) commented on Sep 9, 2025:

> This code assumes the directory structure always contains exactly two elements when split on `__`, but doesn't handle cases where the split results in fewer than two elements, which would cause an `IndexError`.

Suggested change:

```diff
-df["sample_a"] = pair[0]
-df["sample_b"] = pair[1]
+if len(pair) >= 2:
+    df["sample_a"] = pair[0]
+    df["sample_b"] = pair[1]
+else:
+    # Fallback: assign the whole name to sample_a, empty string to sample_b
+    df["sample_a"] = pair[0]
+    df["sample_b"] = ""
```
Copilot (AI) commented on Sep 9, 2025:

> The condition `pic["lod"] < 0 and pic["lod"] <= THRESH["picard_mismatch_lod"]` is redundant, since `THRESH["picard_mismatch_lod"]` is -5.0: the first check (`< 0`) is always true whenever the second (`<= -5.0`) holds. Simplify to `pic["lod"] <= THRESH["picard_mismatch_lod"]`.

Suggested change:

```diff
-if pic is not None and pic["lod"] < 0 and pic["lod"] <= THRESH["picard_mismatch_lod"]:
+if pic is not None and pic["lod"] <= THRESH["picard_mismatch_lod"]:
```
Review comment:

> The `use rule ... with:` syntax should use `input:` instead of `wildcards:` for expanding over multiple wildcard combinations. The current syntax is incorrect and will likely cause Snakemake execution errors.
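For reference, a minimal sketch of the pattern this comment asks for (rule, path, and variable names are hypothetical, since the Snakefile hunk is not shown here): override the inherited rule's `input:` with an `expand()` over the pair list rather than trying to set `wildcards:` directly.

```snakemake
# Snakemake rule inheritance (hypothetical names): aggregate over all pairs
# by overriding input:, letting Snakemake derive the wildcards per output.
use rule conpair_concordance as conpair_concordance_all with:
    input:
        expand("results/conpair/{pair}/concordance.txt", pair=PAIRS)
```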