B: Creating PanGenie input VCFs from haplotype‐resolved assemblies

Jana Ebler edited this page Aug 22, 2023 · 3 revisions

We have written a pipeline that calls variants from haplotype-resolved assemblies of human samples and generates a graph-VCF to be used as input to PanGenie. The pipeline is available here: https://bitbucket.org/jana_ebler/vcf-merging/src/master/pangenome-graph-from-assemblies/. It produces two output VCFs: a multi-allelic graph-VCF and a bi-allelic callset-VCF, both formatted as described in detail in the section "Genotyping variation nested inside of bubbles".

The graph-VCF can be used as input to PanGenie to genotype graph bubbles:

# run PanGenie (v3.0.0) preprocessing
PanGenie-index -v <graph-vcf> -r <reference-genome> -t 24 -o index

# run PanGenie (v3.0.0) on a specific sample (using 24 cores), produces genotyped VCF "pangenie_genotyping.vcf".
# to genotype multiple samples, run this command on each sample separately. PanGenie-index needs to be run only once.
PanGenie -f index -i <input-reads> -o pangenie -j 24 -t 24
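Because the index is built once and then reused for every sample, genotyping a cohort comes down to a loop over the read files. The following is a minimal sketch of such a loop, not part of the pipeline itself: the function name `genotype_samples`, the `PANGENIE` override variable and the example read file names are hypothetical, and it assumes that each output prefix should be the read file's base name. The PanGenie flags are the ones used in the command above.

```shell
# Hypothetical helper: genotype several samples against one shared index.
# Usage: genotype_samples <index-prefix> <reads-1> [<reads-2> ...]
genotype_samples() {
    index_prefix="$1"
    shift
    for reads in "$@"; do
        # Derive the output prefix from the read file name,
        # e.g. reads/NA12878.fastq -> NA12878 (assumed naming scheme).
        sample=$(basename "$reads")
        sample=${sample%%.*}
        # PANGENIE can be set to another executable (e.g. "echo" for a dry run);
        # by default the PanGenie binary on the PATH is used.
        ${PANGENIE:-PanGenie} -f "$index_prefix" -i "$reads" -o "$sample" -j 24 -t 24
    done
}

# Example call (hypothetical file names), after PanGenie-index has been run once:
# genotype_samples index reads/NA12878.fastq reads/HG00733.fastq
```

With this naming scheme, each sample would get its own output file, for example "NA12878_genotyping.vcf", following the same `<prefix>_genotyping.vcf` pattern as in the single-sample command.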

The callset-VCF can then be used to convert the bubble genotypes into genotypes for all variant alleles nested inside of bubbles:

cat pangenie_genotyping.vcf | python3 convert-to-biallelic.py <callset-VCF> > pangenie_genotyping_biallelic.vcf

The script convert-to-biallelic.py is provided here.