A: Genotyping variation nested inside of bubbles

Jana Ebler edited this page Aug 22, 2023 · 4 revisions

Any VCF following the format described here can be used as input to PanGenie in order to genotype bubbles in the pangenome graph. In many cases, however, a bubble in the graph does not represent a single variant but a combination of many individual variants present in the haplotypes in the corresponding genomic region. In other words, bubbles often contain many nested variant alleles. To derive genotypes for variant alleles nested inside graph bubbles, we typically produce PanGenie input VCFs containing special annotations that encode a decomposition of each bubble into the individual variant alleles it is composed of. After genotyping the bubbles with PanGenie, these annotations can be used to translate bubble genotypes into genotypes for the nested alleles. For this purpose, our pipelines producing PanGenie-ready VCFs always produce two VCF files: a multi-allelic graph-VCF representing bubbles in the graph (the PanGenie input VCF) and a bi-allelic callset-VCF defining all the individual variant alleles nested inside the graph bubbles.

In the multi-allelic graph-VCF (top in Figure above), each record represents a bubble in the graph and lists all paths covered by at least one haplotype as the alternative allele sequences. Each alternative allele is annotated in the INFO field with a colon-separated sequence of variant IDs indicating which individual variant alleles it is composed of (since bubbles are usually composed of many individual variant alleles). The bi-allelic callset-VCF (bottom in Figure above) contains one separate record for each such variant ID. Both VCFs describe the same genetic variation, but represent it differently. The graph-VCF is used as input to PanGenie for genotyping. Using the annotations, the resulting bubble genotypes can be translated into genotypes for each individual variant ID using the callset-VCF. This makes it possible to properly analyze variant alleles contained inside bubbles. How we produce these annotations depends on the data. For HPRC graphs, for example, the annotations are computed by analyzing allele traversals in the graph.
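The translation from bubble genotypes to nested-allele genotypes can be illustrated with a minimal Python sketch. It is not the code of convert-to-biallelic.py and only shows the idea under stated assumptions: the INFO key holding the annotation is assumed to be called `ID`, with one colon-separated ID list per alternative allele and the lists separated by commas, and the variant IDs `var1` to `var3` are hypothetical.

```python
def parse_allele_ids(info, key="ID"):
    """Return, for each ALT allele, the list of variant IDs it is composed of.

    Assumes an INFO entry such as 'ID=var1:var2,var2:var3' (the key name is an assumption).
    """
    for field in info.split(";"):
        name, _, value = field.partition("=")
        if name == key:
            return [alt.split(":") for alt in value.split(",")]
    raise ValueError(f"INFO field has no {key!r} annotation")


def translate_genotype(bubble_gt, allele_ids):
    """Map a bubble genotype such as '1/2' to one bi-allelic genotype per variant ID."""
    sep = "|" if "|" in bubble_gt else "/"
    alleles = bubble_gt.split(sep)
    # Collect every variant ID once, in first-seen order across the ALT alleles.
    variant_ids = list(dict.fromkeys(v for ids in allele_ids for v in ids))
    result = {}
    for vid in variant_ids:
        calls = []
        for allele in alleles:
            if allele == ".":
                calls.append(".")  # missing bubble allele stays missing
            elif allele == "0":
                calls.append("0")  # reference path carries no variant allele
            else:
                # ALT allele k carries the variant if its ID list contains it
                calls.append("1" if vid in allele_ids[int(allele) - 1] else "0")
        result[vid] = sep.join(calls)
    return result


allele_ids = parse_allele_ids("AF=0.5;ID=var1:var2,var2:var3")
genotypes = translate_genotype("1/2", allele_ids)
```

Here a sample carrying bubble alleles 1 and 2 is reported as heterozygous for `var1` and `var3` and homozygous for `var2`, because `var2` is part of both alternative paths.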

Note that this decomposition procedure is useful in many cases, but PanGenie can still be run on VCFs that do not contain these special annotations. In fact, PanGenie does not use the annotations at all. They only enable an additional postprocessing step that helps to analyze variation encoded inside bubbles.

For VCFs following the format described in this section, these commands can be used for genotyping:

# run PanGenie (v3.0.0) preprocessing
PanGenie-index -v <graph-vcf> -r <reference-genome> -t 24 -o index

# run PanGenie (v3.0.0) on a specific sample (using 24 cores), produces genotyped VCF "pangenie_genotyping.vcf".
# to genotype multiple samples, run this command on each sample separately. PanGenie-index needs to be run only once.
PanGenie -f index -i <input-reads> -o pangenie -j 24 -t 24

# decompose bubbles and produce a bi-allelic VCF with genotypes for each (nested) allele
cat pangenie_genotyping.vcf | python3 convert-to-biallelic.py <callset-VCF> > pangenie_genotyping_biallelic.vcf

The script convert-to-biallelic.py is provided here.
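When several samples are genotyped, the index is built once and the genotyping command is repeated per sample. The following Python sketch only assembles the command lines shown above. The flags are copied from those commands, while the sample names, file names, and helper functions are hypothetical. It assumes the `PanGenie-index` and `PanGenie` executables are on the `PATH`.

```python
def build_commands(graph_vcf, reference, samples, threads=24):
    """Build one indexing command and one genotyping command per sample.

    `samples` maps a sample name (used as output prefix) to its reads file.
    """
    index_cmd = ["PanGenie-index", "-v", graph_vcf, "-r", reference,
                 "-t", str(threads), "-o", "index"]
    genotype_cmds = {}
    for name, reads in samples.items():
        genotype_cmds[name] = ["PanGenie", "-f", "index", "-i", reads,
                               "-o", name, "-j", str(threads), "-t", str(threads)]
    return index_cmd, genotype_cmds


def output_vcf(prefix):
    """Name of the genotyped VCF written for a given -o prefix."""
    return f"{prefix}_genotyping.vcf"


# Hypothetical input files, for illustration only.
index_cmd, genotype_cmds = build_commands(
    "graph.vcf", "reference.fa",
    {"sampleA": "sampleA.fastq", "sampleB": "sampleB.fastq"},
)
```

Each list can then be passed to `subprocess.run(cmd, check=True)`, running `index_cmd` first and the per-sample commands afterwards.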

The first two commands run PanGenie (indexing and genotyping), and the final command uses the annotations in the VCFs to translate bubble genotypes into genotypes for all variant alleles. Thus, the final VCF contains exactly the same records as the callset-VCF, just with genotypes added for all these variants.
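The statement that both files contain the same records can be checked with a short Python sketch. It assumes that a record is identified by its first five VCF columns (CHROM, POS, ID, REF, ALT) and that record order does not matter. The example lines are made up for illustration.

```python
def record_keys(lines):
    """Return (CHROM, POS, ID, REF, ALT) for every non-header VCF line."""
    keys = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        keys.append(tuple(fields[:5]))
    return keys


def same_records(callset_lines, genotyped_lines):
    """True if both VCFs list the same records, ignoring order and genotype columns."""
    return sorted(record_keys(callset_lines)) == sorted(record_keys(genotyped_lines))


# Made-up records: the genotyped file adds FORMAT and sample columns only.
callset = [
    "##fileformat=VCFv4.2\n",
    "chr1\t100\tvar1\tA\tT\t.\tPASS\t.\n",
    "chr1\t120\tvar2\tG\tGA\t.\tPASS\t.\n",
]
genotyped = [
    "##fileformat=VCFv4.2\n",
    "chr1\t100\tvar1\tA\tT\t.\tPASS\t.\tGT\t0/1\n",
    "chr1\t120\tvar2\tG\tGA\t.\tPASS\t.\tGT\t1/1\n",
]
records_match = same_records(callset, genotyped)
```

With real files, the two arguments can be open file handles, for example `same_records(open(callset_path), open(biallelic_path))`, where both paths are placeholders.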