Jana Ebler edited this page Aug 3, 2023 · 7 revisions

This page will explain in detail how to run PanGenie on different datasets.

THIS PAGE IS CURRENTLY UNDER CONSTRUCTION

Required input VCF

PanGenie expects a directed and acyclic pangenome graph as input (-v option). This graph is represented in terms of a VCF file that needs to have certain properties:

  • multi-sample - it needs to contain haplotype information of at least one known sample
  • fully-phased - haplotype information of the known panel samples is represented by phased genotypes, and each sample must be phased in one single block (i.e. from start to end).
  • non-overlapping variants - the VCF represents a pangenome graph. Therefore, overlapping variation must be represented in a single, multi-allelic variant record.

Note especially the third property listed above. See the figure below for an illustration of how overlapping variant alleles need to be represented in the input VCF provided to PanGenie.

We typically generate such VCFs from haplotype-resolved assemblies (see below). However, any VCF with the properties listed above can be used as input. Note again that the haplotypes of each sample must be phased into a single block. Phased VCFs generated by read-based phasing tools like WhatsHap, which typically contain several phase blocks per chromosome, are therefore not suitable!
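The properties listed above can be partly checked before running PanGenie. The following is a minimal sketch, not part of PanGenie: the function name `check_pangenie_vcf` is hypothetical, and the code assumes an uncompressed, position-sorted VCF in which GT is the first FORMAT key. It reports missing sample columns, unphased genotypes and records whose reference span overlaps the previous record. It does not verify that each sample is phased in one single block.

```python
import io


def check_pangenie_vcf(handle):
    """Return a list of problems found in a VCF text stream (hypothetical helper)."""
    problems = []
    samples = []
    last_end = {}  # chromosome -> last reference position covered so far (1-based)
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]
            if not samples:
                problems.append("no sample columns: at least one panel sample is required")
            continue
        chrom, pos, ref = fields[0], int(fields[1]), fields[3]
        # Non-overlap: a record must start after the REF span of all earlier records.
        if pos <= last_end.get(chrom, 0):
            problems.append(f"{chrom}:{pos} overlaps a previous record")
        last_end[chrom] = max(last_end.get(chrom, 0), pos + len(ref) - 1)
        # Phasing: assumes GT is the first FORMAT key; '/' marks an unphased genotype.
        for sample, value in zip(samples, fields[9:]):
            genotype = value.split(":")[0]
            if "/" in genotype:
                problems.append(f"{chrom}:{pos} sample {sample} is unphased ({genotype})")
    return problems


# Toy input: the second record lies inside the deletion of the first and is unphased.
example = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n"
    "chr1\t10\t.\tACG\tA\t.\tPASS\t.\tGT\t0|1\n"
    "chr1\t11\t.\tC\tT\t.\tPASS\t.\tGT\t0/1\n"
)
for problem in check_pangenie_vcf(io.StringIO(example)):
    print(problem)
```

An empty list means that none of the checked properties is violated; it does not guarantee that the file is a valid PanGenie input.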

Bubble decomposition

Where possible, we produce PanGenie input VCFs that contain special annotations encoding the decomposition of graph bubbles. In a pangenome graph, bubbles can get very large and often contain many individual variant alleles that overlap between the sample haplotypes. PanGenie is designed to genotype such bubbles in the graph. To analyze the individual variant alleles nested inside these bubbles, however, we first need to convert the bubble genotypes into genotypes for each of these variant alleles. For this purpose, our pipelines producing PanGenie-ready VCFs always produce two VCF files: a multi-allelic graph-VCF representing bubbles in the graph (PanGenie input VCF) and a bi-allelic callset-VCF defining all the individual variant alleles nested inside of the graph bubbles.

In the multi-allelic graph-VCF (top in Figure above), each record represents a bubble in the graph and lists all paths covered by at least one haplotype as the alternative allele sequences. Each such alternative allele is annotated in the INFO field with a sequence of variant IDs (separated by colons), indicating which individual variant alleles it is composed of (since bubbles are usually composed of many individual variant alleles). The bi-allelic callset-VCF (bottom in Figure above) contains one separate record for each such variant ID. Both VCFs describe the same genetic variation, but represent it in different ways. The graph-VCFs are used as input to PanGenie for genotyping. Using the annotations, the resulting bubble genotypes can be translated into genotypes for each individual variant ID using the callset-VCF. This makes it possible to properly analyze variant alleles contained inside of bubbles.
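The translation from bubble genotypes to per-variant genotypes can be illustrated with a small sketch. This is not the code of convert-to-biallelic.py: the function name `decompose_bubble_genotype` is hypothetical, and because the exact INFO key holding the annotation is not given here, the function takes the colon-separated ID strings directly, one per alternative allele. A variant ID receives allele 1 on a haplotype if the bubble allele chosen for that haplotype contains the ID, and allele 0 otherwise.

```python
def decompose_bubble_genotype(genotype, allele_ids):
    """Translate one bubble genotype into per-variant genotypes (illustrative only).

    genotype   -- bubble genotype such as "1/2" or "0|1" (0 = reference path)
    allele_ids -- one colon-separated variant ID string per ALT allele, in ALT order
    """
    separator = "|" if "|" in genotype else "/"
    # Variant IDs carried by each bubble allele; the reference path carries none.
    ids_per_allele = [set()] + [set(ids.split(":")) for ids in allele_ids]
    result = {}
    for variant_id in sorted(set().union(*ids_per_allele)):
        calls = []
        for allele in genotype.split(separator):
            if allele == ".":
                calls.append(".")  # a missing bubble allele stays missing
            elif variant_id in ids_per_allele[int(allele)]:
                calls.append("1")
            else:
                calls.append("0")
        if separator == "/":
            calls.sort()  # unphased genotypes carry no order, so normalise them
        result[variant_id] = separator.join(calls)
    return result


# A bubble with two ALT paths: the first carries variants a and b, the second only b.
print(decompose_bubble_genotype("1/2", ["a:b", "b"]))
```

In this toy example, variant b is present on both genotyped paths and becomes homozygous, while variant a is present on only one of them and becomes heterozygous.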

This section describes ways to produce PanGenie-ready input VCF files.

Creating PanGenie input VCFs from haplotype-resolved assemblies

We have written a pipeline that calls variants from haplotype-resolved assemblies of human samples and generates a graph-VCF to be used as input to PanGenie. This pipeline is available here: https://bitbucket.org/jana_ebler/vcf-merging/src/master/pangenome-graph-from-assemblies/. The pipeline produces two output VCFs: a multi-allelic graph-VCF and a bi-allelic callset-VCF, formatted as described in the section "Bubble decomposition" above.

Creating PanGenie input VCFs from existing callsets

TODO

Running PanGenie on HPRC data

Using PanGenie-ready VCFs produced by HPRC

For the HPRC Minigraph-Cactus graph published in https://doi.org/10.1038/s41586-023-05896-x, we have generated PanGenie-ready VCFs containing haplotype data from 44 human samples (88 haplotypes). VCFs were generated based on GRCh38 and CHM13. They are available at:

Dataset                       PanGenie input VCF   Callset VCF
HPRC-GRCh38 (88 haplotypes)   graph-VCF            callset-VCF
HPRC-CHM13 (88 haplotypes)    graph-VCF            callset-VCF

For each dataset, there are two VCFs: a multi-allelic graph-VCF (second column), which represents the pangenome graph and is used as input to PanGenie, and a bi-allelic callset-VCF (third column), which describes all variant alleles contained in the bubbles of the pangenome graph. The VCFs are formatted as described above.

To run PanGenie using HPRC data, use the following commands and the VCFs provided in the table above:

# run PanGenie (using 24 cores), produces genotyped VCF "pangenie_genotyping.vcf"
PanGenie -i <input-reads> -v <graph-vcf> -r <reference-genome> -o pangenie -j 24 -t 24


# decompose bubbles and produce a bi-allelic VCF with genotypes for each (nested) allele
cat pangenie_genotyping.vcf | python3 convert-to-biallelic.py <callset-VCF> > pangenie_genotyping_biallelic.vcf

The result is a bi-allelic VCF containing exactly the same variants as the callset-VCF, just with genotypes added that were derived from the PanGenie genotypes computed for all graph bubbles. The VCF can then be used in downstream analyses.
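When several samples are processed, the two steps above can be driven from a script. The following sketch assembles exactly the two commands shown above; the function names `pangenie_commands` and `run_sample` are hypothetical, and the code assumes that the `PanGenie` executable is on the PATH and that convert-to-biallelic.py is in the working directory.

```python
import subprocess


def pangenie_commands(reads, graph_vcf, reference, callset_vcf, prefix="pangenie", cores=24):
    """Build the argument lists and file names for the two steps (hypothetical helper)."""
    genotype_cmd = [
        "PanGenie", "-i", reads, "-v", graph_vcf, "-r", reference,
        "-o", prefix, "-j", str(cores), "-t", str(cores),
    ]
    convert_cmd = ["python3", "convert-to-biallelic.py", callset_vcf]
    genotyped_vcf = f"{prefix}_genotyping.vcf"  # written by PanGenie for this prefix
    biallelic_vcf = f"{prefix}_genotyping_biallelic.vcf"
    return genotype_cmd, convert_cmd, genotyped_vcf, biallelic_vcf


def run_sample(reads, graph_vcf, reference, callset_vcf, prefix="pangenie", cores=24):
    """Genotype one sample, then decompose the bubble genotypes."""
    genotype_cmd, convert_cmd, genotyped_vcf, biallelic_vcf = pangenie_commands(
        reads, graph_vcf, reference, callset_vcf, prefix, cores
    )
    subprocess.run(genotype_cmd, check=True)
    # Same effect as: cat <genotyped> | python3 convert-to-biallelic.py <callset> > <biallelic>
    with open(genotyped_vcf) as source, open(biallelic_vcf, "w") as target:
        subprocess.run(convert_cmd, stdin=source, stdout=target, check=True)
    return biallelic_vcf
```

Using a distinct `prefix` per sample keeps the output files of different samples apart.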

Preparing PanGenie-ready VCFs from Minigraph-Cactus graphs

You can also generate your own PanGenie-ready VCFs from a Minigraph-Cactus graph. To do so, you need the raw VCFs produced from the graph using vg deconstruct, as well as the GFA file of the graph itself. For the HPRC MC-graph, the raw VCFs are available from https://github.com/human-pangenomics/hpp_pangenome_resources/tree/main ("Raw VCF" in section "Minigraph-Cactus").

The pipeline provided at https://github.com/eblerjana/genotyping-pipelines/tree/main/prepare-vcf-MC can then be used to produce a graph-VCF and the corresponding callset-VCF, in the same format as described in the section "Using PanGenie-ready VCFs produced by HPRC" above.

Running PanGenie on HGSVC data

TODO