Paper: Prohlatype: A Probabilistic Framework for HLA Typing¹
This project provides a set of tools to calculate the full posterior distribution of HLA types given read data.
Instead of:
A1 | A2 | B1 | B2 | C1 | C2 | Reads | Objective |
---|---|---|---|---|---|---|---|
A*31:01 | A*02:01 | B*45:01 | B*15:03 | C*16:01 | C*02:10 | 538.0 | 513.79 |
one can calculate:
Allele 1 | Allele 2 | Log P | P |
---|---|---|---|
A*02:05:01:01 | A*30:114 | -23046.81 | 0.5000 |
A*02:05:01:01 | A*30:01:01 | -23046.81 | 0.5000 |
A*02:05:01:01 | A*30:106 | -23103.15 | 0.0000 |
A*02:05:01:02 | A*30:114 | -23146.35 | 0.0000 |
... | |||
B*07:36 | B*57:03:01:02 | -13717.33 | 0.5000 |
B*07:36 | B*57:03:01:01 | -13717.33 | 0.5000 |
B*07:36 | B*57:03:03 | -13804.74 | 0.0000 |
B*27:157 | B*57:03:01:02 | -13816.17 | 0.0000 |
... | |||
C*06:103 | C*18:10 | -11936.35 | 0.3338 |
C*06:103 | C*18:02 | -11936.36 | 0.3331 |
C*06:103 | C*18:01 | -11936.36 | 0.3331 |
C*15:102 | C*18:02 | -11951.72 | 0.0000 |
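For orientation, and as a reading of the numbers above rather than a statement of the implementation: the P column behaves like a posterior obtained by normalizing the exponentiated Log P column over the candidate pairs at a locus,

$$P_i = \frac{e^{\log P_i}}{\sum_j e^{\log P_j}},$$

so, for example, the two A* pairs that tie at -23046.81 each receive probability 0.5000, while the pair at -23103.15 (roughly 56 log units lower) is driven to 0.0000.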
To install prohlatype:

- If you are running on Linux, standalone binaries are available with each release.
- Use the linked Docker image.
- Build the software from source:
  a. Install opam.
  b. Make sure that the opam packages are up to date:
     $ opam update
  c. Make sure that you're on the relevant compiler:
     $ opam switch 4.06.0
     $ eval `opam config env`
  d. Get source:
     $ git clone https://github.com/hammerlab/prohlatype.git prohlatype
     $ cd prohlatype
  e. Install the dependent packages:
     $ make setup
  f. Build the programs (afterwards they'll be in `_build/default/src/apps`):
     $ make
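As an optional sanity check (assuming the build placed `align2fasta` and `multi_par` in `_build/default/src/apps` as noted in step f), each tool should print its usage text:

    # Ask each built tool for its help text to confirm the build worked.
    $ ./_build/default/src/apps/align2fasta --help
    $ ./_build/default/src/apps/multi_par --help

You may want to copy these binaries somewhere on your PATH for the steps below.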
To perform typing:

- Make sure that you have IMGT/HLA available:
  $ git clone https://github.com/ANHIG/IMGTHLA.git imgthla
- Create an imputed HLA reference sequence via `align2fasta`. This step makes sure that all alleles have sequence information that spans the entire locus. This way, reads that originate from a region for which we normally do not have sequence information will still align (in the next filtering step), albeit poorly:
  $ align2fasta path-to-imgthla/alignments -o imputed_hla_class_I
  This step needs to be performed only once per IMGT version. Run `align2fasta --help` for further information.
- Filter your data against the reference by first aligning (note that `bwa mem` needs a BWA index of the reference; see the note after these steps). For example:
  $ bwa mem imputed_hla_class_I.fasta ${SAMPLE}.fastq | \
      samtools view -F 4 -bT imputed_hla_class_I.fasta -o ${SAMPLE}.bam
  This filtering step is necessary because, while the algorithms here are fundamentally alignment based, they are too slow to run on all sequences; reads that do not originate from the HLA region would only act as background noise.
- Then convert the aligned reads back to FASTQ:
  $ samtools fastq ${SAMPLE}.bam > ${SAMPLE}_filtered.fastq
- Infer types (see `multi_par --help` for further details):
  $ multi_par path-to-imgthla/alignments ${SAMPLE}_filtered.fastq -o ${SAMPLE}_output.tsv
  A sketch for inspecting the output appears after these steps.
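A note on the alignment step above: `bwa mem` needs a BWA index of the reference before it can align against it. A minimal sketch, assuming the imputed reference was written to `imputed_hla_class_I.fasta` as in the example above and that a standard `bwa` is installed:

    # Build the BWA index files (.amb, .ann, .bwt, .pac, .sa) next to the FASTA;
    # this only needs to be done once per imputed reference.
    $ bwa index imputed_hla_class_I.fasta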
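Once `multi_par` finishes, the resulting table can be inspected directly. A rough sketch, assuming the output is a plain tab-separated file with the four columns shown at the top of this README (the exact layout may differ between versions):

    # Pretty-print the table with aligned columns.
    $ column -t -s$'\t' ${SAMPLE}_output.tsv | less

    # Or sort by the posterior column (column 4) to see the most probable pairs first.
    $ sort -t$'\t' -k4,4 -nr ${SAMPLE}_output.tsv | head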
Note: The script `src/scripts/run-example-docker.sh` provides an end-to-end example of the above. It depends only on `docker`, `wget`, and `git`; it fetches the data and runs everything in a Docker container (see `sh src/scripts/run-example-docker.sh help`).
¹ All versions of this software after 0.8.0 incorporate an important coverage likelihood that is not described in the previous paper. At the moment a short addendum describing the approach is in limbo; please contact me by email for a reference.