Elixir BioHackathon November 3-7, 2025
This pipeline generates recombination-defined genomic hashes, unique identifiers that contain haplotype-resolved variant information. Every hash is a unique representation of an individual's genotype with a distinct combination of variants segragated across haploblocks. We also provide an example model for a hash-based genotype-to-phenotype mapping.
git clone https://github.com/collaborativebioinformatics/Haploblock_Clusters_ElixirBH25.git
cd Haploblock_Clusters_ElixirBH25/
To create genomic hashes you can run the pipeline using Docker or step-by-step, as described below:
Build the Docker image (example data will be automatically downloaded into data/ inside the container):
docker build -t haploblock-pipeline .
Run the Docker container interactively:
docker run -it --rm \
-v host_mount_point/data:/app/Haploblock_Clusters_ElixirBH25/data \
haploblock-pipeline
Once inside the container, the pipeline can be run with:
cd haploblock_pipeline
python3 main.py --config config/default.yaml [optional arguments]
The default config file is in config/default.yaml. It defines input paths, pipeline settings, and parallelization parameters.
Optional arguments:
--threads <N> : Number of threads to use (default: available CPUs minus 1)
--step <1|2|3|4|5|all> : Execute only a specific pipeline step (default: 'all')
The pipeline can be also run in the non-interactive mode:
docker run --rm \
-v host_mount_point/data:/app/Haploblock_Clusters_ElixirBH25/data \
-v host_mount_point/config/default.yaml:/app/Haploblock_Clusters_ElixirBH25/config/default.yaml \
haploblock-pipeline \
python3 main.py --config config/default.yaml
For Python dependencies see requirements.txt. For other dependencies and how to install them see install_dependencies.md. Follow the instructions carefully: the pipeline requires samtools, bcftools, htslib (see https://www.htslib.org/) and MMSeqs2 (https://github.com/soedinglab/MMseqs2), all must be simlinked in /usr/bin or exported to PATH.
All data listed below must be downloaded into data/:
cd data/
- Recombination map from Halldorsson et al., 2019 with empirically defined genome-wide recombination rates:
## Download https://www.science.org/doi/suppl/10.1126/science.aau1043/suppl_file/aau1043_datas3.gz
## Upload the file to the data/ directory
gzip -d aau1043_datas3.gz
File aau1043_datas3 contains averaged maternal and paternal recombination rates.
- 1000Genomes, HGSVC, Phase 3
For the example, we use all populations (2548 samples) and chromosome 6. If you would like to run all chromosomes, download also phased VCF files for other chromosomes.
Phased VCF file for chromosome 6:
wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/ALL.chr6.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz
and index file (same point as about chromosomes):
wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/ALL.chr6.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz.tbi
- Reference sequence for chromosome 6 (GRCh38), must be bgzipped:
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr6.fa.gz
gzip -d chr6.fa.gz
## If you do not have bgzip, install it (see install_dependencies.txt)
bgzip chr6.fa
Optionally for testing, if you want to run the pipeline for one population from 1000Genomes, you will also need a TSV file with the list of samples downloaded in data/, eg. CHB can be downloaded from: https://www.internationalgenome.org/data-portal/population/CHB
python3 haploblock_pipeline/step1_haploblocks.py \
--recombination_file data/Halldorsson2019/aau1043_datas3 \
--chr 6 \
--out out_dir/
This step creates:
- haploblock boundaries: TSV file with header and 2 columns (START END)
For a detailed description of how haplobock boundaries are defined see https://github.com/jedrzejkubica/Elixir-BH-2025 as well as Halldorsson2019.
This step generates a phased FASTA file for each individual and each haploblock by extracting sequence segments defined by the haploblock boundaries from the input VCF. A temporary directory is created inside --out_dir, where the consensus phased haploblock sequences are stored. It also computes summary statistics on the number of variants per haploblock. These values are written to out_dir/variant_counts.tsv, a four-column table containing START, END, MEAN, STDEV.
NOTE: VCF file has "6" instead of "chr6", which is required by bcftools consensus, therefore create a file chr_map with one chromosome number-to-name mapping per line (e.g., "6 chr6").
Here is the instruction for one haploblock (TNFa) (example):
python haploblock_pipeline/step2_phased_sequences.py \
--boundaries_file data/haploblock_boundaries_chr6_TNFa.tsv \
--vcf data/ALL.chr6.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz \
--ref data/chr6.fa.gz \
--chr_map data/chr_map \
--chr 6 \
--out out_dir/TNFa/
Here is the instruction for all haploblocks:
python haploblock_pipeline/step2_phased_sequences.py \
--boundaries_file out_dir/haploblock_boundaries_chr6.tsv \
--vcf data/ALL.chr6.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz \
--ref data/chr6.fa.gz \
--chr_map data/chr_map \
--chr 6 \
--out out_dir/
Optionally, if you want to run it for one population, use a TSV file with samples from 1000Genomes (data/igsr-chb.tsv.tsv):
python haploblock_pipeline/step2_phased_sequences.py [...] --samples data/igsr-chb.tsv.tsv
This step produces a single merged phased FASTA file for each haploblock with sequences of all individuals.
Here is the instruction for one haploblock (TNFa):
haploblock_pipeline/step3_merge_fasta.py out_dir/TNFa/tmp/consensus_fasta out_dir/TNFa/haploblock_phased_seq_merged
## Remove the temporary directory out_dir/TNFa/tmp/ directory if you want
Here is the instruction for all haploblocks:
haploblock_pipeline/step3_merge_fasta.py out_dir/tmp/consensus_fasta out_dir/haploblock_phased_seq_merged
## Remove the temporary directory out_dir/tmp/ if you want
This step computes haploblock clusters using MMSeqs2. Make sure MMSeqs2 is available in PATH.
Here is the instruction for one haploblock (TNFa):
python haploblock_pipeline/step4_clusters.py \
--boundaries_file data/haploblock_boundaries_chr6_TNFa.tsv \
--merged_consensus_dir out_dir/TNFa/haploblock_phased_seq_merged \
--variant_counts out_dir/TNFa/variant_counts.tsv \
--chr 6 \
--out out_dir/TNFa/
Here is the instruction for all haploblocks:
python haploblock_pipeline/step4_clusters.py \
--boundaries_file out_dir/haploblock_boundaries_chr6.tsv \
--merged_consensus_dir out_dir/haploblock_phased_seq_merged \
--variant_counts out_dir/variant_counts.tsv \
--chr 6 \
--out out_dir/
This step uses the merged phased sequences (--merged_consensus_dir) and variant statistics (--variant_counts) to derive MMSeqs2 parameters: minimum sequence identity and coverage thresholds. It produces one cluster table per haploblock, saved in out_dir/clusters/.
Here is the instruction for one haploblock (TNFa):
python haploblock_pipeline/step5_variant_hashes.py \
--boundaries_file data/haploblock_boundaries_chr6_TNFa.tsv \
--clusters out_dir/TNFa/clusters/chr6_31480875-31598421_cluster.tsv \
--chr 6 \
--out out_dir/TNFa/
If you are not interested in SNPs, but only in haploblock clusters, you can stop here. If you want to compare these clusters to external sequence groups, you can extract cluster representatives from the merged FASTA files generated earlier (step 2). Optionally, if you want to run it for SNPs and/or one population use optional arguments:
python haploblock_pipeline/step5_variant_hashes.py [...] \
--variants data/variants_of_interest.txt \
--vcf data/ALL.chr6.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz \
--samples data/igsr-chb.tsv.tsv
Here is the instruction for all haploblocks, please run it separately for each cluster TSV file:
python haploblock_pipeline/step5_variant_hashes.py \
--boundaries_file data/haploblock_boundaries_chr6.tsv \
--clusters out_dir/clusters/cluster_file.tsv \
--chr 6 \
--out out_dir/
This step generates variants hashes, 64-character strings of 0/1s. Each individual hash contains:
- strand hash: 4 characters
- chromosome hash: 10 characters
- haploblock hash: 20 characters
- cluster hash: 20 characters
- variant hash: the number of SNPs of interest (optional)
This step generates:
- haploblock hashes: TSV file with header and 3 columns (START\tEND\tHASH)
- cluster hashes: TSV file with header and 2 columns (CLUSTER\tHASH)
- variant hashes (optional): TSV file with header and 2 columns (VARIANT\tHASH)
- individual hashes: TSV file with header and 2 columns (INDIVUDIAL\tHASH)
A simple model to relate the phenotype
where-
and
The parameters
This formulation allows each haploblock to contribute through a learnable haploblock-specific projection vector
We found 1399 haploblocks in chromosome 6. See haploblock_boundaries_chr6.tsv for these haploblock boundaries (high recombination rates are selected based on Gaussian smoothing).
We generated haploblock phased sequences (format: sample_chr_region_start-end_hap0/1.fa) for the CBH, PUR and GBR populations for the following genomic regions:
- 10 random haploblocks of chromosome 6
- 5 random haploblocks of chromosome 6
- the haploblocks overlapping with TNFa
- the haploblocks overlapping with genes related to human height
- haploblock phased sequences and haploblock hashes for TNFa for all populations
During the hackathon we ran the pipeline on a Linux-based machine with 8 CPU cores and 16 GB of memory.
This work was supported by ELIXIR, the research infrastructure for life science data, and conducted at the ELIXIR BioHackathon Europe.
-
Palsson, G., Hardarson, M.T., Jonsson, H. et al. Complete human recombination maps. Nature 639, 700–707 (2025). https://doi.org/10.1038/s41586-024-08450-5
-
Bjarni V. Halldorsson et al., Characterizing mutagenic effects of recombination through a sequence-level genetic map. Science363,eaau1043 (2019). DOI:10.1126/science.aau1043
-
Yu-Hsiang Tseng, Sumit Walia, Yatish Turakhia, "Ultrafast and ultralarge multiple sequence alignments using TWILIGHT", Bioinformatics, Volume 41, Issue Supplement_1, July 2025, Pages i332–i341, doi: 10.1093/bioinformatics/btaf212