DISCLAIMER

This is a fork of https://gitlab.com/bacazaux/haploblocks bacazaux's gitlab repository. No original license is given so any credit of the original work goes to https://gitlab.com/bacazaux/ . Forking this project to handle GAPS on a SNPs multiallelic matrix.

Only the PBWT part is forked, hence everything else is removed with a commit.

Finding all maximal perfect haplotype blocks

Problem definition

Given a matrix_file of k sequences, each of length n, over the alphabet {0,1}, find all maximal haplotype blocks (exact repeats starting at the same index position in two or more sequences) of size (height x width) at least b.

Output

Each maximal haplotype block is represented by a line

size (start-end,witness:#sequences)

where

start: the first column of the block
end: the last column of the block
witness: a line corresponding to this block
#sequences: number of sequences of the block
size: the size of the block ( = (end-start+1) x #sequences)

Example

For the following sequences

1 1 1 0 0 1 1 0 1 0 1 0 1 1 0
0 0 1 0 0 0 1 1 0 1 0 1 0 0 0
0 0 1 0 0 1 1 1 0 1 0 1 0 0 0
0 1 0 0 0 1 0 0 1 1 1 0 1 0 0

We have k = 4 and n = 15. For b = 10, we have 3 maximal haplotype blocks:

10 (0-4,1:2)

1 1 1 0 0 1 1 0 1 0 1 0 1 1 0
0 0 1 0 0 0 1 1 0 1 0 1 0 0 0
0 0 1 0 0 1 1 1 0 1 0 1 0 0 0
0 1 0 0 0 1 0 0 1 1 1 0 1 0 0

10 (2-6,2:2)

1 1 1 0 0 1 1 0 1 0 1 0 1 1 0
0 0 1 0 0 0 1 1 0 1 0 1 0 0 0
0 0 1 0 0 1 1 1 0 1 0 1 0 0 0
0 1 0 0 0 1 0 0 1 1 1 0 1 0 0

18 (6-14,1:2)

1 1 1 0 0 1 1 0 1 0 1 0 1 1 0
0 0 1 0 0 0 1 1 0 1 0 1 0 0 0
0 0 1 0 0 1 1 1 0 1 0 1 0 0 0
0 1 0 0 0 1 0 0 1 1 1 0 1 0 0

Algorithms and implementations

Trie algorithm

Compiling

# "Go to the directory"
cd trie
# "Compile"
make

Find blocks

./haploblocks <matrix_file> <b>

Example

./haploblocks ../tests/example-small.bm 10

Suffix Array algorithm

This python script implements the simplest solution ever (?).

Addition of distinct flags each 'b' positions in all sequences: thus the same flag occurs at the same position in all sequences
Concatenation of all these decorated sequences in a sequence s
Creation of the suffix array of s (including the LCP)
Detection of maximal repeats of size at least b

Find blocks

python haplotype-sa.py -f <matrix_file> -b <b> -o <outfile>

Direct algorithm

# "Go to the directory"
cd direct
# "Compile"
make

Find block

./hap -f <matrix_file> -b <b> -m <row_rise_type>

where row_rise_type corresponds how you want to read your data column by column.

mmap: mmap functions
precomp: precompute all the data in the ram
seek: use seek and get
col:

Example

./hap -f ../tests/example-small.bm -b 10

pBWT algorithm

Compiling

# "Go to the directory"
cd pbwt
# "Compile"
make

Find block

./haplotype-pbwt-lite -f <matrix_file> -b <b> -n <buffer_size>

where buffer_size corresponds to the size of buffer of each file. The maximum number of file that you need corresponds to x = k+1, you can increase the number with

ulimit -n <x>

Example

./haplotype-pbwt-lite -f ../tests/example-small.bm -b 10

Convert VCF file to binary matrix

Compiling

# "Go to the directory"
cd vcf2bm
# "Compile"
make

Convert a vcf file (e.g. from 1000 Genomes Project at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) to a binary matrix

./vcf2bm -v <vcf_file> -o <output_file> [-a at_once] [-h]

Generate data

Generating file

# "Go to the directory"
cd generate
# "Generate file"
python gen_haplotype.py Directory/ m n error_rate

Example

python gen_haplotype.py ../tests/ 4 20 1

General Makefile

# Compile all the c++ files
make
# Test all the implementations
make test

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DISCLAIMER

Finding all maximal perfect haplotype blocks

Problem definition

Output

Example

Algorithms and implementations

Trie algorithm

Suffix Array algorithm

Direct algorithm

pBWT algorithm

Convert VCF file to binary matrix

Generate data

General Makefile

About

Releases

Packages

Contributors 2

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
direct		direct
generate		generate
pbwt		pbwt
suffix_arrays		suffix_arrays
tests		tests
vcf2bm		vcf2bm
Makefile		Makefile
README.md		README.md

msgr0/g-haploblocks

Folders and files

Latest commit

History

Repository files navigation

DISCLAIMER

Finding all maximal perfect haplotype blocks

Problem definition

Output

Example

Algorithms and implementations

Trie algorithm

Suffix Array algorithm

Direct algorithm

pBWT algorithm

Convert VCF file to binary matrix

Generate data

General Makefile

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages