This is a fork of https://gitlab.com/bacazaux/haploblocks bacazaux's gitlab repository. No original license is given so any credit of the original work goes to https://gitlab.com/bacazaux/ . Forking this project to handle GAPS on a SNPs multiallelic matrix.
Only the PBWT part is forked, hence everything else is removed with a commit.
Given a matrix_file of k sequences, each of length n, over the alphabet {0,1}, find all maximal haplotype blocks (exact repeats starting at the same index position in two or more sequences) of size (height x width) at least b.
Each maximal haplotype block is represented by a line
size (start-end,witness:#sequences)
where
- start: the first column of the block
- end: the last column of the block
- witness: a line corresponding to this block
- #sequences: number of sequences of the block
- size: the size of the block ( = (end-start+1) x #sequences)
For the following sequences
1 1 1 0 0 1 1 0 1 0 1 0 1 1 0
0 0 1 0 0 0 1 1 0 1 0 1 0 0 0
0 0 1 0 0 1 1 1 0 1 0 1 0 0 0
0 1 0 0 0 1 0 0 1 1 1 0 1 0 0
We have k = 4 and n = 15. For b = 10, we have 3 maximal haplotype blocks:
- 10 (0-4,1:2)
1 1 1 0 0 1 1 0 1 0 1 0 1 1 0
0 0 1 0 0 0 1 1 0 1 0 1 0 0 0
0 0 1 0 0 1 1 1 0 1 0 1 0 0 0
0 1 0 0 0 1 0 0 1 1 1 0 1 0 0
- 10 (2-6,2:2)
1 1 1 0 0 1 1 0 1 0 1 0 1 1 0
0 0 1 0 0 0 1 1 0 1 0 1 0 0 0
0 0 1 0 0 1 1 1 0 1 0 1 0 0 0
0 1 0 0 0 1 0 0 1 1 1 0 1 0 0
- 18 (6-14,1:2)
1 1 1 0 0 1 1 0 1 0 1 0 1 1 0
0 0 1 0 0 0 1 1 0 1 0 1 0 0 0
0 0 1 0 0 1 1 1 0 1 0 1 0 0 0
0 1 0 0 0 1 0 0 1 1 1 0 1 0 0
Compiling
# "Go to the directory"
cd trie
# "Compile"
make
Find blocks
./haploblocks <matrix_file> <b>
Example
./haploblocks ../tests/example-small.bm 10
This python script implements the simplest solution ever (?).
- Addition of distinct flags each 'b' positions in all sequences: thus the same flag occurs at the same position in all sequences
- Concatenation of all these decorated sequences in a sequence s
- Creation of the suffix array of s (including the LCP)
- Detection of maximal repeats of size at least b
Find blocks
python haplotype-sa.py -f <matrix_file> -b <b> -o <outfile>
# "Go to the directory"
cd direct
# "Compile"
make
Find block
./hap -f <matrix_file> -b <b> -m <row_rise_type>
where row_rise_type corresponds how you want to read your data column by column.
- mmap: mmap functions
- precomp: precompute all the data in the ram
- seek: use seek and get
- col:
Example
./hap -f ../tests/example-small.bm -b 10
Compiling
# "Go to the directory"
cd pbwt
# "Compile"
make
Find block
./haplotype-pbwt-lite -f <matrix_file> -b <b> -n <buffer_size>
where buffer_size corresponds to the size of buffer of each file. The maximum number of file that you need corresponds to x = k+1, you can increase the number with
ulimit -n <x>
Example
./haplotype-pbwt-lite -f ../tests/example-small.bm -b 10
Compiling
# "Go to the directory"
cd vcf2bm
# "Compile"
make
Convert a vcf file (e.g. from 1000 Genomes Project at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) to a binary matrix
./vcf2bm -v <vcf_file> -o <output_file> [-a at_once] [-h]
Generating file
# "Go to the directory"
cd generate
# "Generate file"
python gen_haplotype.py Directory/ m n error_rate
Example
python gen_haplotype.py ../tests/ 4 20 1
# Compile all the c++ files
make
# Test all the implementations
make test