Skip to content

FORAlign: Accelerating gap-affine DNA pairwise sequence alignment using FOR-blocks based on FOur Russians approach with linear space complexity

Notifications You must be signed in to change notification settings

wym6912/FORAlign

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

FORAlign: Accelerating gap-affine DNA pairwise sequence alignment using t-blocks based on FOur Russians approach with linear space complexity

FORAlign is a library written in C++17 (C++20 for some features) for speeding up Hirschberg algorithm using Four-Russians approach in multithreads. It runs on Linux and Windows.

Test Environment

In order to test our methods, we use the benchmark program developed by WFA2, and we developed two programs to test our two methods: the align_benchmark program has tested russians-multi methods, and the multiswg_benchmark has tested the condition-variable method. The usage for align_benchmark is shown as follows, our methods is in [Gap-affine (Smith-Waterman-Gotoh)], which called gap-affine-hirschberg-multi and gap-affine-russians-multi:

USE: ./align_benchmark -a ALGORITHM -i PATH
      Options::
        [Algorithm]
          --algorithm|a ALGORITHM
            [Indel (Longest Common Subsequence)]
              indel-wfa
            [Edit (Levenshtein)]
              edit-bpm
              edit-dp
              edit-dp-banded
              edit-wfa
            [Gap-linear (Needleman-Wunsch)]
              gap-linear-nw
              gap-linear-wfa
            [Gap-affine (Smith-Waterman-Gotoh)]
              gap-affine-swg
              gap-affine-swg-banded
              gap-affine-wfa
              gap-affine-hirschberg
              gap-affine-hirschberg-multi
              gap-affine-russians-multi
            [Gap-affine-2pieces (Concave 2-pieces)]
              gap-affine2p-dp
              gap-affine2p-wfa
        [Input & Output]
          --input|i PATH
          --output|o PATH
          --output-full PATH
          --input-ref|I PATH (Reference alignment file)
        [Penalties]
          --linear-penalties|p M,X,I
          --affine-penalties|g M,X,O,E
          --affine2p-penalties M,X,O1,E1,O2,E2
        [Wavefront parameters]
          --wfa-score-only
          --wfa-span 'global'|'extension'|'ends-free[,P0,Pf,T0,Tf]'
          --wfa-memory 'high'|'med'|'low'|'ultralow'
          --wfa-heuristic STRATEGY
          --wfa-heuristic-parameters  P1,P2[,P3]
            [STRATEGY='banded-static']
              P1 = minimum-diagonal-band (e.g., -100)
              P2 = maximum-diagonal-band (e.g., +100)
            [STRATEGY='banded-adaptive']
              P1 = minimum-diagonal-band (e.g., -100)
              P2 = maximum-diagonal-band (e.g., +100)
              P3 = steps-between-cutoffs
            [STRATEGY='wfa-adaptive']
              P1 = minimum-wavefront-length
              P2 = maximum-difference-distance
              P3 = steps-between-cutoffs
            [STRATEGY='xdrop']
              P1 = x-drop
              P2 = steps-between-cutoffs
            [STRATEGY='zdrop']
              P1 = z-drop
              P2 = steps-between-cutoffs
          --wfa-max-memory BYTES
          --wfa-max-steps INT
          --wfa-max-threads INT (intra-parallelism; default=1)
        [Russians Parameter]
          --russian-block|B INT (Russians block size; default=50)
        [Multithread Parameters]
          --dp-threads|D INT
             (SWG multithread/Hirschberg/Russians DP; default=1)
          --divide-threads|E INT (Hirschberg/Russians main; default=1)
        [Other Parameters]
          --bandwidth INT
        [Misc]
          --check|c 'correct'|'score'|'alignment'
          --check-distance 'indel'|'edit'|'linear'|'affine'|'affine2p'
          --check-bandwidth INT
          --plot
        [System]
          --num-threads|t INT
          --batch-size INT
          --progress|P INT
          --verbose|v INT
          --quiet|q
          --help|h

The usage for multiswg_benchmark is shown as follows, our method is gap-affine-swg-multithread:

USE: ./align_benchmark -a ALGORITHM -i PATH
      Options::
        [Algorithm]
          --algorithm|a ALGORITHM
            [Gap-affine (Smith-Waterman-Gotoh)]
              gap-affine-swg
              gap-affine-swg-multithread
        [Input & Output]
          --input|i PATH
          --output|o PATH
          --output-full PATH
          --input-ref|I PATH (Reference alignment file)
        [Penalties]
          --affine-penalties|g M,X,O,E
        [Multithread Parameters]
          --dp-threads|D INT
             (SWG multithread DP; default=1)
          --use-barrier|B  use barrier to align (default=false)
        [Misc]
          --check|c 'correct'|'score'|'alignment'
        [System]
          --num-threads|t INT
          --batch-size INT
          --progress|P INT
          --verbose|v INT
          --quiet|q
          --help|h

The tests in our paper is shown as follows:

Test method Name Test Command
(Run the following programs in bin folder)
benchmark (B) ./multiswg_benchmark -a gap-affine-swg -i $file_name -o $answer_name benchmark will output alignment CIGAR result for testing other methods
wfa-high (Wh) ./align_benchmark -a gap-affine-wfa --wfa-memory high -i $file_name -I $answer_name -c alignment --wfa-max-threads $cpus
wfa-med (Wm) ./align_benchmark -a gap-affine-wfa --wfa-memory med -i $file_name -I $answer_name -c alignment --wfa-max-threads $cpus
wfa-low (Wl) ./align_benchmark -a gap-affine-wfa --wfa-memory low -i $file_name -I $answer_name -c alignment --wfa-max-threads $cpus
wfa-ultra-low (Wul) ./align_benchmark -a gap-affine-wfa --wfa-memory ultralow -i $file_name -I $answer_name -c alignment --wfa-max-threads $cpus
hirschberg-raw (H1D1) ./align_benchmark -a gap-affine-hirschberg -i $file_name -I $answer_name -c alignment
hirschberg-single (HTD1) ./align_benchmark -a gap-affine-hirschberg-multi -i $file_name -I $answer_name -c alignment -E $cpus -D 1
hirschberg-multi (HTDT) ./align_benchmark -a gap-affine-hirschberg-multi -i $file_name -I $answer_name -c alignment -E $cpus -D $dpcpus $dpcpus is same as expr $cpus / 2
russians-multi-$t$ ($R_t$) ./align_benchmark -a gap-affine-russians-multi -i $file_name -I $answer_name -c alignment -E $cpus -D $dpcpus -B $blocksize $dpcpus is same as expr $cpus / 2
$blocksize is same as $t$
swg-barrier ./multiswg_benchmark -a gap-affine-swg-multithread -i $file_name -I $answer_name -B -c alignment -D $cpus use -B to align with barrier but only supported in C++20 or newer
swg-condition-variable ./multiswg_benchmark -a gap-affine-swg-multithread -i $file_name -I $answer_name -c alignment -D $cpus

Test dataset and program

The test dataset is stored at https://github.com/malabz/FORAlign_testcase. The test cases in paper is stored at https://github.com/malabz/FORAlign_testcase/blob/main/test-result/raw.tar.xz, the compiled program is stored at https://github.com/malabz/FORAlign_testcase/tree/main/wfa2-test-prog folder.

How to run our test program

The WFA2 program is only supported in Linux. For Windows user please install WSL first. WSL instructional video: 1 or 2 (Copyright belongs to the original work).

Use the compiled WFA2 program

Clone the FORAlign-testcase repository and test:

#1 Download
git clone https://github.com/malabz/FORAlign-testcase

#2 Open the folder
cd wfa2-test-prog

#3 Test
./multiswg_benchmark
./align_benchmark

Compile WFA2

Clone the FORAlign repository:

#1 Download
git clone https://github.com/malabz/FORAlign

#2 Open the folder and compile
cd FORAlign
make all THREADS=16

#3 Test
./bin/multiswg_benchmark
./bin/align_benchmark

For development

FORAlign work as a programming library. This section shows how to use the C/CPP APIs of FORAlign to take two sequences as input and perform pairwise sequence alignment. Basically, the library file libforalign.a and header file hirschberg.h are needed to make the Hirschberg algorithm in FORAlign library work in your program, and the library file libswg.a and header file parallel_swg.h make the SWG algorithm in FORAlign library work in your program.

A simple example for libforalign.a

First, include the FORAlign alignment headers.

#include "foralign/hirschberg.h"

Next, configure the function arugments. The hirschberg library uses gap affine penalties. Note that mismatch, gap open penalty, gap extension penalty and russian block size must be positive values, the match penalty should be 0.

/*
 M: match
 X: mismatch
 O: gap open
 E: gap extension
 table_size: russian table size
 threads: program threads
 */
int O, E, M, X, table_size, threads;
threads = 16; O = 6; E = 2; M = 0; X = 4; table_size = 200;

Finally, call the Hirschberg_API function. If you only want to get CIGAR string, call the hirschberg_cigar function.

char *seq1 = "ATCGTAT";
char *seq2 = "TTTTCTAAA";
sequence_t type = DNA;
char *comp_seq1 = NULL, *comp_seq2 = NULL;
hirschberg_API(seq1, seq2, strlen(seq1), strlen(seq2), O, E, M, X, table_size, type, threads, &comp_seq1, &comp_seq2, 0, 1, threads / 2);

See library_example/foralign.c for C API, library_example/foralign.cpp for CPP API. For example, to compile and run example files, you need to link against the FORAlign library (-lforalign) as follows:

make THREADS=16 # make the foralign library
mkdir -p bin
g++ -O3 library_example/foralign.cpp -pthread -L./lib -lforalign -o bin/foralign_cpp_example
# determine the link order
gcc -O3 library_example/foralign.c -o bin/foralign_c_example -L./lib -lforalign -pthread -lstdc++

./bin/foralign_cpp_example
./bin/foralign_c_example

A simple example for libswg.a

First, include the FORAlign alignment headers.

#include "swg/parallel_swg.h"

Next, configure the function arugments. The swg library uses gap affine penalties. The calcluate matrix should be allocated before call the function:

affine_penalties_t args; args.match = 0; args.mismatch = 4; args.gap_opening = 6; args.gap_extension = 2;
affine_matrix_t    mtx;
mtx.num_columns = len1; mtx.num_rows = len2;
mtx.columns = (affine_cell_t**)malloc(len1 + 1);
if(mtx.columns == nullptr) { fprintf(stderr, "Error: can not allocate. Program will exit.\n"); return 1; }
for(size_t i = 0; i < len2; ++ i)
{
    mtx.columns[i] = (affine_cell_t*)malloc(len2 + 1);
    if(mtx.columns[i] == nullptr) { fprintf(stderr, "Error: can not allocate. Program will exit.\n"); return 1; }
}

Finally, call the multithread_swg_compute_cv function.

char* seq1 = "TCTTTACTCGCGCGTTGGAGAAATACAATAGT";
char* seq2 = "TCTATACTGCGCGTTTGGAGAAATAAAATAGT";
int threads = 16;
multithread_swg_init(threads);
multithread_swg_compute_cv(threads, &mtx, &args, seq1, strlen(seq1), seq2, strlen(seq2));
multithread_swg_free();

Afterwards, we can use the library to display the alignment result (e.g., the pairwise alignment result).

char *comp_seq1 = NULL, *comp_seq2 = NULL;
multithread_swg_traceback(&mtx, &args, strlen(seq1), strlen(seq2), seq1, seq2, &comp_seq1, &comp_seq2);
printf("Alignment result: %s\n%s\n", comp_seq1, comp_seq2);

See library_example/swg.cpp for CPP API. For example, to compile and run example files, you need to link against the FORAlign SWG library (-lswg) as follows:

make THREADS=16 # make the swg and foralign library
mkdir -p bin
g++ -O3 library_example/swg.cpp -pthread -L./lib -lswg -o bin/swg_cpp_example -lforalign
# The function `read_2_seqs` is defined in foralign library

./bin/foralign_swg_cpp_example

Citation

Contacts

If you find any bug, welcome to contact us on the issues page or email us.

More tools and infomation can visit our github.

About

FORAlign: Accelerating gap-affine DNA pairwise sequence alignment using FOR-blocks based on FOur Russians approach with linear space complexity

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published