The installation and usage of BWA-MEM

1. About

BWA is a splice-unaware software package for mapping low-divergent sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM. BWA-MEM, which is the latest, is generally recommended for high-quality queries as it is faster and more accurate.

2. Installation and Usage

Clone the latest source from GitHub and compile it, then index the reference and align the reads:

git clone https://github.com/lh3/bwa.git
cd bwa; make
./bwa index ref.fa
./bwa mem ref.fa read-se.fq.gz | gzip -3 > aln-se.sam.gz
./bwa mem ref.fa read1.fq read2.fq | gzip -3 > aln-pe.sam.gz

2.1 Build genome index

#!/bin/bash -l
#SBATCH --account=b1171
#SBATCH --partition=b1171
# set the number of nodes
#SBATCH --nodes=1
#SBATCH --ntasks=1
##SBATCH --exclusive
#SBATCH --mem=30G
#SBATCH --chdir=/home/qgn1237/qgn1237/1_my_database/GRCh38_p13/bwa1_index
# set max wallclock time
#SBATCH --time=100:00:00
# mail alert at start, end and abortion of execution
#SBATCH --mail-type=ALL

# run the application
cd $SLURM_SUBMIT_DIR
/home/qgn1237/2_software/mambaforge/bin/mamba init
source ~/.bashrc
mamba activate visorenv
~/2_software/bwa/bwa index ../GRCh38.p13.genome.fa
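
Once the indexing job has finished, it can be worth confirming that the index is complete before submitting any mapping job. The sketch below assumes BWA 0.7.x, which writes five files (.amb, .ann, .bwt, .pac, .sa) next to the FASTA; `check_bwa_index` is a hypothetical helper name, not part of BWA.

```shell
# Sketch: verify that the five files written by `bwa index` exist and are
# non-empty. check_bwa_index is a hypothetical helper, not a BWA command.
check_bwa_index() {
    local ref="$1" ext missing=0
    for ext in amb ann bwt pac sa; do
        if [ ! -s "${ref}.${ext}" ]; then
            echo "missing or empty: ${ref}.${ext}" >&2
            missing=1
        fi
    done
    # Return 0 only when every index file was found.
    return "$missing"
}
```

A call such as `check_bwa_index ../GRCh38.p13.genome.fa && echo "index complete"` could be placed at the end of the indexing script.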

2.2 Running mapping job with BWA-MEM

For single-end reads (note that -t takes the thread count and -R takes the read group string):

bwa mem -t 4 -Y -M -R '@RG\tID:foo\tSM:bar' ref.fa reads.fq > mem-se.sam

For paired-end reads:

#!/bin/bash -l
#SBATCH --account=b1042
#SBATCH --partition=genomics
# set the number of nodes
#SBATCH --nodes=1
#SBATCH --ntasks=4
##SBATCH --exclusive
#SBATCH --mem=100G
#SBATCH --chdir=/home/qgn1237/working/1_reads_mapping/bwa-mem
# set max wallclock time
#SBATCH --time=36:00:00
# mail alert at start, end and abortion of execution
#SBATCH --mail-type=ALL

# run the application
cd $SLURM_SUBMIT_DIR
/home/qgn1237/2_software/mambaforge/bin/mamba init
source ~/.bashrc
mamba activate visorenv
/home/qgn1237/2_software/bwa/bwa mem -t 4 -Y -M \
  -R '@RG\tID:SRR7346979\tPL:illumina\tLB:library\tSM:SAMN09428901' \
  /projects/b1171/qgn1237/1_my_database/GRCh38_p13/GRCh38.p13.genome.fa \
  /projects/b1171/qgn1237/2_raw_data/SKBR3/illumina_250_SRR7346979/SRR7346979/SRR7346979_1.fastq \
  /projects/b1171/qgn1237/2_raw_data/SKBR3/illumina_250_SRR7346979/SRR7346979/SRR7346979_2.fastq \
  | samtools sort -@ 12 -O BAM -o SKBR3_NGS_bwa.bam \
  && samtools index SKBR3_NGS_bwa.bam SKBR3_NGS_bwa.bai

Options used above:

-t INT  Number of threads.

-M  Mark shorter split hits as secondary (for Picard compatibility).

-Y  Use soft clipping for supplementary alignments.

-R STR  Complete read group header line. '\t' can be used in STR and will be converted to a TAB in the output SAM. The read group ID will be attached to every read in the output. An example is '@RG\tID:foo\tSM:bar'.
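
When many samples are mapped with the same script, the -R string is easier to keep consistent if it is assembled from variables. The following is a minimal sketch; `make_read_group` is a hypothetical helper, and the default values for PL and LB simply mirror the paired-end command above.

```shell
# Sketch: build the -R argument from shell variables.
# make_read_group is a hypothetical helper. The backslash-t sequences are
# kept literal on purpose, because bwa converts them to TABs itself.
make_read_group() {
    local id="$1" sample="$2" platform="${3:-illumina}" library="${4:-library}"
    printf '@RG\\tID:%s\\tPL:%s\\tLB:%s\\tSM:%s' \
        "$id" "$platform" "$library" "$sample"
}

rg=$(make_read_group SRR7346979 SAMN09428901)
# printf is used instead of echo so the backslashes are printed unchanged.
printf '%s\n' "$rg"
```

The result can then be passed as `-R "$rg"` on the `bwa mem` command line.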

2.3 Output files

BWA outputs the final alignment in the SAM (Sequence Alignment/Map) format. Each line consists of:

| Col | Field | Description |
|-----|-------|-------------|
| 1 | QNAME | Query (pair) NAME |
| 2 | FLAG | bitwise FLAG |
| 3 | RNAME | Reference sequence NAME |
| 4 | POS | 1-based leftmost POSition/coordinate of clipped sequence |
| 5 | MAPQ | MAPping Quality (Phred-scaled) |
| 6 | CIGAR | extended CIGAR string |
| 7 | MRNM | Mate Reference sequence NaMe ('=' if same as RNAME) |
| 8 | MPOS | 1-based Mate POSition |
| 9 | ISIZE | Inferred insert SIZE |
| 10 | SEQ | query SEQuence on the same strand as the reference |
| 11 | QUAL | query QUALity (ASCII-33 gives the Phred base quality) |
| 12 | OPT | variable OPTional fields in the format TAG:VTYPE:VALUE |
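
Because the fields are TAB-separated, the first six columns can be pulled out with awk for a quick look at an alignment. This is only a sketch: `sam_core_fields` is a hypothetical helper name, and the record piped into it is made up for illustration.

```shell
# Sketch: print QNAME, FLAG, RNAME, POS, MAPQ and CIGAR for each alignment
# line of a SAM stream, skipping header lines (which start with '@').
# sam_core_fields is a hypothetical helper name.
sam_core_fields() {
    awk -F'\t' -v OFS='\t' '!/^@/ { print $1, $2, $3, $4, $5, $6 }'
}

# Made-up header line and record, used only for illustration.
printf '@SQ\tSN:chr1\tLN:1000\nread1\t99\tchr1\t100\t60\t5M\t=\t200\t105\tACGTA\tIIIII\tNM:i:0\n' \
    | sam_core_fields
```

For a real file, `sam_core_fields < mem-se.sam | head` would show the first few alignments.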

Each bit in the FLAG field is defined as:

| Chr | Flag | Description |
|-----|------|-------------|
| p | 0x0001 | the read is paired in sequencing |
| P | 0x0002 | the read is mapped in a proper pair |
| u | 0x0004 | the query sequence itself is unmapped |
| U | 0x0008 | the mate is unmapped |
| r | 0x0010 | strand of the query (1 for reverse) |
| R | 0x0020 | strand of the mate |
| 1 | 0x0040 | the read is the first read in a pair |
| 2 | 0x0080 | the read is the second read in a pair |
| s | 0x0100 | the alignment is not primary |
| f | 0x0200 | QC failure |
| d | 0x0400 | optical or PCR duplicate |
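
A FLAG value is the sum of the bits that are set, so it can be decoded with a bitwise AND against each mask. The sketch below prints the one-character codes listed above for every bit that is set; `decode_flag` is a hypothetical helper, and the masks are written in decimal.

```shell
# Sketch: decode a SAM FLAG into the one-character codes of the bits set.
# decode_flag is a hypothetical helper. Each entry is "decimal mask:code",
# e.g. 16:r corresponds to 0x0010 (query on the reverse strand).
decode_flag() {
    local flag="$1" out="" pair
    for pair in 1:p 2:P 4:u 8:U 16:r 32:R 64:1 128:2 256:s 512:f 1024:d; do
        # ${pair%%:*} is the mask, ${pair##*:} is the code.
        if [ $(( flag & ${pair%%:*} )) -ne 0 ]; then
            out="${out}${pair##*:}"
        fi
    done
    printf '%s\n' "$out"
}

decode_flag 99   # 99 = 0x1 + 0x2 + 0x20 + 0x40, prints pPR1
```

Here 99 describes a paired read, mapped in a proper pair, whose mate is on the reverse strand, and which is the first read of the pair.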

BWA generates the following optional fields. Tags starting with ‘X’ are specific to BWA.

| Tag | Meaning |
|-----|---------|
| NM | Edit distance |
| MD | Mismatching positions/bases |
| AS | Alignment score |
| BC | Barcode sequence |
| X0 | Number of best hits |
| X1 | Number of suboptimal hits found by BWA |
| XN | Number of ambiguous bases in the reference |
| XM | Number of mismatches in the alignment |
| XO | Number of gap opens |
| XG | Number of gap extensions |
| XT | Type: Unique/Repeat/N/Mate-sw |
| XA | Alternative hits; format: (chr,pos,CIGAR,NM;)* |
| XS | Suboptimal alignment score |
| XF | Support from forward/reverse alignment |
| XE | Number of supporting seeds |

Note that XO and XG are generated by BWT search, while the CIGAR string is generated by Smith-Waterman alignment. These two tags may therefore be inconsistent with the CIGAR string. This is not a bug.
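
Optional fields start at column 12 and follow the TAG:VTYPE:VALUE layout, so a single tag can be looked up by scanning those columns. The following is a minimal sketch; `sam_tag` is a hypothetical helper, the record is made up, and reads without the requested tag are reported as NA.

```shell
# Sketch: print QNAME and the value of one optional tag for each alignment.
# sam_tag is a hypothetical helper; $1 is the two-letter tag, stdin is SAM.
sam_tag() {
    awk -F'\t' -v tag="$1" '
        !/^@/ {
            val = "NA"                      # reported when the tag is absent
            for (i = 12; i <= NF; i++) {
                # Fields look like NM:i:0, so the value starts at character 6.
                if (substr($i, 1, 3) == tag ":") {
                    val = substr($i, 6)
                    break
                }
            }
            print $1 "\t" val
        }'
}

# Made-up record, used only for illustration.
printf 'read1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:1\tAS:i:4\n' \
    | sam_tag NM
```

Piping the output through `cut -f2 | sort -n | uniq -c` would give a rough distribution of edit distances.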

3. Citation

Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754-1760. [PMID: 19451168]

Li H. and Durbin R. (2010) Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 26, 589-595. [PMID: 20080505]

Li H. (2012) Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly. Bioinformatics, 28, 1838-1844. [PMID: 22569178]