Skip to content

An example of real assembly

Dinghua Li edited this page Mar 31, 2019 · 6 revisions

We use a human gut dataset (SRR341725) in this tutorial.

1. Download MEGAHIT

https://github.com/voutcn/megahit/releases/download/v1.2.1-beta/MEGAHIT-1.2.1-beta-Linux-static.tar.gz
tar zvxf MEGAHIT-1.2.1-beta-Linux-static
cd MEGAHIT-1.2.1-beta-Linux-static/bin/
./megahit --test  # run on a toy dataset
cd ../..

2. Download the data

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR341/SRR341725/SRR341725_[12].fastq.gz

3. Assemble

MEGAHIT-1.2.1-beta-Linux-static/bin/megahit -1 SRR341725_1.fastq.gz -2 SRR341725_2.fastq.gz -o SRR341725.megahit_asm

When the assembly finishes, you will see a screen message like:

--- [Sat Aug 15 15:01:37 2015] Merging to output final contigs ---

--- [STAT] 70776 contigs, total 117999745 bp, min 200 bp, max 217183 bp, avg 1667 bp, N50 3897 bp

--- [Sat Aug 15 15:01:38 2015] ALL DONE. Time elapsed: 1386.712479 seconds ---

Assembled contigs are located in SRR341725.megahit_asm/final.contigs.fa.

Tips:

  • Single-end reads could be specified by -r
  • Interleaved PE reads could be specified by --12
  • -r, --12, -1 and -2 could be specified for multiple times to assemble multiple libraries; be careful that files in -1 and -2 should be paired and in order

4. Calculate contig coverage and extract unassembled reads

Here we only demonstrate how to calculate the read coverage per contig and extract unassembled reads using BBMap and samtools. You may pass the contigs your own subsequent analyses.

4.1 Download & install BBMap and samtools:

# BBmap
wget http://downloads.sourceforge.net/project/bbmap/BBMap_35.34.tar.gz
tar zvxf BBMap_35.34.tar.gz
# samtools
wget https://github.com/samtools/samtools/releases/download/1.2/samtools-1.2.tar.bz2
tar xvjf samtools-1.2.tar.bz2
cd samtools-1.2 && make -j && cd ..

4.2 Align reads with bbwrap.sh:

bbmap/bbwrap.sh ref=SRR341725.megahit_asm/final.contigs.fa in=SRR341725_1.fastq.gz in2=SRR341725_2.fastq.gz out=aln.sam.gz kfilter=22 subfilter=15 maxindel=80

Tips:

  • bbwrap also accepts multiple read libraries. For example two PE libraries and one SE library can be aligned with in=pe1_1.fq,pe2_1.fq,se.fq in2=pe1_2.fq,pe2_2.fq
  • kfilter is the minimum length of consecutive exact matches for an alignment, which is set to k_min + 1
  • maxindel limits the indel size
  • subfilter limits the number of mismatches for an alignment

4.3 Output per contig coverage to cov.txt with pileup.sh:

bbmap/pileup.sh in=aln.sam.gz out=cov.txt

4.4 Extract unmapped reads (SE to unmapped.se.fq and PE to unmapped.pe.fq):

samtools-1.2/samtools view -u -f4 aln.sam.gz | samtools-1.2/samtools bam2fq -s unmapped.se.fq - > unmapped.pe.fq