Skip to content
cfarkas edited this page Mar 30, 2021 · 92 revisions

Welcome to the SARS-CoV-2-freebayes wiki!

This page presents a way to analyze SARS-CoV-2 GISAID FASTA genomes until November 30, 2020, by using a two-step pipeline. This pipeline can be useful when users are dealing with more than 100000 viral genome sequences.

Installation Requirements:

Option 1: Via conda/pip (recommended)

git clone https://github.com/cfarkas/SARS-CoV-2-freebayes.git  # clone repository
cd SARS-CoV-2-freebayes
conda config --add channels conda-forge                        # add conda-forge channel (if you haven't already done so)
conda config --add channels bioconda                           # add bioconda channel (if you haven't already done so)
conda env update --file environment.yml                        # install required programs 
conda activate SARS-CoV-2-freebayes                            # activate SARS-CoV-2-freebayes enviroment 
bash makefile.sh                                               # make & install
sudo cp ./bin/* /usr/local/bin/                                # Copy binaries into /usr/local/bin/ (require sudo privileges)

After these steps, a conda enviroment called SARS-CoV-2-freebayes can be managed as follows:

# To activate this environment, use
#
#     $ conda activate SARS-CoV-2-freebayes
#
# To deactivate an active environment, use
#
#     $ conda deactivate

By activating the enviroment, all scripts in the SARS-CoV-2-freebayes repository can be executed, without further installations.

  • Finally, we need to install a vcftools version that includes the --haploid flag, as follows:
git clone https://github.com/cfarkas/vcftools.git       # Forked Julien Y. Dutheil version: https://github.com/jydu/vcftools
cd vcftools
./autogen.sh
./configure
make                                                    # make
sudo make install                                       # requires sudo privileges

Option 2: Without using conda, program by program:

1) Installing minimap2 aligner (for install details, please see: https://github.com/lh3/minimap2)

### Installing minimap2
git clone https://github.com/lh3/minimap2
# build minimap2
cd minimap2 && make
# with sudo privileges
sudo cp minimap2 /usr/local/bin/

2) Installing fastp: An ultra-fast all-in-one FASTQ preprocessor (for details, please see: https://github.com/OpenGene/fastp)

git clone https://github.com/OpenGene/fastp.git
# build fastp
cd fastp
make
# with sudo privileges
sudo cp fastp /usr/local/bin/

3) Obtaining and installing Freebayes:

Clone Freebayes in a specific directory:

git clone --recursive git://github.com/ekg/freebayes.git

# If this line not work, try:

git config --global url.https://github.com/.insteadOf git://github.com/
git clone --recursive git://github.com/ekg/freebayes.git

#Enter Freebayes directory and make:
cd freebayes
make
sudo make install
sudo cp scripts/* /usr/local/bin/

#To check installation, type in terminal:
freebayes
bamleftalign

4) Installing vcflib

git config --global url.https://github.com/.insteadOf git://github.com/
git clone --recursive git://github.com/vcflib/vcflib.git

#Enter vcflib directory and make
cd vcflib
make   # Needs CMake compiler, with sudo privileges do: sudo apt-get install cmake
cp scripts/* /usr/local/bin/
cp bin/* /usr/local/bin/

5) Obtaining and installing up-to-date SAMtools with htslib (version >= 1.9)

(Old samtools version can also work). Users need to install version up to date of these three packages. Users can first install htslib v1.9 and then samtools with bcftools v1.9, respectively. For downloading these packages, see http://www.htslib.org/download/). The latter can be accomplished by downloading the three packages, decompressing it, and doing the following:

wget https://github.com/samtools/htslib/releases/download/1.10.2/htslib-1.10.2.tar.bz2
bzip2 -d htslib-1.10.2.tar.bz2
tar -xvf htslib-1.10.2.tar
rm htslib-1.10.2.tar
cd htslib-1.10.2    # and similarly for samtools
sudo ./configure --prefix=/usr/local/bin
sudo make
sudo make install
# this step is only for samtools
sudo cp samtools /usr/local/bin/

# Similarly as htslib, samtools and bcftools can be downloaded as follows:

wget https://github.com/samtools/samtools/releases/download/1.10/samtools-1.10.tar.bz2
wget https://github.com/samtools/bcftools/releases/download/1.10.2/bcftools-1.10.2.tar.bz2

Then in a terminal type

samtools

to check 1.10 version (using htslib v1.10)

6) Obtaining SRA toolkit from ncbi (for downloading reads from SRA archive).

### Installing SRA toolkit from ncbi
wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.9.6/sratoolkit.2.9.6-ubuntu64.tar.gz
gunzip sratoolkit.2.9.6-ubuntu64.tar.gz
tar -xvf sratoolkit.2.9.6-ubuntu64.tar
cp sratoolkit.2.9.6-ubuntu64/bin/fastq-dump /usr/local/bin/
cp sratoolkit.2.9.6-ubuntu64/bin/prefetch /usr/local/bin/

7) Installing jaqcuard, pyfasta, pyfaidx, inStrain and vcfstats python libraries

pip install jacquard    # elemental
pip install pyfasta     # elemental
pip install pyfaidx     # elemental

Note: pip3 requires the use of python >= 3

8) Installing seqkit

SeqKit - a cross-platform and ultrafast toolkit for FASTA/Q file manipulation (https://bioinf.shenwei.me/seqkit/) can be installed from repository as follows:

wget https://github.com/shenwei356/seqkit/releases/download/v0.12.1/seqkit_linux_386.tar.gz
gunzip seqkit_linux_386.tar.gz
tar -xvf seqkit_linux_386.tar
sudo cp seqkit /usr/local/bin/

9) Obtaining and Installing VCFtools

We employed the version of Julien Y. Dutheil https://github.com/jydu/vcftools that includes the --haploid flag. Safe install can be achieved with root (sudo -i) as follows:

### Installing vcftools 
sudo -i
cd /home/user/ # go to home directory or another directory
sudo apt-get install libz-dev zlib1g-dev  # zlib requirements in Ubuntu
git clone https://github.com/jydu/vcftools.git
cd vcftools/
./autogen.sh
export PERL5LIB=/path/to/your/vcftools-directory/src/perl/ 
./configure
make
make install
exit

10) Install ggplot2, ggrepel and vcfR R libraries

R   # Open R 
install.packages("ggplot2")
library(ggplot2)
install.packages("ggrepel")
library(ggrepel)
install.packages("vcfR")
library(vcfR)

I) Two-Step Full pipeline: Processing all GISAID sequences until November 30, 2020

  • FASTA GISAID sequences are increasingly growing since the beginning of the pandemic, and it is preferable to process them by chunks of less than 100000 sequences. Users can obtain a single VCF with calculated viral frequencies in two steps (1 and 2) after downloading this repository, install it and process the user-provided GISAID sequences.
  • If users have GISAID sequences until November 30, 2020 (>220k sequences), these commands will help to process all of them into a single VCF.
  • Assuming binaries are in /usr/local/bin and users previously runned samtools faidx covid19-refseq.fasta:
# Donwload and split GISAID genomes (i.e.: GISAID.fasta). We will split all the data in three pieces with pyfasta (assuming <300000 sequences).   
pyfasta split -n 3 GISAID.fasta && gzip GISAID.fasta && rm *.fasta.flat *.fasta.gdx

# (1) Processing fasta chunks using 10 threads. Provide full path to covid19-refseq.fasta 
ulimit -n 1000000 && ulimit -s 299999  # increase stack size and open file limit, see README_ulimit for details.
for fasta in *.fasta; do
SARS-CoV-2-processing-fasta -f ${fasta} -g /full/path/to/SARS-CoV-2-freebayes/covid19-refseq.fasta -t 10
done

# (4) Merge variants. Copy VCF files in the processed folders and execute SARS-CoV-2-merge-variants
mkdir all_variants
for i in ./GISAID.0/*vcf; do cp "$i" ./all_variants/; done
for i in ./GISAID.1/*vcf; do cp "$i" ./all_variants/; done
for i in ./GISAID.2/*vcf; do cp "$i" ./all_variants/; done
cd all_variants
ulimit -n 1000000 && ulimit -s 299999  # increase stack size and open file limit, see README_ulimit for details.
SARS-CoV-2-merge-variants -g /full/path/to/SARS-CoV-2-freebayes/covid19-refseq.fasta

Number of variants per genome

Calculate number of variants per genome. Inside all_variants folder, do:

ulimit -n 1000000 && ulimit -s 299999 
{
vcf= ls -1 EPI_*.vcf
for vcf in EPI_*.vcf; do grep -P 'NC_045512.2\t' ${vcf} -c
done
#
} | tee logfile_variants_GISAID_freebayes
#
grep "EPI_ISL_" logfile_variants_GISAID_freebayes > vcf_files
grep -v "EPI_ISL_" logfile_variants_GISAID_freebayes > variants_per_sample
paste vcf_files variants_per_sample > logfile_variants_GISAID
rm vcf_files variants_per_sample
sed -i 's/.fa.left.vcf//'g logfile_variants_GISAID
rm logfile_variants_GISAID_freebayes

logfile_variants_GISAID file contains the GISAID accession along with the number of detected variants.