-
Notifications
You must be signed in to change notification settings - Fork 3
Home
This page presents a way to analyze SARS-CoV-2 GISAID FASTA genomes until November 30, 2020, by using a two-step pipeline. This pipeline can be useful when users are dealing with more than 100000 viral genome sequences.
- requires miniconda, python2.7 and python>=3. To install miniconda, see: https://docs.conda.io/en/latest/miniconda.html
git clone https://github.com/cfarkas/SARS-CoV-2-freebayes.git # clone repository
cd SARS-CoV-2-freebayes
conda config --add channels conda-forge # add conda-forge channel (if you haven't already done so)
conda config --add channels bioconda # add bioconda channel (if you haven't already done so)
conda env update --file environment.yml # install required programs
conda activate SARS-CoV-2-freebayes # activate SARS-CoV-2-freebayes enviroment
bash makefile.sh # make & install
sudo cp ./bin/* /usr/local/bin/ # Copy binaries into /usr/local/bin/ (require sudo privileges)
After these steps, a conda enviroment called SARS-CoV-2-freebayes can be managed as follows:
# To activate this environment, use
#
# $ conda activate SARS-CoV-2-freebayes
#
# To deactivate an active environment, use
#
# $ conda deactivate
By activating the enviroment, all scripts in the SARS-CoV-2-freebayes repository can be executed, without further installations.
- Finally, we need to install a vcftools version that includes the --haploid flag, as follows:
git clone https://github.com/cfarkas/vcftools.git # Forked Julien Y. Dutheil version: https://github.com/jydu/vcftools
cd vcftools
./autogen.sh
./configure
make # make
sudo make install # requires sudo privileges
1) Installing minimap2 aligner (for install details, please see: https://github.com/lh3/minimap2)
### Installing minimap2
git clone https://github.com/lh3/minimap2
# build minimap2
cd minimap2 && make
# with sudo privileges
sudo cp minimap2 /usr/local/bin/
2) Installing fastp: An ultra-fast all-in-one FASTQ preprocessor (for details, please see: https://github.com/OpenGene/fastp)
git clone https://github.com/OpenGene/fastp.git
# build fastp
cd fastp
make
# with sudo privileges
sudo cp fastp /usr/local/bin/
Clone Freebayes in a specific directory:
git clone --recursive git://github.com/ekg/freebayes.git
# If this line not work, try:
git config --global url.https://github.com/.insteadOf git://github.com/
git clone --recursive git://github.com/ekg/freebayes.git
#Enter Freebayes directory and make:
cd freebayes
make
sudo make install
sudo cp scripts/* /usr/local/bin/
#To check installation, type in terminal:
freebayes
bamleftalign
git config --global url.https://github.com/.insteadOf git://github.com/
git clone --recursive git://github.com/vcflib/vcflib.git
#Enter vcflib directory and make
cd vcflib
make # Needs CMake compiler, with sudo privileges do: sudo apt-get install cmake
cp scripts/* /usr/local/bin/
cp bin/* /usr/local/bin/
(Old samtools version can also work). Users need to install version up to date of these three packages. Users can first install htslib v1.9 and then samtools with bcftools v1.9, respectively. For downloading these packages, see http://www.htslib.org/download/). The latter can be accomplished by downloading the three packages, decompressing it, and doing the following:
wget https://github.com/samtools/htslib/releases/download/1.10.2/htslib-1.10.2.tar.bz2
bzip2 -d htslib-1.10.2.tar.bz2
tar -xvf htslib-1.10.2.tar
rm htslib-1.10.2.tar
cd htslib-1.10.2 # and similarly for samtools
sudo ./configure --prefix=/usr/local/bin
sudo make
sudo make install
# this step is only for samtools
sudo cp samtools /usr/local/bin/
# Similarly as htslib, samtools and bcftools can be downloaded as follows:
wget https://github.com/samtools/samtools/releases/download/1.10/samtools-1.10.tar.bz2
wget https://github.com/samtools/bcftools/releases/download/1.10.2/bcftools-1.10.2.tar.bz2
Then in a terminal type
samtools
to check 1.10 version (using htslib v1.10)
### Installing SRA toolkit from ncbi
wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.9.6/sratoolkit.2.9.6-ubuntu64.tar.gz
gunzip sratoolkit.2.9.6-ubuntu64.tar.gz
tar -xvf sratoolkit.2.9.6-ubuntu64.tar
cp sratoolkit.2.9.6-ubuntu64/bin/fastq-dump /usr/local/bin/
cp sratoolkit.2.9.6-ubuntu64/bin/prefetch /usr/local/bin/
pip install jacquard # elemental
pip install pyfasta # elemental
pip install pyfaidx # elemental
Note: pip3 requires the use of python >= 3
SeqKit - a cross-platform and ultrafast toolkit for FASTA/Q file manipulation (https://bioinf.shenwei.me/seqkit/) can be installed from repository as follows:
wget https://github.com/shenwei356/seqkit/releases/download/v0.12.1/seqkit_linux_386.tar.gz
gunzip seqkit_linux_386.tar.gz
tar -xvf seqkit_linux_386.tar
sudo cp seqkit /usr/local/bin/
We employed the version of Julien Y. Dutheil https://github.com/jydu/vcftools that includes the --haploid flag. Safe install can be achieved with root (sudo -i) as follows:
### Installing vcftools
sudo -i
cd /home/user/ # go to home directory or another directory
sudo apt-get install libz-dev zlib1g-dev # zlib requirements in Ubuntu
git clone https://github.com/jydu/vcftools.git
cd vcftools/
./autogen.sh
export PERL5LIB=/path/to/your/vcftools-directory/src/perl/
./configure
make
make install
exit
R # Open R
install.packages("ggplot2")
library(ggplot2)
install.packages("ggrepel")
library(ggrepel)
install.packages("vcfR")
library(vcfR)
- FASTA GISAID sequences are increasingly growing since the beginning of the pandemic, and it is preferable to process them by chunks of less than 100000 sequences. Users can obtain a single VCF with calculated viral frequencies in two steps (1 and 2) after downloading this repository, install it and process the user-provided GISAID sequences.
- If users have GISAID sequences until November 30, 2020 (>220k sequences), these commands will help to process all of them into a single VCF.
- Assuming binaries are in
/usr/local/bin
and users previously runnedsamtools faidx covid19-refseq.fasta
:
# Donwload and split GISAID genomes (i.e.: GISAID.fasta). We will split all the data in three pieces with pyfasta (assuming <300000 sequences).
pyfasta split -n 3 GISAID.fasta && gzip GISAID.fasta && rm *.fasta.flat *.fasta.gdx
# (1) Processing fasta chunks using 10 threads. Provide full path to covid19-refseq.fasta
ulimit -n 1000000 && ulimit -s 299999 # increase stack size and open file limit, see README_ulimit for details.
for fasta in *.fasta; do
SARS-CoV-2-processing-fasta -f ${fasta} -g /full/path/to/SARS-CoV-2-freebayes/covid19-refseq.fasta -t 10
done
# (4) Merge variants. Copy VCF files in the processed folders and execute SARS-CoV-2-merge-variants
mkdir all_variants
for i in ./GISAID.0/*vcf; do cp "$i" ./all_variants/; done
for i in ./GISAID.1/*vcf; do cp "$i" ./all_variants/; done
for i in ./GISAID.2/*vcf; do cp "$i" ./all_variants/; done
cd all_variants
ulimit -n 1000000 && ulimit -s 299999 # increase stack size and open file limit, see README_ulimit for details.
SARS-CoV-2-merge-variants -g /full/path/to/SARS-CoV-2-freebayes/covid19-refseq.fasta
- Merging 229116 GISAID genomes in a single VCF takes ~48 hrs using a single core.
- The output (merged.GISAID.AF.vcf) can be processed for pi-tajima analysis as presented here: https://github.com/cfarkas/SARS-CoV-2-freebayes/blob/master/README.md#iii-nucleotide-diversity-and-tajimas-d-test-calculation-with-pi-tajima
Calculate number of variants per genome. Inside all_variants folder, do:
ulimit -n 1000000 && ulimit -s 299999
{
vcf= ls -1 EPI_*.vcf
for vcf in EPI_*.vcf; do grep -P 'NC_045512.2\t' ${vcf} -c
done
#
} | tee logfile_variants_GISAID_freebayes
#
grep "EPI_ISL_" logfile_variants_GISAID_freebayes > vcf_files
grep -v "EPI_ISL_" logfile_variants_GISAID_freebayes > variants_per_sample
paste vcf_files variants_per_sample > logfile_variants_GISAID
rm vcf_files variants_per_sample
sed -i 's/.fa.left.vcf//'g logfile_variants_GISAID
rm logfile_variants_GISAID_freebayes
logfile_variants_GISAID file contains the GISAID accession along with the number of detected variants.