Objectives
-
Prepare bam files for calling SNPs.
In a previous practical, you used Kallisto to align the RNA-Seq reads from spike material of two wheat cultivars, Chinese Spring and Azhurnaya and obtained a bam file. You can use these files.
-
Learn how to call SNPs with Freebayes
-
Become familiar with VCF format files
-
Learn some criteria for filtering VCF files
Step 1 - Remove duplicate reads from the bamfiles with Samtools
Inputs
-
Location: /var/scratch/baileyp/Practical_4_small_wheat_dataset/sorted_bam_from_HISAT2_mapping
-
bam files for Chinese Spring: DF5 and DF19 (two replicate samples)
-
bam files for Azhurnaya: 119B and 120A (two replicate samples
-
Create a working directory for the practical
Package to load
module load samtools/1.8
-
It is important to remove duplicate reads before calling SNPs if PCR was used in the preparation of your sequencing library. We will use Samtools to do this step. It is a four step process:
-
Take the DF5 bam file and sort it in read name order (this step can be omitted if the file is already name ordered, but it is not in this case!):
samtools sort -n -o CS_DF5_st_namesort.bam /var/scratch/baileyp/Practical_4_small_wheat_dataset/sorted_bam_from_HISAT2_mapping/sorted_DF5_P.bam
- Add MC and ms tags to the bam records (used by the markdup command):
samtools fixmate -m CS_DF5_st_namesort.bam CS_DF5_st_fixmate.bam
- Now sort the file again by position (field 4 of the sam record)
samtools sort -o CS_DF5_st_positionsort.bam CS_DF5_st_fixmate.bam
- Finally, we are ready to mark and remove duplicate reads:
samtools markdup -s -r CS_DF5_st_positionsort.bam CS_DF5_st_rmdup.bam
-
Question: From the output printed to the screen or your log files, what is the percent of duplicate reads?
-
View the output with samtools view:
samtools view CS_DF5_st_rmdup.bam | head
- can you see ther information that has been added to the end of bam records (ms and MC fields)
- These steps need to be repeated for the other three samples. To make this slightly easier to do, we can set two variables that contain the sample name, then run the first command again using those variables that are set up to process the next sample, DF19:
sample=DF19 # Other sample names: 119B, 120A
samplePrefix=CS_DF19 # Other sample prefixes: Az_119B, Az_120A
samtools sort -n -o ${samplePrefix}_st_namesort.bam /var/scratch/baileyp/Practical_4_small_wheat_dataset/sorted_bam_from_HISAT2_mapping/sorted_${sample}_P.bam
-
After running the above command, reset the variables to process the other two samples with the same command. Use the up arrow key to retrieve the command.
-
Note: If you get confused using the variables, you can perform each step manually for each sample, as described that the top of this step for the first sample, CS_DF5.
-
Note: You can place an ampersand character ('&') at the end of any command to make it run in the background. Then you can carry on using the command line for something else. When the program has finished you will get a message printed to the screen - e.g.:
[1]+ Done name_of_command
-
Now perform the reamining steps of de-duplication process. An example is given below for the DF19 sample and running the fixmate step. Remember, you need to set the variable for each sample and retrieve the command you want to run using the up arrow key. Suggested sample prefix names: CS_DF19, Az_119B, Az_120A
samplePrefix=CS_119B
samtools fixmate -m ${samplePrefix}_st_namesort.bam ${samplePrefix}_st_fixmate.bam
Step 2 - Merge the bam files and add read group information
- Merge the bamfiles for all samples. The -r flag attaches an identifier (read group) to the bam record, based on the file name:
samtools merge -r merged.bam CS_DF5_st_rmdup.bam CS_DF19_st_rmdup.bam Az_119B_st_rmdup.bam Az_120A_st_rmdup.bam
- Next, you need to add a short recognisable name for each sample into the read group information. To do this you first need to extract the header from the bam file:
samtools view -H merged.bam > merged_header.txt
- Now add the read group info at the bottom of the header file with your favourite editor or use vi and type:
vi merged_header.txt
-
Go to the last line in the file.
-
Type a 'G'.
-
Press 'i' to insert text.
-
Copy and paste these lines at the bottom of the file (the spaces need to be tabs!). Check that the "ID" tag is the same as the bam file name (minus the .bam ending). The SM (sample) tag is used as the sample name by Freebayes:
@RG ID:CS_DF5_st_rmdup SM:CS_DF5 LB:LIB1 PL:Illumina
@RG ID:CS_DF19_st_rmdup SM:CS_DF19 LB:LIB2 PL:Illumina
@RG ID:Az_119B_st_rmdup SM:Az_119B LB:LIB3 PL:Illumina
@RG ID:Az_120A_st_rmdup SM:Az_120A LB:LIB4 PL:Illumina
-
Then press 'shift :', then type wq to save and quit the vi editor.
-
Tip: Other useful commands in the vi editor escape mode: dd = delete current line; gg goes to the top of the file again
-
Next reheader the bam file:
samtools reheader merged_header.txt merged.bam > merged_readgroups.bam
-
Question: How else could read group info be added to the samples?
-
Check that the header was applied properly:
samtools view -H merged_readgroups.bam
- Now index the bam file. The index can be used for quick retrieval of records by downstream applications:
samtools index merged_readgroups.bam
Step 3: Call SNPs with FreeBayes
-
Paper reference - it's very technical 🙈!
Inputs
-
merged_readgroups.bam prepared above in step 2.
-
Genome reference: /var/scratch/baileyp/Practical_4_RicardosExample/References/transformed_coordinates.fasta
Package to load
module load freebayes/1.0.2
- We can now run FreeBayes on the alignments of all 4 samples simultaneously and generate a VCF file:
freebayes \
--fasta-reference /var/scratch/baileyp/Practical_4_RicardosExample/References/transformed_coordinates.fasta \
merged_readgroups.bam \
> all_snps.vcf &
-
Note the ' \ ' characters. This allows the command to straddle across multiple lines. This makes commands clearer to visualise when many flags are being used.
-
Freebayes will take about 30 - 35 minutes to run. To speed up the analysis, it is possible to call SNPs on a particular part of the reference. The program will then run more quickly. To demonstrate this just run the program for wheat chromosome 1A using the -r flag. In the reference fasta file, the name of the chromosome is "chr1A":
freebayes \
-r chr1A \
--fasta-reference /var/scratch/baileyp/Practical_4_RicardosExample/References/transformed_coordinates.fasta \
merged_readgroups.bam \
> chr1A_snps.vcf &
-
If you had a large number of samples you could analyse each chromosome arm separately but run them in parallel at the same time. This could save a lot of time.
-
Question: How would you specify to only call SNPs on particular parts of a chromosome and also multiple parts of multiple chromosomes? Take a Look in the command line help and see if you can spot the flags to use:
freebayes -h | more
-
Question: How would you find out the name/identifier of the chromsomes to specify in these flags?
-
There are some filtering parameters avaialable in FreeBayes but it might be better to filter the SNPs after collecting all the raw calls - examples:
- -C --min-alternate-count (default = 2) 6 might be a better minimum to tolerate
- -i --no-indel Ignore insertion and deletion alleles
Step 4: Understand the VCF format and filter the VCF file
-
The VCF format looks very complicated! Here are three links that explain the format. (You may find one easier to understand than the others):
https://en.wikipedia.org/wiki/Variant_Call_Format
What is a VCF and how should I interpret it?
-
The aim here is to introduce you to the essential syntax and fields. Let's do some filtering with VCFtools:
-
VCFtools manual On the command line, type the following to see the program manual containig the options:
man vcftools
-
Package to load
module load vcftools/0.1.15
-
We will filter the SNPs on the following basic criteria (vcftools flags in brackets):
- minimum quality (--minQ) 20
- depth for an individual sample (--minDP) 6
- remove indels (--remove-indels)
- allow no missing genotypes (max-missing 1)
-
Add these flags to the command below, one after the other and keep a record of the number of SNPs you find after adding each additional filter.
vcftools --vcf all_snps.vcf \
[ add your filters here ]
-
Try also adding these parameters to obtain the calculations for the transition (Ts)/tranversion ratio (Tv): --TsTv-summary --TsTv-by-count
-
You will probably get an error but it's very simple to correct.
-
Is the ratio Ts/Tv as you would expect or too high, too low?
-
Finally, print out the filtered SNPs to a file. You need to add the recode flag(s) before a filtered vcf file will be produced.
vcftools --vcf all_snps.vcf \
--recode --recode-INFO-all \
--minQ 20 \
--minDP 6 \
--remove-indels \
--max-missing 1 \
--out all_minQ20_minDP6_noindels_maxm1
- Try adding up the values you find in the INFO and FORMAT fields and see whether they add up to a field that represents the total for the number of alleles or depth. Examples to try:
List of tools you could also use to carry some of the above tasks
-
For marking duplicates: Picard MarkDuplicates program
-
Other SNP callers:
- GATK - another haplotype caller like FreeBayes (but FreeBayes is easier to use)
- Index of GATK programs (see under Variant Discovery tools)
- Using GATK is a 2 step process. For Calling germline SNPs you would use HaplotypeCaller followed by GenotypeGVCFs
- Samtools
-
Other VCF filtering tools:
- bcftools (see Samtools manual pages)
- SnpSift
- VEP
Step 5 - Annotate snpEFF Annotation
There are programs to annotate SNPs according to whether the SNP falls within a gene to produce a deleterious mutation e.g. stop codon or potentially a non-synomymous substitution. We will use the snpEff tool:
module load snpeff/4.1g
snpEff \
-c /var/scratch/baileyp/Practical_4_small_wheat_dataset/snpEff/snpEff.config \
refseq_mini \
all_minQ20_minDP6_noindels_maxm1.recode.vcf \
> all_minQ20_minDP6_noindels_maxm1_snpeff.vcf
The output comes with a summary html file containing information on the number of genes with SNP effects or consequences.
A complete list of consequences can be found at Ensembl [here] (https://www.ensembl.org/info/genome/variation/predicted_data.html#consequences).
How would you quickly count the number of stop codons and non-synomymous substitutions effects in the new snpeff.vcf file. How many of each are there? It is better to use a filtering tool however e.g. snpSift (see via the snpEff manual above)
Another SNP effect annotation tool to look at is the the Ensembl Variant Effect Predictor (VEP) which also has its own filtering tool.
**Answers to the questions**
Question: How else could read group info be added to the samples? Read groups can be added when setting up many read aligners e.g. BWA mem - go to the manual and look at the mem command and the -R option. (BWA manual)