
Analysis of transcriptomics on SAGA

Marius Saltvedt 9/28/2021

1 UNIX


1.1 View text files

Use the more or less command followed by a filename to view text files.
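
For example, assuming a text file called results.txt (the file name is only an illustration):

less results.txt   # scroll with space/arrow keys, search with /, quit with q
more results.txt   # simpler pager: space to page forward, q to quit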


1.2 Edit text files

vim to edit text file

  • press i to insert, start writing
  • press esc followed by : to enter a command
  • press :wq to write and quit
  • press :q or :q! to quit or force quit without saving, respectively

1.3 Display a nice table

column -t will help display a nice table, but if the table is long you can display only part of it by using the head command followed by the number of rows (by default head shows only 10 rows).

head -n 100 tablefile | column -t

1.4 See disk usage of files

du -sh filename

1.5 Change filename

You can use the mv command both to rename a file and to move files from one location to another. Below is an example of changing the name of an existing file.

mv file.txt new_file.txt

1.6 Copy or move files

You can copy or move files by using cp or mv, respectively. To move a file you write mv file1 /this/folder/directory. Copying works the same way: start with the command, then specify the file you want to copy and where you want to copy it, cp file1 /this/folder/directory. If you only want to copy certain files within many folders while keeping the folder structure, you must add --parents. You can also use another copy tool, rsync, to exclude specific file names/extensions. Below are a few examples:

mv 4_kallisto /cluster/projects/nn9845k/transcriptomics/3_PkVRF01_Pk033

cp 4_kallisto /cluster/projects/nn9845k/transcriptomics/3_PkVRF01_Pk033

cp --parents 5_star_results/Sample_*/Aligned.out.bam /cluster/work/users/nis005

rsync -av --exclude '*fq.gz' 5_star_results /cluster/work/users/nis005

1.7 Search and retrieve info

There are several ways to search for and retrieve specific information from a file, but the grep command is the most commonly used. If you want specific row numbers you can use sed. You can save the results to another text file by using search > filename; note that this overwrites anything already in that file. To append the results to the file instead, use two >>.
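
For example, to pull out every line mentioning gene_195 (the file and pattern below are only illustrative) and save or append the hits:

grep "gene_195" abundance.tsv > hits.txt    # write matching lines to a new file (overwrites)
grep "gene_195" abundance.tsv >> hits.txt   # append matching lines to the existing file
grep -c "gene_195" abundance.tsv            # only count the matching lines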

This finds lines 7 to 10 in the test.csv file and then appends them to the test.txt file.

sed -n '7,10p' test.csv >> test.txt

If you only want to create a list of all file or folder names in a directory, you can use the ls command and write the output to a text file:

ls >> Out_file.txt

1.8 File order and length of files

While in a directory you can save everything in that directory as an array. Below we have called that array files, where the (*) means save everything in the present working directory (pwd). To call this array you must prefix the name with $. Any element of an array may be referenced using the syntax ${ArrayName[subscript]}. You can also define an array manually: ArrayName=("element 1" "element 2" "element 3"). Let's imagine that files contains files=("1_data" "2_analysis" "3_script"). Now we can display any element of files by its index (the first element has index 0).

files=(*)

echo "${files[0]}" #will print 1_data
echo "${files[2]}" #will print 3_script

1.9 # of elements

If you need to know how many elements are present in your array, you can use ${#ArrayName[@]}

echo ${#files[@]} 

#will print out the number 3
# Because it contains three elements (1_data 2_analysis 3_script)

If the subscript is @ or *, the word expands to all members of the array. By prefixing the variable with # you get the length of the array (i.e. the number of elements). Now we can use a bash for-loop to read or print the values in $files:

# define it
files=("1_data" "2_analysis" "3_script")
 
# get length of the $files array
len=${#files[@]}
 
# Use bash for loop 
for (( i=0; i<$len; i++ )); do echo "${files[$i]}" ; done

# Print from for-loop
1_data
2_analysis
3_script

This for-loop runs through every element position in files and echoes/prints the element at that position. If you also add ; echo $i; you will see that $i is just a number, starting at 0, then 1, then 2, since we only have three elements (see the example below).
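
For example, the same loop with the index echoed as well:

for (( i=0; i<$len; i++ )); do echo $i; echo "${files[$i]}"; done

# Prints:
# 0
# 1_data
# 1
# 2_analysis
# 2
# 3_script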

1.10 For-loop and multiple file changes

Create a for-loop that makes changes to all files in a folder:

This for-loop will find all csv files, copy each csv file and change the file extension to txt while keeping the original name of the file. If you want to replace the original file with the new extension instead of copying, just replace cp with mv.

for i in *.csv; do 
cp -- "$i" "${i%.csv}.txt"; 
done

1.11 Common unix commands
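
A small, illustrative selection of everyday commands (not an exhaustive list; the file and folder names are placeholders):

pwd              # print the present working directory
ls -lh           # list files with sizes and permissions
cd folder        # change directory (cd .. goes one level up)
mkdir folder     # create a new directory
cp file1 file2   # copy a file
mv file1 file2   # move or rename a file
rm file          # remove a file (add -r for directories, use with care)
man command      # open the manual page for a command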




2 NeLS

  1. Start by logging into the NeLS website

  2. Hold your mouse over your name in the upper left corner

  3. Click on Connection details

  4. Click on Download key

    • OBS: In the new version of NeLS you will find the key under My Profile
  5. Place the key in any folder on your computer and navigate (cd: change directory) to this location in your terminal.

  6. Make sure you have the correct permissions on the key by running the command below. Remember to adjust the code below to fit your key (add your personal ID). These commands are also available on NeLS (where you downloaded the key) and are already adjusted to your specific key.

chmod 600 PERSONAL_ID@newstor.cbu.uib.no.key

  7. In the same directory as the key, log in to NeLS using a command similar to the one below.
    • NOTE: You might be asked if you are sure you want to continue connecting. Type yes and you should automatically connect to NeLS
ssh -i PERSONAL_ID@newstor.cbu.uib.no.key PERSONAL_ID@newstor.cbu.uib.no

  8. Transferring files to your personal space in NeLS:
    You can transfer files to your personal space within NeLS, but it is not necessary to do this. You can transfer files directly from projects to a remote server, such as SAGA.
scp -i PERSONAL_ID@newstor.cbu.uib.no.key file-to-send-to-nels PERSONAL_ID@newstor.cbu.uib.no:Personal

2.1 Transferring files to another remote server
Below is an example of how to transfer files from a project on NeLS to a remote server, such as SAGA:

scp -r -i PERSONAL_ID@newstor.cbu.uib.no.key \
PERSONAL_ID@newstor.cbu.uib.no:Projects/UiB_Sandaa_VirVar_Marine_phytoplankton_NGS_2020/210126_A00943.B.Project_Sandaa-RNA1-2020-12-04 \
USERNAME@saga.sigma2.no:/cluster/projects/nn9845k/FOLDERNAME

2.2 If you change directory (cd) to the relevant folder in NeLS you do not need to specify the full file path, only the specific file(s) you want to transfer:

cd Projects/UiB_Sandaa_VirVar_Marine_phytoplankton_NGS_2020

scp -r -i PERSONAL_ID@newstor.cbu.uib.no.key \
210126_A00943.B.Project_Sandaa-RNA1-2020-12-04 \
USERNAME@saga.sigma2.no:/cluster/projects/nn9845k/FOLDERNAME




3 SAGA

3.1 Common commands

  • login: ssh USERNAME@saga.sigma2.no

  • Project folder: /cluster/projects/nn9845k

  • Go to home directory: cd $HOME OR cd OR cd ~

  • See allocated storage: dusage

  • See CPU hours used/remaining: cost

  • Work space: cd $USERWORK
    Personal allocated temporary space, deleted after 42 days

  • View computing nodes: freepe
    See if there are free allocated spaces and view the workload

  • View current CPU hours: qsumm

  • Watch the queue system: squeue -u MyUsername OR squeue -j JobId

3.2 Cancelling jobs

scancel JobId                # Cancel job with id JobId (as returned from sbatch)
scancel --user=MyUsername    # Cancel all your jobs
scancel --account=MyProject  # Cancel all jobs in MyProject

3.3 Putting jobs on hold

scontrol requeue JobId  # Requeue a running job. The job will be stopped, and its state changed to pending.

scontrol hold JobId     # Hold a pending job. This prevents the queue system from starting the job. The job reason will be set to JobHeldUser.

scontrol release JobId  # Release a held job. This allows the queue system to start the job.

sbatch --hold JobScript # Submit a job and put it on hold immediately 

3.4 Modules

module avail - to see available modules
module load - load programs into memory
module purge - unload all loaded programs
module list - current loaded program
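
For example, using the Bowtie2 module that is loaded later in this guide:

module avail Bowtie2                   # list the available Bowtie2 versions
module load Bowtie2/2.4.4-GCC-10.3.0   # load a specific version
module list                            # confirm what is currently loaded
module purge                           # unload everything again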

3.5 Starting an interactive job on saga

This helps you avoid waiting in the queue; within the requested time limit and CPU hours you can run as many scripts as you want.

srun --account=nn9845k --mem-per-cpu=10G --time=10:00:00 --cpus-per-task=8 --pty bash -i

3.6 Creating alias on saga

  • In your home directory (where you start when you login to saga) type:
vim .bash_profile
  • Under User specific environment and startup programs you can add as many aliases as you want. Below you will find a few examples:
# .bash_profile

# Get the aliases and functions
if [ -f ~/.bashrc ]; then
        . ~/.bashrc
fi

# User specific environment and startup programs
alias virvar='cd /cluster/projects/nn9845k'
alias home='cd $HOME'
alias work='cd $USERWORK'
alias sq='squeue -u nis005'

PATH=$PATH:$HOME/.local/bin:$HOME/bin

export PATH
  • In the above example we can now use these aliases as commands. So when I type:

    • virvar, go to project folder
    • home, go to home folder
    • work, go to personal working environment
    • sq, see jobs running

3.7 Change group affiliation on files

This will change the group affiliation of all files in the project folder. This is necessary if transferred files are still linked to another project or person. You can see group affiliation by typing ls -l.

chgrp -R nn9845k /cluster/projects/nn9845k/




4 VirVar transcriptomics

  • Anders Krabberød's description: https://github.com/krabberod/VirVar_transcriptomes

  • Visualize the sequence quality:

    • Look at the quality of the data by opening the FASTQ PDF file
    • Sequence report: phred score table
    • Sequencing forms a bridge that is read forward and reverse; because the template deteriorates/gets damaged quickly, the reverse read (R2) is usually worse than the forward read (R1).
  • View the first line in the zip file:
    zcat NAME.fastq.gz | head

4.1 Clean Raw data

  • Here you can choose between Cutadapt, Trimmomatic and Trim_galore. For the analysis below we used Trim_galore. Trim_galore will remove adapters and primers (if present) and remove low-quality sequences. (Trimmomatic can be run as part of Trinity!)

4.1.1 The SAGA Script

  • We made a script for Trim_galore that we called trim_galore.slurm, containing the following:
#!/bin/bash
## Name the job, arbitrary
#SBATCH --job-name=trimgalore
## Project folder, every job must be accounted for
#SBATCH --account=nn9845k 
## Text file of the commands that will be sent to the queue
#SBATCH --output=slurm-%j.base 
## Every job requires a runtime limit
#SBATCH --time=10:00:00
## Number of cpus needed
#SBATCH --cpus-per-task=6
## Not needed unless more than one node is needed
##SBATCH --nodes=1
## Every job requires some specification of the memory (RAM) it needs
#SBATCH --mem-per-cpu=8G
## Not needed unless more than one node is used
##SBATCH --partition=normal
## If you ask for more than 180G mem per cpu you need to specify bigmem
##SBATCH --partition=bigmem


R1=$1 # first argument: the R1 (forward) fastq file
R2=$2 # second argument: the R2 (reverse) fastq file

module purge
module load Trim_Galore/0.6.2-foss-2018b-Python-3.6.6
trim_galore $R1 $R2 --paired -j 6 # --paired for R1 and R2; the number 6 must match --cpus-per-task (j = jobs/cpus)
  • OBS: New versions of Trim_Galore may have been released and the module used here may be outdated. As of today (17.10.23) the newest version on SAGA is Trim_Galore/0.6.10-GCCcore-11.2.0.

4.1.2 For-loop

  • Place the script together with all the samples (in our example it will be in the folder 1_data). This directory is located in: /cluster/projects/nn9845k/transcriptomics/1_PkVRF01_He028/1_data

  • While in this folder (pwd) with the trim_galore.slurm script, run the following command in the terminal:

for f in Sample_*; do
cp trim_galore.slurm $f; 
cd $f; 
sbatch trim_galore.slurm *R1_001.fastq* *R2_001.fastq*; 
cd ..; 
done
  • This code does the following, one step per semicolon ;

    1. For every folder (f) that is called Sample_…
    2. Copy script called trim_galore.slurm to this current folder (f),
    3. Change directory to this folder (f),
    4. Run the script in this folder (f) using the two fastq files R1 and R2,
    5. Go back to the main folder
    6. done
  • Remember that any for-loop must start and end with do and done, respectively. Commands are separated by ; or a line shift (above we did both).

  • R1 and R2 in the script refer to *R1_001.fastq* and *R2_001.fastq*

  • More info about trim_galore: https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/

  • After loading trim_galore (module load) you can run: trim_galore --help
    This is helpful if you need to specify parameters other than the defaults, as sketched below.
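
A minimal sketch; the --quality and --length values and the file names below are only examples, not the settings used in this analysis:

module load Trim_Galore/0.6.2-foss-2018b-Python-3.6.6
trim_galore --help
# e.g. raise the quality cutoff and the minimum read length from their defaults
trim_galore --quality 25 --length 36 --paired -j 6 R1_001.fastq.gz R2_001.fastq.gz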

4.1.3 Copy files

  • Copy the trimmed sequences to a new folder and keep the folder structure. This code goes into the 1_data directory, enters all folders/directories that start with Sample_, looks for all files that contain val, and copies the files and folder structure to a new directory called 2_trimmed_reads.
rsync -avR data/Sample_*/*val* 2_trimmed_reads

4.1.4 Creating softlinks (shortcuts)

  • This loop creates new directories and softlinks (shortcuts) to the trimmed reads. You can use this instead of copying if you want to save space and avoid copying the same files multiple times.
  • NB: This script also generates the folder structure for the next step, which uses bowtie
for f in Sample_*; do
mkdir -p ./../3_bowtie_results/$f;
REF=$(readlink -f $f);
echo $REF;
BACK=$(pwd);
cd ./../3_bowtie_results/$f;
ln -s $REF/*val* .;
cd $BACK;
done
  1. This code goes into every folder that starts with Sample_
  2. It then goes one level up the folder structure (..), creates a new directory 3_bowtie_results, and within this directory it recreates the Sample_ folder structure.
  3. It then resolves the full path of the Sample_ folder (readlink -f) and saves it as an object called REF.
  4. Next it echoes/prints the object REF, i.e. the resolved path.
  5. It saves pwd (present working directory) as the object BACK.
  6. It changes directory to the newly created directory from step 2.
  7. In this directory it makes softlinks (ln -s) to the files containing val and places them in the current directory (.).
  8. It changes directory back to the starting folder.

4.2 Bowtie - Map reads to viral genome

  • Several mapping programs can be used, but for now we will use bowtie to map to the genome and kallisto (next section) to map to the ORFs (predicted genes). Kallisto also provides quantification.

  • Bowtie cannot be used to map to the host genome because it is not a splice-aware mapper, which is an issue when using eukaryote genomes as reference. STAR is a good splice-aware alternative.

  • Download the reference genome of PkV RF01 with annotation from:

wget https://raw.githubusercontent.com/RomainBlancMathieu/PkV-RF01/master/PkV-RF01_final.gff
wget https://raw.githubusercontent.com/RomainBlancMathieu/PkV-RF01/master/PkV-RF01_final.fnn
wget https://raw.githubusercontent.com/RomainBlancMathieu/PkV-RF01/master/PkV-RF01_final.fasta
wget https://raw.githubusercontent.com/RomainBlancMathieu/PkV-RF01/master/PkV-RF01_final.faa

4.2.1 Map to reference genome

  • After the sequences have been trimmed by trim_galore we want to map them to the virus genome using bowtie.

  • Mapping requires that the reference is indexed. This command is run in the folder where the genome is located. A couple of new files will be made, starting with PkV-RF01_final_genome*.

module purge
ml Bowtie2/2.4.4-GCC-10.3.0
bowtie2-build PkV-RF01_final.fasta PkV-RF01_final_genome

4.2.2 The SAGA script

  • The next step is to place the bowtie.slurm script in the folder containing either a copy of the trim_galore-trimmed sequences or softlinks to them (see the previous section for how we did this).

  • In the bowtie.slurm script you must set the location of the reference genome and the name you set in the bowtie2-build command. Here we set the name to PkV-RF01_final_genome. The full location is /cluster/projects/nn9845k/transcriptomics/1_PkVRF01_He028/PkVRF01_annotation/Genome/PkV-RF01_final_genome

  • This is how your bowtie.slurm script should look:

#!/bin/bash
#SBATCH --job-name=bowtie
#SBATCH --account=nn9845k
#SBATCH --output=slurm-%j.base
#SBATCH --time=1:00:00
#SBATCH --cpus-per-task=16
#SBATCH --mem-per-cpu=5G
#SBATCH --partition=normal

module purge
#module load Bowtie2/2.3.5.1-GCC-8.2.0-2.31.1
ml Bowtie2/2.4.4-GCC-10.3.0
ml SAMtools/1.12-GCC-10.3.0

REF=/cluster/projects/nn9845k/transcriptomics/1_PkVRF01_He028/PkVRF01_annotation/Genome/PkV-RF01_final_genome

bowtie2 -p 16 --no-unal -x $REF -1 *R1* -2 *R2* > mapped.sam

samtools sort -@ 16 mapped.sam > sorted.bam
samtools index sorted.bam
samtools view -h sorted.bam | cut -f 3,3 | grep 'PkV-RF01' | sort | uniq -c > counts.txt

4.2.3 For-loop

  • Now you must run this script for each sample with another for-loop
for f in Sample_*; do 
cp bowtie.slurm $f; 
cd $f; 
sbatch bowtie.slurm *R1* *R2*; 
cd ..; 
done

4.2.4 Results

  • For each sample the results will include:

    • sorted.bam: file with reads mapped/aligned to the reference genome
    • counts.txt: a simple summary of the number of reads mapping to the reference
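
To get a quick overview of the mapping you can, for example, look at the count summary or run samtools flagstat on the sorted BAM file (the sample name below is just an example):

cat Sample_01-a01va/counts.txt
ml SAMtools/1.12-GCC-10.3.0
samtools flagstat Sample_01-a01va/sorted.bam   # overall alignment statistics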

4.3 Kallisto - Map reads to transcriptome


4.3.1 Map to CDSs

  • First we must make the index required for Kallisto to run properly. Use the CDS (coding sequence) file of the genome (containing the predicted ORFs). This command needs to be run in the location of your CDS file: in this case PkV-RF01_final.fnn.
module purge
ml kallisto/0.46.1-foss-2020a
kallisto index PkV-RF01_final.fnn -i PkV-RF01_final_cds

4.3.2 Softlinks

  • The next step is to place the kallisto.slurm script in the folder containing either a copy of or softlinks to the trim_galore-trimmed sequences (see section 4.1.4 Creating softlinks). Example of softlinks:
for f in Sample_*; do 
mkdir -p ./../4_kallisto/$f; 
REF=$(readlink -f $f); 
echo $REF; 
BACK=$(pwd); 
cd ./../4_kallisto/$f; 
ln -s $REF/*val* .; 
cd $BACK; 
done

4.3.3 The SAGA script

  • In the kallisto.slurm script you must set the location of the index and the name you set in the kallisto index command. Here we set the name to PkV-RF01_final_cds. The full location is /cluster/projects/nn9845k/transcriptomics/1_PkVRF01_He028/PkVRF01_annotation/PkV-RF01_final_cds

  • This is how your kallisto.slurm script should look:

#!/bin/bash
#SBATCH --job-name=kallisto
#SBATCH --account=nn9845k
#SBATCH --output=slurm-%j.base
#SBATCH --time=12:00:00
#SBATCH --cpus-per-task=16
#SBATCH --mem-per-cpu=5G
#SBATCH --partition=normal

module purge
ml kallisto/0.46.1-foss-2020a

REF=/cluster/projects/nn9845k/transcriptomics/1_PkVRF01_He028/PkVRF01_annotation/PkV-RF01_final_cds
OUT=kallisto_results
R1=*R1_001_val_1*
R2=*R2_001_val_2*


kallisto quant -i $REF -t "$SLURM_CPUS_PER_TASK" -o $OUT $R1 $R2
  • Instead of hard-coding the number of cores in the kallisto command you can use a reference instead. This way you only need to change --cpus-per-task, without also having to change the number in the kallisto command. This is done by writing -t "$SLURM_CPUS_PER_TASK"

4.3.4 For-loop

  • Now you must run this script for each sample with another for-loop
for f in Sample_*; do 
cp kallisto.slurm $f; 
cd $f; 
sbatch kallisto.slurm; 
cd ..; 
done

4.3.5 Results

  • For each sample these output files are provided:

    • abundance.h5: a binary file with abundance estimates (used in the R analysis)
    • abundance.tsv: a plaintext table of the abundance estimate for each gene
    • run_info.json: information about the run
       
  • The abundance file (TSV) will display the two columns length and eff_length. The length column is the full length of the CDS, while eff_length is the effective length used for the abundance estimate: the CDS length adjusted for the fragment length distribution (roughly, the number of positions a sequenced fragment can start from within the CDS). The effective length is therefore typically somewhat shorter than the full length.
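
You can have a quick look at the abundance table directly in the terminal, for example (the sample path is only an example; adjust it if you have already moved the results up one level):

head Sample_01-a01va/kallisto_results/abundance.tsv | column -t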

  • The abundance file (H5) is a binary file that can only be read by programs. The reason for this format is that it is much more efficient for a computer to read than a normal text file that we can read.

  • Note: It is also possible to set the number of bootstrap samples for kallisto, as sketched below.
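
A minimal sketch, adding kallisto's bootstrap option (-b/--bootstrap-samples; 100 is only an example value) to the quant command from the script above:

kallisto quant -i $REF -t "$SLURM_CPUS_PER_TASK" -b 100 -o $OUT $R1 $R2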


  • Now you can use the grep command to fish out specific genes. e.g.:
grep -w "gene_195" Sample_*/kallisto_results/abundance.tsv

  • To visualize the data from kallisto you will have to normalize it. This can be done in R using a package called DESeq2 (see the R sections below).

4.3.6 Move results up one directory

  • Kallisto will generate a new directory within each “Sample_*” directory called “kallisto_results” (e.g. 4_kallisto/Sample_01-c01va/kallisto_results). To make it easier to import into R, it is convenient to move all the result files inside this folder up one level, so that the result files for each sample sit directly inside the sample folder. We can create another for-loop that goes into every “kallisto_results” directory, takes all its content, and moves it one level up so the files are no longer “in a folder within a folder”. Move to the 4_kallisto directory and run the following script:
for f in Sample_*; do
BACK=$(pwd);
cd $f;
cd kallisto_results;
mv * ..;
echo "${PWD}";
cd $BACK;
done

4.3.7 Remove files (before download)

  • Before you download the kallisto results to your computer, you want to remove any unnecessary files so the download is faster and the results take up much less space on your computer. Before doing this, I recommend copying the kallisto results to your SAGA work environment; that way nothing is accidentally removed.

  • One option is to copy the files using rsync, which can exclude certain files upon transfer. We don't want to copy the trimmed data as these files are quite large, so we can exclude that file extension. The files look something like 1-c01va_S4_L001_R1_001_val_1.fq.gz and 1-c01va_S4_L001_R2_001_val_2.fq.gz, so we can exclude all files (*) ending with fq.gz:

rsync -av --exclude '*fq.gz' 4_kallisto /cluster/work/users/USERNAME

  • Now we can change directory (cd) to your work environment and remove files we don’t need.

  • We can remove: the script (kallisto.slurm), the SAGA output statistics (slurm-4312668.base) and the empty folder (kallisto_results).

  • We use the rm command to remove files and add -r when we need to remove folders/directories.

  • Since both the script and the SAGA statistics file contain the characters slurm, we can remove both files by matching *slurm*. The * wildcard matches any characters before or after it, depending on whether it is placed at the beginning or at the end, respectively. We also use * to go through all our sample folders. Below we show how to remove the script and SAGA output (first command), and the folder (second command):

rm Sample_*/*slurm*

rm -r Sample_*/kallisto_results

## Optional command
## Run command if you did not exclude these files using rsync:
rm Sample_*/*fq.gz

4.3.8 Download results

  • Once the files are cleaned up and of smaller size (see the section above) you can download (scp) all the data to your computer to run DESeq2 in R.
  • To do this, open a terminal on your own computer (NOT ON SAGA) and navigate to the location where you want to download the files. You can also specify the destination file path, but here we just download into the current/present working directory (pwd) by using the period symbol (.).
scp -r nis005@saga.sigma2.no:/cluster/work/users/nis005/4_kallisto .

4.4 STAR - Map reads to genome

  • STAR is another mapper you can use as an alternative to kallisto.
  • The reason you might choose STAR over kallisto is that STAR is splice-aware, which you need in order to map reads to eukaryotic genomes (due to introns and exons).

4.4.1 STAR database

  • The first step of running STAR is to make a STAR database (index) of the genome. The script is slightly different depending on whether the genome is annotated or not. Below we show an example script for a genome with an annotation file.

  • You can copy/paste the code directly into the terminal, without creating a bash file through vim, but you must be located in the directory with your genome files.

#!/bin/sh
#SBATCH --job-name=star
#SBATCH --account=nn9845k
#SBATCH --output=slurm-%j.base
##SBATCH --nodes=1
#SBATCH --cpus-per-task=16
#SBATCH --time=48:00:00
#SBATCH --mem-per-cpu=6G
##SBATCH --partition=bigmem


START=$(date +%s)
echo "###################################################"
echo "# Started: "
echo "#     $(date)" 
echo "###################################################"
echo ""
module purge
module load STAR/2.7.10b-GCC-11.3.0

STAR --runMode genomeGenerate --sjdbGTFfile PkV-RF01_final.gtf --genomeDir PkV-RF01_stardb --genomeFastaFiles PkV-RF01_final.fasta --genomeSAindexNbases 9 --runThreadN 16

# Computing runtime
secs=$(($(date +%s)-$START))
echo ""
echo "######################"
echo "Script finished: "
echo "    $(date)"
echo "Running time:"
#echo "#     $(($(date +%si)-$START)) seconds"
printf '%dd:%dh:%02dm:%02ds\n' $(($secs/86400)) $(($secs%86400/3600)) $(($secs%3600/60)) $(($secs%60)) 
echo "######################"

  • If you only have a genome without annotation, you can use the same script as above, but change the STAR command line to the one below. Note that --genomeDir needs a name for the output index directory; He028_stardb below is only an example name:
STAR --runMode genomeGenerate --genomeDir He028_stardb --genomeFastaFiles Haptolina_ericina_var_UIO028.mainGenome.fasta --genomeSAindexNbases 9 --runThreadN 16

4.4.2 STAR slurm

  • Start by making softlinks of your trimmed transcriptomics data, in the same manner as in section 4.1.4 Creating softlinks. Here is an example:
for f in Sample_*; do 
mkdir -p ./../5_star_results/$f; 
REF=$(readlink -f $f); 
echo $REF; 
BACK=$(pwd); 
cd ./../5_star_results/$f; 
ln -s $REF/*val* .; 
cd $BACK; 
done

  • Copy your STAR script to your 5_star_results directory or generate a new one with vim star.slurm and copy in the script below. Remember to change the GENOME=... directory of the STAR database to match your folder location.
#!/bin/sh
#SBATCH --job-name=star
#SBATCH --account=nn9845k
#SBATCH --output=slurm-%j.base
##SBATCH --nodes=1
#SBATCH --cpus-per-task=16
#SBATCH --time=48:00:00
#SBATCH --mem-per-cpu=6G
##SBATCH --partition=bigmem


START=$(date +%s)
echo "###################################################"
echo "# Started: "
echo "#     $(date)" 
echo "###################################################"
echo ""
module purge
module load STAR/2.7.10a_alpha_220818-GCC-10.3.0 

GENOME=/cluster/projects/nn9845k/transcriptomics/1_PkVRF01_He028/star_database/PkV-RF01_stardb
echo "GENOME is $GENOME"

R1=$1
R2=$2
echo "R1 and R2 are: " $R1 $R2


STAR --genomeDir $GENOME --readFilesIn $R1 $R2 --outSAMtype BAM Unsorted --runThreadN 16 --readFilesCommand zcat --outReadsUnmapped Fastx



# Computing runtime
secs=$(($(date +%s)-$START))
echo ""
echo "######################"
echo "Script finished: "
echo "    $(date)"
echo "Running time:"
#echo "#     $(($(date +%si)-$START)) seconds"
printf '%dd:%dh:%02dm:%02ds\n' $(($secs/86400)) $(($secs%86400/3600)) $(($secs%3600/60)) $(($secs%60)) 
echo "######################"

  • Now you must run this script for each sample with another for-loop
for f in Sample_*; do 
cp star.slurm $f; 
cd $f; 
sbatch star.slurm *R1* *R2*; 
cd ..; 
done

4.4.3 Download results

  • To download the STAR results, we recommend first deleting unnecessary files (some can be quite large) before you download. See sections 4.3.7 Remove files (before download) and 4.3.8 Download results.
  • It might also be necessary to run the R scripts on SAGA because of large output files. Please follow the steps under section 5.1.2 STAR - Import dataset to R.




5 R - Normalize and visualize data

  • Please click here for more information about RNA sequences.

  • Please download the R scripts here.

  • Then do the following:

    1. Extract zip file
    2. Navigate to R_data folder
    3. Open R_data.Rproj
    4. Open DESeq2_*.R files in the lower right panel in RStudio

  • Note: In the following sections we will use the file Metadata_He028_PkVRF01.xlsx that contains the metadata of experiment 1 with He028 infected with PkV RF01.

  • Start by loading all necessary packages.

library("tximport")
library("DESeq2")
library("tximportData")
library("tidyverse")
library("ggplot2")
library("gplots")
library("vsn")
library("IHW")
library("readxl")
library("rhdf5")
library("apeglm")
library("Rsubread")
library("ggeasy")
library("ggpubr")
library("Rmisc")

  • If installation of packages is required, use the command below and enter the packages that are missing in your library.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("INSERT_PACKAGENAME_HERE")

  • If you need to install all packages, use the command below:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("DESeq2","tximport","tximportData","tidyverse","ggplot2", "gplots","vsn","IHW","readxl","rhdf5","apeglm", "Rsubread", "ggeasy", "ggpubr", "Rmisc"))
## Bioconductor version 3.17 (BiocManager 1.30.22), R 4.3.1 (2023-06-16)

## Warning: package(s) not installed when version(s) same as or greater than current; use
##   `force = TRUE` to re-install: 'DESeq2' 'tximport' 'tximportData' 'tidyverse'
##   'ggplot2' 'gplots' 'vsn' 'IHW' 'readxl' 'rhdf5' 'apeglm' 'Rsubread' 'ggeasy'
##   'ggpubr' 'Rmisc'

## Old packages: 'markdown'

5.1 DESeq - normalize data

  • In order to make all data comparable we must start by normalizing it. We will be using a package called DESeq2 to do this.

  • In the following two sections we have shown how to import dataset from both Kallisto and STAR outputs.

  • The import steps are slightly different between Kallisto and STAR, but the rest of the analyses should be the same.

  • First we must download the results from Kallisto and/or STAR from SAGA onto your own computer (if you haven't already). Below is an example of downloading the results from Kallisto; remember to change to your own username and the correct file path.

scp -r USERNAME@saga.sigma2.no:/cluster/projects/nn9845k/transcriptomics/1_PkVRF01_He028/4_kallisto_results .
  • The -r tells scp to copy all files and folders from the given directory; the final period symbol (.) means copy to the current directory.

  • OBS: It might be necessary to remove certain files before download, as some files can be quite big. Please see sections 4.3.7 Remove files (before download) and 4.3.8 Download results.


5.1.1 Kallisto - Import dataset to R

I. IMPORTING

  • All R code for importing data from kallisto is included in DESeq2_kallisto.R.

  • We must start by reading all the abundance files (from Kallisto output) into R.

  • In order to do this we have to make a list containing all the sample names and the directories for each abundance file:

files <- list.dirs("1_PkVRF01_He028_kallisko") %>% .[-1] %>% file.path(.,"abundance.h5")

  • The above command makes a list of the file paths of each abundance file; it should look something like shown below. You can check it by typing head(files) in the R Console
"1_PkVRF01_He028_kallisko/Sample_01-a01va/abundance.h5"
"1_PkVRF01_He028_kallisko/Sample_02-a01vb/abundance.h5"
"1_PkVRF01_He028_kallisko/Sample_03-a01vc/abundance.h5"
...

  • We then make a separate list containing only the sample/folder name for each abundance file by removing the string (or characters) before and after the / symbol:
samples <- str_remove(files, "/abundance.h5") %>% str_remove("1_PkVRF01_He028_kallisko/")

  • The samples list will look similar to the below example when typing head(samples) in the R Console:
"Sample_01-a01va" "Sample_02-a01vb" "Sample_03-a01vc" ...

  • Then we can link all these sample names (samples) to each file path (files)
names(files) <- samples

  • And files should look like this:
Sample_01-a01va
"1_PkVRF01_He028_kallisko/Sample_01-a01va/abundance.h5"
Sample_02-a01vb
"1_PkVRF01_He028_kallisko/Sample_02-a01vb/abundance.h5"
Sample_03-a01vc
"1_PkVRF01_He028_kallisko/Sample_03-a01vc/abundance.h5"
...

  • With the file path information from files, we can import the dataset using the package tximport. tximport can also take input from tools other than Kallisto; see ?tximport for the full documentation.
txi.kallisto <- tximport(files, type = "kallisto", 
                         txOut = TRUE, ignoreAfterBar = T)

  • txi.kallisto contains all sample names, genes and their abundances, and should look like this:
Sample_01-a01va Sample_02-a01vb Sample_03-a01vc ...
gene_1           849.0528        899.5036
gene_2           565.3992        611.5036
gene_3           181.4811        207.1772
gene_4           509.1173        554.5036
gene_5           167.4460        194.4977
gene_6           245.7182        279.0149
gene_7           500.1173        545.5036
gene_8          1263.0528       1313.5036
gene_9           319.0752        359.1965
gene_10          942.0528        992.5036
gene_11          423.2859        467.7722
gene_12          852.0528        902.5036
gene_13          780.0528        830.5036
...

II. VISUALIZE PSEUDOCOUNTS

  1. You can visualize the pseudocounts before the data is normalized by using the base heatmap function in R.

  2. You can also look at clustering based on similarity of expression levels. This is done by leaving out the Colv and Rowv arguments.

  3. Additionally you can look at abundances instead of pseudocounts

heatmap(txi.kallisto$counts, Colv=NA, Rowv = NA) # 1

heatmap(txi.kallisto$counts) # 2

heatmap(txi.kallisto$abundance, Colv=NA,  Rowv = NA) # 3

III. NORMALIZING DATA

  • Before we normalize the data in DESeq2, we load in a metadataset that gives the experimental design:
metadata <- readxl::read_xlsx("Metadata_He028_PkVRF01.xlsx") # Read in metadata (new version)

  • Next step is to create the DESeq2 object.

  • We define the design of the experiment based on the metadata, which in our case is:

    • Treatment, Replicates and Timepoints
  • DESeq2 will automatically treat replicates as replicates and will normalize the data per library.

dds <- DESeqDataSetFromTximport(txi.kallisto, #the imported abundance files
                                colData = metadata, #Read metadataset by above object
                                design=~Treatment+Replicates+Timepoints) #The experimental design

  • The data can now be normalized by this function:
dds <- DESeq(dds)

  • The function uses a generalized linear model to test for differential expression in the data.
  • Note that this is the first step of normalization and it might not handle/capture really low or high input values that well. For that you would need a more complex model.

IV. TIPS

  • To avoid having to run all the above commands every time you enter R, you can save the dataset instead. There are two methods that can be used.
saveRDS(dds, "dds.rds") # so you can load this datamatrix 

#OR you can save Rdata (everything stored in the R-Environment):
save.image("all.Rdata") #save data-working environment
load("all.Rdata") #If you have cleared environment you can just load it here!

5.1.2 STAR - Import dataset to R

I. INTERACTIVE R ON SAGA

  • All R code for importing data from the STAR bam files is included in DESeq2_STAR.R.

  • Loading the data in STAR is slightly different compared to Kallisto, but the final step of normalization is the same.

  • If you are running the STAR-based normalization for the host genome, the files will probably be too big to run the analyses on your own computer. Fear not, we can start an interactive computing node on SAGA that will allow us to run the R script there:

srun --account=nn9845k --mem-per-cpu=10G --time=10:00:00 --cpus-per-task=8 --pty bash -i

  • Once the job has gone through the queue you can load R on SAGA:
module load R/4.1.2-foss-2021b

II. IMPORTING

  • Whether you run the R-script on your computer or on SAGA, the following steps will be the same.

  • You start by loading library("Rsubread")

  • If library is not installed, please run:

if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("Rsubread")

The next step is to make a list of the file paths of each “Aligned.out.bam” file from STAR:

files <- list.dirs("star_bam_results") %>% .[-1] %>% file.path(.,"Aligned.out.bam")

  • We use featureCounts to read in both the annotation file and all the “Aligned.out.bam” files:
fc <- featureCounts(files,
                    annot.ext = "star_bam_results/PkV-RF01_final.gtf", 
                    isGTFAnnotationFile = TRUE,
                    isPairedEnd = TRUE)

III. RENAMING

  • To make the data easier to interpret we can rename the rownames and column names to more logical names.

  • Instead of every column name being the full file path, we can rename it to just the sample name.

  • The command below removes the folder name star_bam_results and the file name Aligned.out.bam, keeping only the unique sample name Sample_XX.... It then takes this list and applies it to the column names of fc$counts.

samples <- str_remove(files, "/Aligned.out.bam") %>% str_remove("star_bam_results/")

colnames(fc$counts) <- samples

  • We can also change the row names to their specific gene numbers. We can do this either:
    1. by using the rownames from the kallisto dataset
    2. or by manually inserting gene numbers as shown below.
rownames(fc$counts) <- rownames(dds) # 1
rownames(fc$counts) <- paste0("gene_",1:nrow(fc$counts)) # 2

IV. VISUALIZING

  • The pseudocounts can be visualized in the same way as for kallisto.
heatmap(fc$counts, Colv=NA, Rowv = NA)
heatmap(fc$counts)
# heatmap(fc$abundance, Colv=NA, Rowv = NA) # not applicable here: the featureCounts object has no abundance slot, only raw counts

V. NORMALIZING

  • The next step is to load the metadata:
metadata <- readxl::read_xlsx("Metadata_He028_PkVRF01.xlsx")

  • Now you can create the DESeq object and normalize the data:
dds <- DESeqDataSetFromMatrix(fc$counts, 
                              colData = metadata, 
                              design=~Treatment+Replicates+Timepoints)

dds <- DESeq(dds)
  • The following sections on visualizing and extracting the data should now be the same for both STAR and kallisto results.

5.2 Extracting data

  • All R code for extracting and visualizing data from either kallisto or STAR is included in DESeq2_visualize.R.
  • The data is now normalized and we can have a closer look at specific time points, p-values, genes etc.

5.2.1 Comparing time points

  • We can start by comparing different time points; we therefore list the available comparisons based on the design from the metadata
resultsNames(dds)

  • Now we pick two timepoints we want to compare based on the above list, e.g. “Timepoints_13_vs_01”. This will generate a table with all genes and show whether or not they are differentially expressed based on the p-values
res <- results(dds, name = "Timepoints_13_vs_01")
res

  • Next we can shrink the log2 fold changes to reduce the influence of noisy, extreme values (both high and low).
resLFC <- lfcShrink(dds, coef="Timepoints_13_vs_01", type="apeglm")
resLFC

  • Then we can reorder these results to show the smallest p-values at the top and compare resLFC with res. Note that these are the most differentially expressed genes between time points 01 and 13
resOrdered <- res[order(res$pvalue),]
resOrdered_2 <- resLFC[order(resLFC$pvalue),]

  • We can also see how many adjusted p-values were less than 0.01
sum(res$padj < 0.01, na.rm=TRUE)
sum(resLFC$padj < 0.01, na.rm=TRUE)

  • Additionally it is possible to apply custom functions for independent filtering and p-value adjustment using results().
  • Below we add Independent Hypothesis Weighting (ihw) as the filter function:
resIHW <- results(dds, filterFun=ihw)
summary(resIHW)
sum(resIHW$padj < 0.05, na.rm=TRUE)
metadata(resIHW)$ihwResult

5.3 Visualize data

  • The next step is to visualize the data and visually evaluate how the normalized data looks.

5.3.1 Mean of normalized counts

  • We can use a simple helper function that makes a so-called “MA-plot”, i.e. a scatter plot of log2 fold changes (on the y-axis) versus the mean of normalized counts (on the x-axis).

  • Below we plot both the res data and the log formed data resLFC

plotMA(res, ylim=c(-2,2)) # Skewed because we are only looking at viral data; should filter out the control
plotMA(resLFC, ylim=c(-2,2))

  • If the data looks skewed it is likely because you are only looking at part of the data. E.g. if you look at the viral transcriptome, the dataset still includes the control samples, which should not contain much viral signal. Filtering out your controls should make the plot appear more evenly spread.

5.3.2 Normalized expressed genes

  • Next we can look at specific genes and how the expression changes over time or compare treatments or replicates.
  • We start by plotting the gene that changes the most over time (lowest adjusted p-value).
plotCounts(dds, gene=which.min(res$padj), intgroup="Timepoints")

  • We can also input a specific gene number to see how the normalized expression of that gene changes over time. Below we looked at genes 892, 940 and 637.
plotCounts(dds, 892, intgroup="Timepoints")
plotCounts(dds, 940, intgroup="Timepoints")
plotCounts(dds, 637, intgroup="Timepoints")

  • To compare treatments for a specific gene you just change “Timepoints” to “Treatment”. Below we compare the control with the infected culture using gene 637.
plotCounts(dds, 637, intgroup="Treatment")

  • You can also include multiple parameters such as both Treatment and Replicates
plotCounts(dds, 637, intgroup=c("Replicates", "Treatment"))

5.3.3 Change plot aesthetics

  • To make your plots look better visually you can use a package called ggplot2.
  • We start by saving the previous plot to an object and passing it to ggplot. Below we show examples for the most differentially expressed gene and for a specific gene (gene_637).
p1 <- plotCounts(dds, gene=which.min(res$padj), intgroup="Timepoints", 
                returnData=TRUE)
                
p2 <- plotCounts(dds, 637, intgroup=c("Timepoints","Treatment","Replicates"),
                 returnData=TRUE)
                 
ggplot(p1, aes(x=Timepoints, y=count)) + #swap p1 with p2 to see other plot
  geom_point(position=position_jitter(w=0.1,h=0.0)) + 
  scale_y_log10(breaks=c(25,100,400))

  • We can build further on these plots by adding colour for treatment.
  • We need to load two packages ggeasy and ggpubr.
library("ggeasy")
library("ggpubr")
ggplot(p1 , aes(x=Timepoints, y=count, color = factor(Treatment))) + # Can change Treatment to Replicate
  geom_point(position=position_jitter(w=0.1,h=0.0)) +
  easy_add_legend_title("Treatment") +
  ggtitle(paste("Gene xxx"))

### Can also add logarithmic scale if necessary
# + 
#scale_y_log10(breaks=c(25,100,400))

5.3.4 Display multiple plots

  • You can use a for-loop to display multiple plots. Below we exemplify this by displaying a matrix of 9 plots, side by side, for 9 different genes.
  • First we must make a list of the genes of interest. We show examples using the top 9 most differentially expressed genes and a manual selection of genes.
  • We use the ordered results (resOrdered) and extract the gene numbers in order from most to least differentially expressed.
ordered_genes <- rownames(resOrdered) %>% 
  str_split_fixed("_",2) %>% .[,2] %>%  as.numeric()
  
genes_of_interest <- ordered_genes[1:9] # Change to e.g. 10:19 to get the next 9 genes in the ordered list

  • To select specific genes you simply make a vector with these genes. Here we picked the four genes 32, 556, 25 and 50
genes_of_interest_2 <-  c(32,556,25,50)

  • Now we can create a list of our plots: first create an empty list, then run a for-loop that generates the desired plots and adds them to the list in order.
list_of_plots <-list()

for(i in 1:length(genes_of_interest))
{
  list_of_plots[[i]]  <- plotCounts(dds, genes_of_interest[i], 
                                    intgroup=c("Timepoints","Treatment","Replicates"),
                                    returnData=TRUE) %>% 
    ggplot(aes(x=Timepoints, y=count, color = factor(Treatment))) + 
    geom_point(position=position_jitter(w=0.1,h=0.0)) +
    easy_add_legend_title("Treatment") +
    ggtitle(paste("Gene",  genes_of_interest[i]))
}

### This line will arrange the plot 3 x 3
### Change the ncol and nrow if you have more/less plots
ggarrange(plotlist=list_of_plots, ncol = 3, nrow = 3)

  • You can save this plot by using ggsave
ggsave("top9genes.pdf", width = 15, height = 15)

5.3.5 Adding SE to plot

  • We can calculate the mean of the replicates and display the standard error as error bars in the plot. For this we need to load Rmisc.
  • We use a function called summarySE on our data and define the column name of the measured variable (gene count) and which columns to group by (Timepoints and Treatment).
library(Rmisc)
p1.1 <- p1 %>% summarySE(measurevar="count", groupvars=c("Timepoints","Treatment"))

  • Now we can plot the data with SE and we only need to add one function to the plot called geom_errorbar().
ggplot(p1.1 , aes(x=Timepoints, y=count, color = factor(Treatment))) + # Can change Treatment to Replicate
  geom_point() +
  geom_errorbar(aes(ymin=count-se, ymax=count+se, color = Treatment), width=1) +
  easy_add_legend_title("Treatment") +
  ggtitle(paste("Gene xxx")) +
  theme_classic() #Optional to make the plot look nicer

  • We can add SE to the matrix of plots using the for-loop as previously shown
list_of_plots <-list()

for(i in 1:length(genes_of_interest))
{
  list_of_plots[[i]]  <- plotCounts(dds, genes_of_interest[i], 
                                    intgroup=c("Timepoints","Treatment","Replicates"),
                                    returnData=TRUE) %>% summarySE(measurevar="count", groupvars=c("Timepoints","Treatment")) %>% 
    ggplot(aes(x=Timepoints, y=count, color = factor(Treatment))) + 
    geom_point() +
    geom_errorbar(aes(ymin=count-se, ymax=count+se, color = Treatment), width=.5) +
    easy_add_legend_title("Treatment") +
    ggtitle(paste("Gene",  genes_of_interest[i])) +
    theme_classic()
}

ggarrange(plotlist=list_of_plots, ncol = 3, nrow = 3)

5.3.6 Volcanoplot

  • It is possible to plot the log2 fold change in gene expression against its adjusted p-value. Note that the y-axis (-log10(padj)) is inverted, meaning the higher the value, the more significantly differentially expressed the gene is.
  • The plot should display a clear relationship between log2 fold change and adjusted p-value.
df <- resOrdered %>% as.data.frame()

ggplot(df, (aes(x = log2FoldChange, y = -log10(padj)))) +
  geom_point(alpha=0.4, size=0.5) +
  geom_vline(xintercept=c(-1,1),lty=4,col="black",lwd=0.2) +
  geom_hline(yintercept = 1.301,lty=4,col="black",lwd=0.2) +
  labs(x="log2(fold change)",
       y="-log10 (adj.p-value)",
       title="Volcanoplot")  +
  theme_bw()+
  theme(plot.title = element_text(hjust = 0.5), 
        legend.position="right", 
        legend.title = element_blank())

5.3.7 Build table

  • When reading the data you can include an additional argument that specifies which comparison to extract from the object to build a results table.
colData(dds)
resMFType <- results(dds, contrast=c("Timepoints","02","ar"))

5.3.8 meanSdPlot

  • You can evaluate the data by plotting the row standard deviations versus the row means.
  • You have to start by transforming the DESeq2 matrix to a format that is possible to plot.
ntd <- normTransform(dds) #Transforming data
meanSdPlot(assay(ntd)) #plotting sd against means

5.3.9 Heatmap

  • We can use pheatmap to visualize gene expression; the function will cluster and aggregate rows.
  • The function also allows aggregating the rows using k-means clustering. Click here for more information about k-means clustering
  • This is advisable if the number of rows is so big that R cannot handle their hierarchical clustering anymore, roughly more than 1000.
  • We start by visualizing the top 20 most expressed genes by
    1. Extracting the count data
    2. Ordering gene expression in decreasing order and picking the top 20 most expressed genes.
    3. We can also use all genes, but then cluster using k-means and select all genes
    4. and finally making a metadata overview for pheatmap
select <- order(rowMeans(counts(dds,normalized=TRUE)), # 1
                decreasing=TRUE)[1:20] # 2
                
select.all <- order(rowMeans(counts(dds,normalized=TRUE)), # 3
                decreasing=TRUE)
                
df <- as.data.frame(colData(dds)[,c("Replicates","Timepoints")]) # 4

  • Now we can display the heatmap and adjust the following:
    1. Reads in the transformed data from the previous section (ntd) and extracts only the top 20 genes from the object select. Here we can swap select for select.all, but then we must also add k-means (see point 5). We can also swap select for genes_of_interest, which we created earlier; this will show the most differentially expressed genes instead of the most expressed genes.
    2. Clusters the rows, which will group expressed genes and you can possibly include 100s of genes in this type of cluster.
    3. Displays/hides gene/cluster names, change to FALSE if you want to hide it
    4. Clusters the columns, which will group the time points. If the time scale order is the same as before grouping, then you can have good confidence in your data ;)
    5. If you use object select.all there are too many genes for R to handle and we must cluster the genes using kmeans_k. The number can be set to the desired number of clusters.
    6. Annotation names from the recently created metadata df
pheatmap(assay(ntd)[select,], # 1
         cluster_rows=FALSE, # 2
         show_rownames=TRUE, # 3
         cluster_cols=FALSE, # 4
         #kmeans_k = 20, # 5
         annotation_col=df) # 6

5.3.10 Principal Component Analysis (PCA)

  • We can run a Principal Component Analysis (PCA) to visualize the data in a simpler format. You must define the dataset ntd; the intgroup, which is the grouping variable and can be Timepoints, Replicates, Treatment etc.; and ntop, the number of top genes (selected by highest row variance) to use for the principal components.
  • You should hopefully see a clustering of early, mid and late expressed genes. If so, you can draw an ellipse around the different time phases.
plotPCA(ntd, intgroup="Timepoints",ntop = 500) 
plotPCA(ntd, intgroup="Replicates",ntop = 1100) #Replicates and all genes

5.3.11 Excluding the control

  • We can remove the control from our samples and subset to only the infected cultures. This is done by looking for “v” in the Treatment column of our metadata:
dds_vir <- dds[, dds$Treatment %in% "v"]

  • It is also possible to extract specific timepoints using the above command:
dds_01_and_14 <- dds[, dds$Timepoints %in% c("01", "14")]

  • We can also exclude the resistant cultures (timepoint ar) from our dataset; dds_vir will now only contain infected cultures in the active infection cycle:
`%notin%` <- Negate(`%in%`)
dds_vir<-dds_vir[, dds_vir$Timepoints %notin% "ar"]

  • Now we can run all of the above analyses and plots, but without the data from controls and resistant cultures.
resultsNames(dds_vir)
res_vir <- results(dds_vir, name = "Timepoints_13_vs_01")
resLFC_vir <- lfcShrink(dds_vir, coef="Timepoints_13_vs_01", type="apeglm")
resLFC_vir

plotCounts(dds_vir, gene=which.min(res_vir$padj), intgroup="Timepoints")
d <- plotCounts(dds_vir, gene=which.min(res_vir$padj), intgroup="Timepoints", 
                returnData=TRUE)

ggplot(d, aes(x=Timepoints, y=count)) + 
  geom_point(position=position_jitter(w=0.1,h=0.0)) + 
  scale_y_log10(breaks=c(25,100,400))

ntd_vir <- normTransform(dds_vir)
meanSdPlot(assay(ntd_vir))
pheatmap(assay(ntd_vir)[select,], 
         cluster_rows=FALSE, 
         show_rownames=TRUE, 
         cluster_cols=FALSE,
         annotation_col=df)

plotPCA(ntd_vir, intgroup="Timepoints",ntop = 1100)




6 R - Extract genome sequences from annotations

  • Introduction to stringr: https://cran.r-project.org/web/packages/stringr/vignettes/stringr.html

  • If you only have the sequence position of the CDS regions, but not the sequences themselves you can extract them yourself.

  • First we must load tidyverse library(tidyverse)

  • Then you must load the full genome sequence and the data sheet containing your CDS sequence positions. You might have an Excel sheet with a column containing gene number, sequence position, pos/neg strand etc., like this: gene_1|GeneMark.hmm|629_aa|-|2|1888. Here the sequence position starts at 2 and ends at 1888. You need to save this Excel sheet as a csv file in order for R to read it properly. Below I have shown this, together with selecting only the column that contains the necessary info about the CDS regions.

annotation.df <- read.csv("PkVRF02.csv", sep=";") %>% select(CDS_regions)

genome_seq.df <- read_lines("PkV-RF02_v2.fasta")

  • The next step is to isolate the numbers inside each row containing the CDS regions. Below we use the function separate to do this; since we do not specify the separator symbol it will split on any common separator type (- / | . , _ etc.). If we take the first row as an example, it will split gene_1|GeneMark.hmm|629_aa|-|2|1888 into eight new columns: “Gene”, “Nr”, “Gmark”, “Hmm”, “aa”, “pos_neg”, “from”, “to”. Then we select only the from and to columns, which contain the position data.
  1. Gene - gene
  2. Nr - 1
  3. Gmark - GeneMark
  4. Hmm - hmm
  5. aa - aa
  6. pos_neg - -
  7. from - 2
  8. to - 1888
position.df <- annotation.df %>% 
  separate(col=CDS_regions, into=c( "Gene","Nr", "Gmark", "Hmm", "aa", "pos_neg", "from", "to")) %>% 
  select(c("from", "to"))

  • Now we have isolated the CDS regions into one table and we can look at the genome sequence. To start with, you should check that the whole genome sequence is listed as one continuous line or string. Often the sequence will be listed over multiple lines with line shifts, and then you will need to remove these line shifts in order to easily access positions on the whole sequence. We can remove all line shifts by using a for-loop that goes through every line and saves/concatenates it onto one continuous string.
#nchar()
#str_length()
#str_sub(x, from, to)

genome_clean = ""                                             #1
for (i in 2:length(genome_seq.df) ) {                         #2
  #if (i < 10) {                                              #3 
    temp <- paste(genome_clean, genome_seq.df[i], sep = "")   #4
    genome_clean <- temp                                      #5
  #} #end of if
}

  1. Empty string, so we can use it to add strings to it
  2. for every number “i” from 2 (excludes the fasta header line) to the total number of lines in genome_seq.df
  3. you can test with only the first 10 lines by using “if”
  4. concatenate string in position “i” without separation to the object genome_clean
  5. save this concatenated string to genome_clean so it can be used in the next loop in step 4.

  • Now we have a continuous genome sequence and the table with CDS regions/positions. In this step we will make an fnn file that contains the name of each sequence followed by the nucleotide sequence itself.
PkV_cds = ""                                                                          #1
for (p in 1:(nrow(position.df) )) {                                                   #2
  temp.2 <- paste(PkV_cds, ">", annotation.df[p,1], " >PkV-RF02", "\n", sep = "")     #3
  temp.3 <- str_sub(genome_clean, position.df[p,1], position.df[p,2])                 #4
  PkV_cds <- paste(temp.2, temp.3, "\n", "\n", sep = "")                              #5
}
cat(PkV_cds)                                                                          #6
#View file with line shift cat()
write(PkV_cds, file = "PkV-RF02.fnn")                                                 #7
  1. Empty string
  2. for every number “p” from 1 to the number of rows in position.df, do the following
  3. Create a heading that starts with “>”, add the gene name, add ” >PkV-RF02” and end with a line shift
  4. Extract the sequence from position “from” (column 1) to position “to” (column 2) for every row “p”
  5. Concatenate steps 3 and 4 into the object PkV_cds (heading and sequence separated by line shifts)
  6. View the output
  7. Write to file




7 GIT

7.1 Clone git repository

  • Go to GitHub and find the repository you want to download/clone. Click on the green Code button, copy the HTTPS link and paste it as below:
git clone https://github.com/krabberod/VirVar_transcriptomes.git

7.2 Clone git wiki page

Start by creating the wiki page (make a new one in the repository on github.com).

In the terminal type the following to clone wiki locally in pwd:

git clone https://github.com/YOUR_USERNAME/YOUR_REPOSITORY.wiki.git

Make the necessary modifications; then, to commit and push the changes, do the following:

cd /path/to/local/cloned/wiki
git add .
git commit -m "new files"
git push

7.3 Removing previous commits

  • https://www.youtube.com/watch?v=lVpLoUecZYY

  • Clone git repository in the Github app.

  • In the Terminal window move or cd to git folder. e.g. Documents/Github/Transcriptomics.

  • Remove the two last commits. Change the number if you want to remove more or fewer.

git reset --soft HEAD~2
  • Then force-push this commit to the relevant branch (in our example it's the main branch)
git push origin +main --force

  • Removing an entire commit
    I call this operation “cherry-pit” since it is the inverse of a “cherry-pick”. You must first identify the SHA of the commit you wish to remove. You can do this using gitk --date-order or git log --graph --decorate --oneline. You are looking for the 40-character SHA-1 hash ID (or the 7-character abbreviation). Yes, if you know the “^” or “~” shortcuts you may use those.
git rebase -p --onto SHA^ SHA
  • Obviously replace “SHA” with the reference you want to get rid of. The “^” in that command is literal.

  • However, please be warned. If some of the commits between SHA and the tip of your branch are merge commits, it is possible that git rebase -p will be unable to properly recreate them. Please inspect the resulting merge topology (gitk --date-order HEAD ORIG_HEAD) and contents to ensure that git did what you wanted. If it did not, there is not really any automated recourse. You can reset back to the commit before the SHA you want to get rid of, then cherry-pick the normal commits and manually re-merge the “bad” merges. Or you can just suffer with the inappropriate topology (perhaps creating fake merges, git merge --ours otherbranch, so that subsequent development work on those branches will be properly merged in with the correct merge-base).




8 Other notes

  • Final step prior to analysis in R is to unzip the kallisto files

  • DESeq2 does the normalization of samples

  • Subtract the virus-mapping reads from the complete reads to get all host reads
    See differences in host reads between infected vs control

  • samtools: check the manual if you want to change the mapping-quality cutoff; you do not need to rerun the whole analysis, see the sketch below
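
A minimal sketch of filtering on mapping quality with samtools (the threshold 30 and the file names are only examples):

samtools view -q 30 -b sorted.bam > sorted.q30.bam   # keep only alignments with MAPQ >= 30
samtools index sorted.q30.bam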

  • SAM file: a plain-text alignment file (larger file)

  • BAM file: a compressed SAM file, only readable by software

  • RDA to explain the variation over time, or night vs day etc.

  • Combat-seq: an R package to normalize data when, for instance, the data are not from the same experiment or the data between replicates are “bulky”. For instance, we should use this package if we want to compare data from two different infection experiments. https://github.com/zhangyuqing/ComBat-seq

  • WGCNA: weighted gene co-expression network analyses. Can be used to correlate genes.

  • Google WGCNA with DESeq2, to use the normalized data from DESeq2.

  • WGCNA can also normalize data, but should be avoided as it is better to use the already normalized data from DESeq2

  • Possibly generate a table that can be used in network analysis in cytoscape

  • Look into exportNetworkToCytoscape in WGCNA
