Home
Marius Saltvedt 9/28/2021
Use the more or less command followed by a filename to view text files.
Use vim to edit a text file:
- press i to enter insert mode and start writing
- press esc followed by : to enter a command
- type :wq to write (save) and quit
- type :q or :q! to quit or force-quit without saving, respectively
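A minimal example of that workflow (the file name is just a placeholder):
vim notes.txt   # open (or create) notes.txt
# press i, type your text, press esc, then type :wq to save and quit (or :q! to discard changes)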
column -t will display a file as a nicely aligned table. If it is a long table you can display only part of it with the head command, followed by the number of rows (by default only 10 rows are shown):
head -n 100 tablefile | column -t
To check the size of a file or folder you can use du -sh:
du -sh filename
You can use the mv command both to rename a file and to move files from one location to another. Below is an example of changing the filename of an existing file.
mv file.txt new_file.txt
You can copy or move files by using cp or mv, respectively. To move a file you write mv file1 /this/folder/directory. Copying works the same way: you start with the command and specify which file you want to copy and where to copy it to, cp file1 /this/folder/directory. If you only want to copy certain files within many folders, but keep the folder structure, you must add --parents. You can also use another copy program, rsync, which can exclude specific file names/extensions. Below are a few examples:
mv 4_kallisto /cluster/projects/nn9845k/transcriptomics/3_PkVRF01_Pk033
cp 4_kallisto /cluster/projects/nn9845k/transcriptomics/3_PkVRF01_Pk033
cp --parents 5_star_results/Sample_*/Aligned.out.bam /cluster/work/users/nis005
rsync -av --exclude '*fq.gz' 5_star_results /cluster/work/users/nis005
There are several ways to search for and retrieve specific information from a file, but the grep command is the most commonly used. If you want specific row numbers you can use sed. You can save the results to another text file by using search > filename. However, this will overwrite anything already in that file; to append the search results to the file, use >> instead.
This finds lines 7 to 10 in test.csv and appends them to test.txt:
sed -n '7,10p' test.csv >> test.txt
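A minimal grep sketch of the same redirection idea (the search words and output file name are placeholders):
grep "virus" test.csv > grep_hits.txt    # overwrites grep_hits.txt with all matching lines
grep "control" test.csv >> grep_hits.txt # appends further matches to the same file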
If you are interested in only creating a list with all file names or
folder names from a directory you can use the ls
command and write the
output to a text file:
ls >> Out_file.txt
While in a directory you can save everything in that directory as an object (array). Below we have called that object files, where the (*) means save everything in the present working directory (pwd). To call this object you must put $ in front of the name. Any element of an array may be referenced using the following syntax: ${ArrayName[subscript]}. You can also define an array yourself: ArrayName=("element 1" "element 2" "element 3"). Let's imagine that files contains files=("1_data" "2_analysis" "3_script"). Now we can display any element of files by its position (starting with 0 as the first element).
files=(*)
echo "${files[0]}" #will print 1_data
echo "${files[2]}" #will print 3_script
If you need to know how many elements are present in your array you can use ${#ArrayName[@]}:
echo ${#files[@]}
#will print out the number 3
# Because it contains three elements (1_data 2_analysis 3_script)
If the subscript is @ or *, the word expands to all members of the array. By prefixing # to the variable you get the length of the array (i.e. the number of elements). Now we can use a bash for-loop to read or print the values from $files:
# define it
files=("1_data" "2_analysis" "3_script")
# get length of the $files array
len=${#files[@]}
# Use bash for loop
for (( i=0; i<$len; i++ )); do echo "${files[$i]}" ; done
# Print from for-loop
1_data
2_analysis
3_script
This for-loop will then run through every element position in files and echo/print the element at that position. If you also add ; echo $i; you will see that $i is just a number, starting with i=0, then 1 and then 2 since we only have three elements.
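A minimal sketch of the same loop with the extra echo $i added (it reuses files and len from the block above):
for (( i=0; i<$len; i++ )); do echo "${files[$i]}"; echo $i; done
# prints 1_data, 0, 2_analysis, 1, 3_script, 2 (each on its own line)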
Create a for-loop that makes changes to all files in a folder:
This for-loop finds all csv files, copies each csv file, and changes the file extension to txt while keeping the original name of the file. If you want to replace the original file with the new extension, simply replace cp with mv.
for i in *.csv; do
cp -- "$i" "${i%.csv}.txt";
done
- Start by logging into the NeLS website
- Hold your mouse over your name in the upper left corner
- Click on Connection details
- Click on Download key
- OBS: In the new version of NeLS you will find the key under My Profile
- Place the key in any folder on your computer and navigate (cd: change directory) to this location through your terminal.
- Make sure you have the correct access permissions on the key by running the command below. Remember to adjust the code to fit your key (add your personal ID). These commands are also available on NeLS (where you downloaded the key) and are adjusted to your specific key.
chmod 600 PERSONAL_ID@newstor.cbu.uib.no.key
- In the same directory as the key, log in to NeLS by using a command similar to the one below.
- NOTE: You might be asked if you are sure you want to continue connecting. Type yes and you should automatically connect to NeLS.
ssh -i PERSONAL_ID@newstor.cbu.uib.no.key PERSONAL_ID@newstor.cbu.uib.no
-
Transferring files to personal space in NeLS:
You can transfer files to your personal space within NeLS, but it is not necessary to do this. You can transfer files directly from projects to a remote server, such as SAGA.
scp -i PERSONAL_ID@newstor.cbu.uib.no.key file-to-send-to-nels PERSONAL_ID@newstor.cbu.uib.no:Personal
9.1 Transferring files to another remote server example:
Below is an example of how to transfer a file from a project on NeLS to a remote server, such as SAGA:
scp -r -i PERSONAL_ID@newstor.cbu.uib.no.key PERSONAL_ID@newstor.cbu.uib.no:Projects/UiB_Sandaa_VirVar_Marine_phytoplankton_NGS_2020/210126_A00943.B.Project_Sandaa-RNA1-2020-12-04 USERNAME@saga.sigma2.no:/cluster/projects/nn9845k/FOLDERNAME
9.2 If you change directory (cd) to the relevant folder in NeLS you do not need to specify the full file path, only the specific file(s) you want to transfer:
cd Projects/UiB_Sandaa_VirVar_Marine_phytoplankton_NGS_2020
scp -r -i PERSONAL_ID@newstor.cbu.uib.no.key 210126_A00943.B.Project_Sandaa-RNA1-2020-12-04 USERNAME@saga.sigma2.no:/cluster/projects/nn9845k/FOLDERNAME
- Login: ssh USERNAME@saga.sigma2.no
- Project folder: /cluster/projects/nn9845k
- Go to home directory: cd $HOME, cd or cd ~
- See allocated storage: dusage
- See CPU hours used/remaining: cost
- Work space: cd $USERWORK (personal allocated temporary space, deleted after 42 days)
- View computing nodes: freepe (see if there are free allocated spaces and view the workload)
- View current CPU hours: qsumm
- Watch the queue system: squeue -u MyUsername or squeue -j JobId
scancel JobId # Cancel job with id JobId (as returned from sbatch)
scancel --user=MyUsername # Cancel all your jobs
scancel --account=MyProject # Cancel all jobs in MyProject
scontrol requeue JobId # Requeue a running job. The job will be stopped, and its state changed to pending.
scontrol hold JobId # Hold a pending job. This prevents the queue system from starting the job. The job reason will be set to JobHeldUser.
scontrol release JobId # Release a held job. This allows the queue system to start the job.
sbatch --hold JobScript # Submit a job and put it on hold immediately
- module avail - see available modules
- module load - load programs into memory
- module purge - unload all loaded programs
- module list - list currently loaded programs
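A minimal sketch of a typical module session; the Trim_Galore version shown is simply the one mentioned later in this guide:
module purge                                    # start from a clean environment
module avail Trim_Galore                        # search for available versions
module load Trim_Galore/0.6.10-GCCcore-11.2.0   # load a specific version
module list                                     # confirm which modules are loaded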
Running an interactive job (see the srun command below) helps you avoid waiting in the queue; within the requested time limit and CPU hours you can run as many scripts as you want.
srun --account=nn9845k --mem-per-cpu=10G --time=10:00:00 --cpus-per-task=8 --pty bash -i
- In your home directory (where you start when you log in to SAGA) type:
vim .bash_profile
- Under User specific environment and startup programs you can add as many aliases as you want. Below are a few examples:
# .bash_profile
# Get the aliases and functions
if [ -f ~/.bashrc ]; then
. ~/.bashrc
fi
# User specific environment and startup programs
alias virvar='cd /cluster/projects/nn9845k'
alias home='cd $HOME'
alias work='cd $USERWORK'
alias sq='squeue -u nis005'
PATH=$PATH:$HOME/.local/bin:$HOME/bin
export PATH
- In the above example we can now use these aliases as commands. So when I type:
- virvar, go to the project folder
- home, go to the home folder
- work, go to the personal working environment
- sq, see jobs running
- This will change the group affiliation of all the files in the project folder. This is necessary if transferred files are still linked to another project or person. You can see the group affiliation by typing ls -l.
chgrp -R nn9845k /cluster/projects/nn9845k/
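To verify that the change took effect you can, for example, list the project folder again and check the group column:
ls -l /cluster/projects/nn9845k/   # the group column should now show nn9845k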
- Anders Krabberød's description: https://github.com/krabberod/VirVar_transcriptomes
- Visualize the sequence quality:
- Look at the quality of the data by opening the FASTQ PDF file
- Sequence report: phred score table
- Sequencing forms a bridge that is read forward and reverse, but since it deteriorates/gets damaged quickly, the reverse read (R2) is always worse than the forward read (R1).
- View the first lines of the zipped file:
zcat NAME.fastq.gz | head
- Here you can choose between Cutadapt, Trimmomatic and Trim_galore. For the analysis below we used Trim_galore. Trim_galore will remove adapters and primers (if present) and remove low-quality sequences. (Trimmomatic can be run as a part of Trinity!)
- We made a script for Trim_galore that we called trim_galore.slurm, containing the following:
#!/bin/bash
## Name the job, arbitrary
#SBATCH --job-name=trimgalore
## Project folder, every job must be accounted for
#SBATCH --account=nn9845k
## Text file of the commands that will be sent to the queue
#SBATCH --output=slurm-%j.base
## Every job requires a runtime limit
#SBATCH --time=10:00:00
## Number of cpus needed
#SBATCH --cpus-per-task=6
## Not needed unless more than one node is needed
##SBATCH --nodes=1
## Every job requires some specification of the memory (RAM) it needs
#SBATCH --mem-per-cpu=8G
## Not needed unless more than one node is used
##SBATCH --partition=normal
## If you ask for more than 180G mem per cpu you need to specify bigmem
##SBATCH --partition=bigmem
R1=$1 #Name for the first element
R2=$2 #Name for the second element
module purge
module load Trim_Galore/0.6.2-foss-2018b-Python-3.6.6
trim_galore $R1 $R2 --paired -j 6 # paired refers to R1 and R2; the number 6 must match --cpus-per-task (j = jobs/cpus)
- OBS: New versions of Trim_Galore might be released and the version above outdated on SAGA. The newest version today (17.10.23) is Trim_Galore/0.6.10-GCCcore-11.2.0.
- Place the script together with all the samples (in our example in the folder 1_data). This directory is located in: /cluster/projects/nn9845k/transcriptomics/1_PkVRF01_He028/1_data
- While your pwd is this folder with the trim_galore.slurm script, you can run the following command in the terminal:
for f in Sample_*; do
cp trim_galore.slurm $f;
cd $f;
sbatch trim_galore.slurm *R1_001.fastq* *R2_001.fastq*;
cd ..;
done
- This code will do the following prior to each semicolon ;
- For every folder (f) that is called Sample_…
- Copy the script called trim_galore.slurm into this folder (f),
- Change directory to this folder (f),
- Run the script in this folder (f) using the two fastq files R1 and R2,
- Go back to the main folder
- done
- Remember that any for-loop must start and end with do and done, respectively. Every command is separated by ; or a line shift (above we did both).
- R1 and R2 in the script refer to *R1_001.fastq* and *R2_001.fastq*
-
More info about trim_galore: https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/
- After loading trim_galore (module load) you can run trim_galore --help. This might be helpful if you need to specify parameters other than the defaults.
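For example (assuming the same module version as in the script above):
module purge
module load Trim_Galore/0.6.2-foss-2018b-Python-3.6.6
trim_galore --help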
- Copy the trimmed sequences to a new folder and keep the folder structure. This code goes into the directory 1_data, enters all folders/directories that start with Sample_, looks for all files containing val, and copies the files and folder structure to a new directory called 2_trimmed_reads.
rsync -avR 1_data/Sample_*/*val* 2_trimmed_reads
- This loop creates new directories and softlinks (shortcuts) to the trimmed reads. You can use this instead of copying if you want to save space and avoid copying the same files multiple times.
- NB: This script generates the folder system for the next step, which is using bowtie
for f in Sample_*; do
mkdir -p ./../3_bowtie_results/$f;
REF=$(readlink -f $f);
echo $REF;
BACK=$(pwd);
cd ./../3_bowtie_results/$f;
ln -s $REF/*val* .;
cd $BACK;
done
- This code will go into every folder that starts with Sample_
- It will then go one level up the folder structure (..) and create a new directory, 3_bowtie_results, and within this directory it will recreate the Sample_ folder structure.
- It will then resolve the full path of the Sample_ folder (readlink -f) and save it as an object called REF.
- Next it will echo/print the object REF, i.e. the full path it found.
- It will save pwd (present working directory) as the object BACK.
- It will change directory to the newly created directory from step 2.
- In this directory it will make and save softlinks (ln -s) of the files called val, placed in the current directory (.).
- Change directory back to the starting folder.
- Several mapping programs can be used, but for now we will use bowtie to map to the genome and kallisto (next section) to map to the ORFs (predicted genes). Kallisto allows quantification.
- Bowtie cannot be used to map to the host genome because it is not a splice-aware mapper, which is an issue when using eukaryote genomes as reference. STAR is a good splice-aware alternative.
- Download the reference genome of PkV RF01 with annotation from:
wget https://raw.githubusercontent.com/RomainBlancMathieu/PkV-RF01/master/PkV-RF01_final.gff
wget https://raw.githubusercontent.com/RomainBlancMathieu/PkV-RF01/master/PkV-RF01_final.fnn
wget https://raw.githubusercontent.com/RomainBlancMathieu/PkV-RF01/master/PkV-RF01_final.fasta
wget https://raw.githubusercontent.com/RomainBlancMathieu/PkV-RF01/master/PkV-RF01_final.faa
- After the sequences have been trimmed by trim_galore we want to map them to the virus genome using bowtie.
- Mapping requires that the reference is indexed. This command is run in the folder where the genome is located. A couple of new files will be made, starting with PkV-RF01_final_genome*.
module purge
ml Bowtie2/2.4.4-GCC-10.3.0
bowtie2-build PkV-RF01_final.fasta PkV-RF01_final_genome
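You can check that the index was built by listing the new files; bowtie2-build writes a set of .bt2 files with the chosen prefix:
ls PkV-RF01_final_genome*
# e.g. PkV-RF01_final_genome.1.bt2 ... PkV-RF01_final_genome.rev.2.bt2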
- The next step is to place the bowtie.slurm script in the folder containing either a copy of the trimmed (trim_galore) sequences or softlinks to them (see the previous section for how we did this).
- In the bowtie.slurm script you must set the location of the reference genome and the name you set in the bowtie2-build command. Here we set the name to PkV-RF01_final_genome. The full location is /cluster/projects/nn9845k/transcriptomics/1_PkVRF01_He028/PkVRF01_annotation/Genome/PkV-RF01_final_genome
- This is how your bowtie.slurm script should look:
#!/bin/bash
#SBATCH --job-name=bowtie
#SBATCH --account=nn9845k
#SBATCH --output=slurm-%j.base
#SBATCH --time=1:00:00
#SBATCH --cpus-per-task=16
#SBATCH --mem-per-cpu=5G
#SBATCH --partition=normal
module purge
#module load Bowtie2/2.3.5.1-GCC-8.2.0-2.31.1
ml Bowtie2/2.4.4-GCC-10.3.0
ml SAMtools/1.12-GCC-10.3.0
REF=/cluster/projects/nn9845k/transcriptomics/1_PkVRF01_He028/PkVRF01_annotation/Genome/PkV-RF01_final_genome
bowtie2 -p 16 --no-unal -x $REF -1 *R1* -2 *R2* > mapped.sam
samtools sort -@ 16 mapped.sam > sorted.bam
samtools index sorted.bam
samtools view -h sorted.bam | cut -f 3,3 | grep 'PkV-RF01' | sort | uniq -c > counts.txt
- Now you must run this script for each sample with another for-loop
for f in Sample_*; do
cp bowtie.slurm $f;
cd $f;
sbatch bowtie.slurm *R1* *R2*;
cd ..;
done
-
For each sample the results will include:
- sorted.bam: file with reads mapped/aligned to the reference genome
- counts.txt: a simple summary of the number of reads mapping to the reference
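A quick way to inspect the counts summary for every sample (a minimal sketch, assuming you are standing in the directory that contains the Sample_* folders, e.g. 3_bowtie_results):
for f in Sample_*; do echo "$f"; cat "$f"/counts.txt; done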
-
Mapping the reads to the CDSs using Kallisto.
-
kallisto will map to your predicted genes and allows quantification. (kallisto also runs/uses bowtie for mapping)
-
Help section
- If you only have the sequence positions on the genome, but not the sequences themselves, please see the section below on R.
- First we must make the database (index) required for Kallisto to run properly. Use the CDS (Coding Domain Sequences) file of the genome (containing the predicted ORFs). This command needs to be run in the location where you have your CDS file: in this case PkV-RF01_final.fnn.
module purge
ml kallisto/0.46.1-foss-2020a
kallisto index PkV-RF01_final.fnn -i PkV-RF01_final_cds
- The next step is to place the kallisto.slurm script in the folder containing either a copy or softlinks of the trimmed (trim_galore) sequences (see section 4.1.4 Creating softlinks). Example of softlinking:
for f in Sample_*; do
mkdir -p ./../4_kallisto/$f;
REF=$(readlink -f $f);
echo $REF;
BACK=$(pwd);
cd ./../4_kallisto/$f;
ln -s $REF/*val* .;
cd $BACK;
done
- In the kallisto.slurm script you must set the location of the reference index and the name you set in the kallisto index command. Here we set the name to PkV-RF01_final_cds. The full location is /cluster/projects/nn9845k/transcriptomics/1_PkVRF01_He028/PkVRF01_annotation/PkV-RF01_final_cds
- This is how your kallisto.slurm script should look:
#!/bin/bash
#SBATCH --job-name=kallisto
#SBATCH --account=nn9845k
#SBATCH --output=slurm-%j.base
#SBATCH --time=12:00:00
#SBATCH --cpus-per-task=16
#SBATCH --mem-per-cpu=5G
#SBATCH --partition=normal
module purge
ml kallisto/0.46.1-foss-2020a
REF=/cluster/projects/nn9845k/transcriptomics/1_PkVRF01_He028/PkVRF01_annotation/PkV-RF01_final_cds
OUT=kallisto_results
R1=*R1_001_val_1*
R2=*R2_001_val_2*
kallisto quant -i $REF -t "$SLURM_CPUS_PER_TASK" -o $OUT $R1 $R2
- Instead of hard-coding the number of cores in the kallisto script you can use a shortcut/reference instead. This way you only need to change cpus-per-task without also having to change the number in the kallisto command. This is done by writing -t "$SLURM_CPUS_PER_TASK".
- Now you must run this script for each sample with another for-loop
for f in Sample_*; do
cp kallisto.slurm $f;
cd $f;
sbatch kallisto.slurm;
cd ..;
done
- For each sample these output files are provided:
- abundance.h5: a binary file with abundance estimates (used in the R analysis)
- abundance.tsv: a plain-text table of the abundance estimate for each gene
- run_info.json: information about the run
- The abundance file (TSV) displays the two columns length and eff_length. The length column gives the full length of a specific CDS, while eff_length gives the length that is unique/specific to that CDS. Many CDSs have repetitive regions (e.g. ATATATATATAT) that are not distinct/specific to that CDS. This is the reason why the uniquely mapped region (eff_length) is often shorter than the gene itself (length).
- The abundance file (H5) is a binary file that can only be read by computers. The reason for this file is that it is much more efficient for a computer to read than a normal text file that we can read.
- Note: It is also possible to set bootstrap values for kallisto.
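To get a quick look at one of the abundance tables you can combine head and column -t (a minimal sketch run from the 4_kallisto directory; the sample name is just an example):
head Sample_01-a01va/kallisto_results/abundance.tsv | column -t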
- Now you can use the grep command to fish out specific genes, e.g.:
grep -w "gene_195" Sample_*/kallisto_results/abundance.tsv
- To visualize the data from kallisto you will have to normalize it. This can be done in R using a package called DESeq2 (see the sections on R below).
- Kallisto will generate a new directory called kallisto_results within each Sample_* directory (e.g. 4_kallisto/Sample_01-c01va/kallisto_results). To make it easier to import into R, it is nice to move all the result files inside this folder up one level, so the result files for each sample are directly inside the sample folder. We can create another for-loop that goes into all kallisto_results directories, takes all their content and moves it one level up so the files are no longer "in a folder within a folder". Move to the 4_kallisto directory and run the following script:
for f in Sample_*; do
BACK=$(pwd);
cd $f;
cd kallisto_results;
mv * ..;
echo "${PWD}";
cd $BACK;
done
- Before you download the kallisto results to your computer, you want to remove any unnecessary files so the download is faster and takes up much less space on your computer. Before you do this I recommend copying the kallisto results to your SAGA work environment; this way nothing is accidentally removed from the project folder.
- One option is to copy the files using rsync, which can exclude certain files upon transfer. We don't want to copy the trimmed data as these files are quite large, so we can exclude their file extension. The files look something like 1-c01va_S4_L001_R1_001_val_1.fq.gz and 1-c01va_S4_L001_R2_001_val_2.fq.gz, so we exclude all files (*) ending with fq.gz:
rsync -av --exclude '*fq.gz' 4_kallisto /cluster/work/users/USERNAME
- Now we can change directory (cd) to your work environment and remove the files we don't need.
- We can remove: the script (kallisto.slurm), the SAGA output statistics (slurm-4312668.base) and the now-empty folder (kallisto_results).
- We use the rm command to remove files and add -r when we need to remove folders/directories.
- Since both the script and the SAGA statistics contain the characters slurm, we can remove both files by including *slurm*. The * matches any characters before or after the pattern, depending on whether it is placed at the beginning or at the end, respectively. We also use * to go through all our sample folders. Below we show how to remove the script and SAGA output (first command), and the folder (second command):
rm Sample_*/*slurm*
rm -r Sample_*/kallisto_results
## Optional command
## Run command if you did not exclude these files using rsync:
rm Sample_*/*fq.gz
- Once the files are cleaned up and smaller in size (see the section above) you can download (scp) all this data to your computer to run DESeq2 in R.
- To do this you must open the terminal and navigate on your own computer (NOT ON SAGA) to the location where you want to download the files. You can also specify the file path to download to, but here we just download to the current/present working directory (pwd) by using the period symbol (.).
scp -r nis005@saga.sigma2.no:/cluster/work/users/nis005/4_kallisto .
- STAR is another mapper you can use as an alternative to kallisto.
- The reason you might want to choose STAR over kallisto is that STAR is splice-aware software, which you need in order to map reads to eukaryotic transcriptomes (due to introns and exons).
- The first step of running STAR is to make a STAR database of the genome. The script is slightly different depending on whether the genome is annotated or not. Below we show an example script for a genome with an annotation file.
- You can copy/paste the code directly into the terminal, without creating a bash file through vim, but you must be located in the directory with your genome files.
#!/bin/sh
#SBATCH --job-name=star
#SBATCH --account=nn9845k
#SBATCH --output=slurm-%j.base
##SBATCH --nodes=1
#SBATCH --cpus-per-task=16
#SBATCH --time=48:00:00
#SBATCH --mem-per-cpu=6G
##SBATCH --partition=bigmem
START=$(date +%s)
echo "###################################################"
echo "# Started: "
echo "# $(date)"
echo "###################################################"
echo ""
module purge
module load STAR/2.7.10b-GCC-11.3.0
STAR --runMode genomeGenerate --sjdbGTFfile PkV-RF01_final.gtf --genomeDir PkV-RF01_stardb --genomeFastaFiles PkV-RF01_final.fasta --genomeSAindexNbases 9 --runThreadN 16
# Computing runtime
secs=$(($(date +%s)-$START))
echo ""
echo "######################"
echo "Script finished: "
echo " $(date)"
echo "Running time:"
#echo "# $(($(date +%si)-$START)) seconds"
printf '%dd:%dh:%02dm:%02ds\n' $(($secs/86400)) $(($secs%86400/3600)) $(($secs%3600/60)) $(($secs%60))
echo "######################"
- If you only have a genome without annotation, you can use the same script as above, but replace the STAR command line with the one below. Note that --genomeDir still needs a name for the database directory; the name used here is just an example:
STAR --runMode genomeGenerate --genomeDir He_UIO028_stardb --genomeFastaFiles Haptolina_ericina_var_UIO028.mainGenome.fasta --genomeSAindexNbases 9 --runThreadN 16
- Start by making softlinks to your trimmed transcriptomics data, in the same manner as in section 4.1.4 Creating softlinks. Here is an example:
for f in Sample_*; do
mkdir -p ./../5_star_results/$f;
REF=$(readlink -f $f);
echo $REF;
BACK=$(pwd);
cd ./../5_star_results/$f;
ln -s $REF/*val* .;
cd $BACK;
done
- Copy your STAR script to your 5_star_results directory or generate a new one with vim star.slurm and copy in the script below. Remember to change the GENOME=... directory of the STAR database to match your folder location.
#!/bin/sh
#SBATCH --job-name=star
#SBATCH --account=nn9845k
#SBATCH --output=slurm-%j.base
##SBATCH --nodes=1
#SBATCH --cpus-per-task=16
#SBATCH --time=48:00:00
#SBATCH --mem-per-cpu=6G
##SBATCH --partition=bigmem
START=$(date +%s)
echo "###################################################"
echo "# Started: "
echo "# $(date)"
echo "###################################################"
echo ""
module purge
module load STAR/2.7.10a_alpha_220818-GCC-10.3.0
GENOME=/cluster/projects/nn9845k/transcriptomics/1_PkVRF01_He028/star_database/PkV-RF01_stardb
echo "GENOME is $GENOME"
R1=$1
R2=$2
echo "R1 and R2 are: " $R1 $R2
STAR --genomeDir $GENOME --readFilesIn $R1 $R2 --outSAMtype BAM Unsorted --runThreadN 16 --readFilesCommand zcat --outReadsUnmapped Fastx
# Computing runtime
secs=$(($(date +%s)-$START))
echo ""
echo "######################"
echo "Script finished: "
echo " $(date)"
echo "Running time:"
#echo "# $(($(date +%si)-$START)) seconds"
printf '%dd:%dh:%02dm:%02ds\n' $(($secs/86400)) $(($secs%86400/3600)) $(($secs%3600/60)) $(($secs%60))
echo "######################"
- Now you must run this script for each sample with another for-loop
for f in Sample_*; do
cp star.slurm $f;
cd $f;
sbatch star.slurm *R1* *R2*;
cd ..;
done
- To download the STAR results, we recommend first deleting unnecessary files (as some can be quite large) before you download. See sections 4.3.7 Remove files (before download) and 4.3.8 Download results; a sketch of the same clean-up applied to the STAR output is shown after this list.
- It might also be necessary to run the R scripts on SAGA, because of large output files. Please follow the steps under section 5.1.2 STAR - Import dataset to R.
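A minimal sketch of that clean-up and download for the STAR output, reusing the same commands as in the kallisto sections (username, paths and which files you delete are assumptions you should adapt):
# copy the STAR results to your work environment, excluding the large trimmed reads
rsync -av --exclude '*fq.gz' 5_star_results /cluster/work/users/USERNAME
# remove the scripts and SAGA output statistics in each sample folder
rm /cluster/work/users/USERNAME/5_star_results/Sample_*/*slurm*
# then, from your own computer, download the cleaned results to the current directory
scp -r USERNAME@saga.sigma2.no:/cluster/work/users/USERNAME/5_star_results .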
-
Please click here for more information about RNA sequences.
-
Please download the R scripts here.
-
Then do the following:
- Extract the zip file
- Navigate to the R_data folder
- Open R_data.Rproj
- Open the DESeq2_*.R files in the lower right panel in RStudio
- Note: In the following sections we will use the file Metadata_He028_PkVRF01.xlsx, which contains the metadata of experiment 1 with He028 infected with PkV RF01.
- Start by loading all necessary packages.
library("tximport")
library("DESeq2")
library("tximportData")
library("tidyverse")
library("ggplot2")
library("gplots")
library("vsn")
library("IHW")
library("readxl")
library("rhdf5")
library("apeglm")
library("Rsubread")
library("ggeasy")
library("ggpubr")
library("Rmisc")
- If installation of packages is required, use the command below and enter the packages that are missing in your library.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("INSERT_PACKAGENAME_HERE")
- If you need to install all packages, use the command below:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("DESeq2","tximport","tximportData","tidyverse","ggplot2", "gplots","vsn","IHW","readxl","rhdf5","apeglm", "Rsubread", "ggeasy", "ggpubr", "Rmisc"))
## Bioconductor version 3.17 (BiocManager 1.30.22), R 4.3.1 (2023-06-16)
## Warning: package(s) not installed when version(s) same as or greater than current; use
## `force = TRUE` to re-install: 'DESeq2' 'tximport' 'tximportData' 'tidyverse'
## 'ggplot2' 'gplots' 'vsn' 'IHW' 'readxl' 'rhdf5' 'apeglm' 'Rsubread' 'ggeasy'
## 'ggpubr' 'Rmisc'
## Old packages: 'markdown'
- In order to make all data comparable we must start by normalizing it. We will be using a package called DESeq2 to do this.
- In the following two sections we show how to import the dataset from both Kallisto and STAR outputs.
- The import steps are slightly different between Kallisto and STAR, but the rest of the analyses are the same.
- First we must download the results from Kallisto and/or STAR from Saga onto your own computer (if you haven't already). Below is an example of downloading the results from Kallisto; remember to change to your own username and the correct file path.
scp -r USERNAME@saga.sigma2.no:/cluster/projects/nn9845k/transcriptomics/1_PkVRF01_He028/4_kallisto_results .
- The -r tells scp to copy all files and folders from the given directory to the current directory, indicated by the period symbol (.).
- OBS: It might be necessary to remove certain files before download, as some files can be quite big. Please see sections 4.3.7 Remove files (before download) and 4.3.8 Download results.
I. IMPORTING
- All the R code for importing data from kallisto is included in DESeq2_kallisto.R.
- We must start by reading all the abundance files (from the Kallisto output) into R.
- In order to do this we make a list containing all the sample names and the directory of each abundance file:
files <- list.dirs("1_PkVRF01_He028_kallisko") %>% .[-1] %>% file.path(.,"abundance.h5")
- The above command will make a list of the file paths for each abundance file and should look something like shown below. You can check it by typing head(files) in the R console:
"1_PkVRF01_He028_kallisko/Sample_01-a01va/abundance.h5"
"1_PkVRF01_He028_kallisko/Sample_02-a01vb/abundance.h5"
"1_PkVRF01_He028_kallisko/Sample_03-a01vc/abundance.h5"
...
- We then make a separate list containing only the sample/folder name for each abundance file by removing the string (characters) before and after the / symbol:
samples <- str_remove(files, "/abundance.h5") %>% str_remove("1_PkVRF01_He028_kallisko/")
- The samples list will look similar to the example below when typing head(samples) in the R console:
"Sample_01-a01va" "Sample_02-a01vb" "Sample_03-a01vc" ...
- Then we can link all these sample names (samples) to each file path (files):
names(files) <- samples
- And files should now look like this:
Sample_01-a01va
"1_PkVRF01_He028_kallisko/Sample_01-a01va/abundance.h5"
Sample_02-a01vb
"1_PkVRF01_He028_kallisko/Sample_02-a01vb/abundance.h5"
Sample_03-a01vc
"1_PkVRF01_He028_kallisko/Sample_03-a01vc/abundance.h5"
...
- With the file path information from files, we can import the dataset by using the package tximport. tximport can also take other input data than Kallisto; please see ?tximport for the full documentation.
txi.kallisto <- tximport(files, type = "kallisto",
txOut = TRUE, ignoreAfterBar = T)
- txi.kallisto contains all sample names, genes and their abundances, and should look like this:
Sample_01-a01va Sample_02-a01vb Sample_03-a01vc ...
gene_1 849.0528 899.5036
gene_2 565.3992 611.5036
gene_3 181.4811 207.1772
gene_4 509.1173 554.5036
gene_5 167.4460 194.4977
gene_6 245.7182 279.0149
gene_7 500.1173 545.5036
gene_8 1263.0528 1313.5036
gene_9 319.0752 359.1965
gene_10 942.0528 992.5036
gene_11 423.2859 467.7722
gene_12 852.0528 902.5036
gene_13 780.0528 830.5036
...
II. VISUALIZE PSEUDOCOUNTS
- You can visualize the pseudocounts before the data is normalized by using the base heatmap in R.
- You can also look at clustering based on similarities of expression levels. This is done by excluding Colv and Rowv.
- Additionally you can look at abundances instead of pseudocounts.
heatmap(txi.kallisto$counts, Colv=NA, Rowv = NA) # 1
heatmap(txi.kallisto$counts) # 2
heatmap(txi.kallisto$abundance, Colv=NA, Rowv = NA) # 3
III. NORMALIZING DATA
- Before we normalize the data in DESeq2, we load a metadata set that gives the experimental design:
metadata <- readxl::read_xlsx("Metadata_He028_PkVRF01.xlsx") # Read in metadata (new version)
- The next step is to create the DESeq2 object.
- We define the design of the experiment based on the metadata, which in our case is: Treatment, Replicates and Timepoints.
- DESeq2 will automatically detect and treat replicates as replicates and will also normalize the data per library.
dds <- DESeqDataSetFromTximport(txi.kallisto, #the imported abundance files
colData = metadata, #Read metadataset by above object
design=~Treatment+Replicates+Timepoints) #The experimental design
- The data can now be normalized by this function:
dds <- DESeq(dds)
- The function uses a generalized linear model to do differential expression analysis on the data.
- Note that this is the first step of normalization and it might not handle/capture really low or high input values that well. For that you would need a more complex model.
IV. TIPS
- To avoid having to run all the above commands every time you enter R, you can save the dataset instead. There are two methods that can be used.
saveRDS(dds, "dds.rds") # so you can load this datamatrix
#OR you can save Rdata (everything stored in the R-Environment):
save.image("all.Rdata") #save data-working environment
load("all.Rdata") #If you have cleared environment you can just load it here!
I. INTERACTIVE R ON SAGA
- All the R code for importing data from STAR bam files is included in DESeq2_STAR.R.
- Loading the data from STAR is slightly different compared to Kallisto, but the final normalization step is the same.
- If you are performing the STAR normalization for the host genome, the files will probably be too big to run the analyses on your own computer. Fear not, we can start an interactive computing node on SAGA that will allow us to run the R script on SAGA:
srun --account=nn9845k --mem-per-cpu=10G --time=10:00:00 --cpus-per-task=8 --pty bash -i
- Once the job has gone through the queue you can load R on SAGA:
module load R/4.1.2-foss-2021b
II. IMPORTING
- Whether you run the R script on your own computer or on SAGA, the following steps are the same.
- You start by loading library("Rsubread")
- If the library is not installed, please run:
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("Rsubread")
The next step is to make a list of the file directory of each Aligned.out.bam file from STAR:
files <- list.dirs("star_bam_results") %>% .[-1] %>% file.path(.,"Aligned.out.bam")
- We use featureCounts to read in both the annotation file and all the Aligned.out.bam files:
fc <- featureCounts(files,
annot.ext = "star_bam_results/PkV-RF01_final.gtf",
isGTFAnnotationFile = TRUE,
isPairedEnd = TRUE)
III. RENAMING
- To make the data easier to interpret we can rename the row names and column names to more logical names.
- Instead of every column name being the full file path, we can rename it to just the sample name.
- The command below removes the folder name star_bam_results and the file name Aligned.out.bam, keeping only the unique sample name Sample_XX.... It then takes this list and applies it to the column names of fc$counts.
samples <- str_remove(files, "/Aligned.out.bam") %>% str_remove("star_bam_results/")
colnames(fc$counts) <- samples
- We can also change the row names to their specific gene numbers. We can do this either:
- by using the row names from the kallisto dataset
- or by manually inserting gene numbers as shown below.
rownames(fc$counts) <- rownames(dds) # 1
rownames(fc$counts) <- paste0("gene_",1:nrow(fc$counts)) # 2
IV. VISUALIZING
- The pseudocounts can be visualized, same as for kallisto.
heatmap(fc$counts, Colv=NA, Rowv = NA)
heatmap(fc$counts)
heatmap(fc$abundance, Colv=NA, Rowv = NA)
V. NORMALIZING
- The next step is to load the metadata:
metadata <- readxl::read_xlsx("Metadata_He028_PkVRF01.xlsx")
- Now you can create the DESeq object and normalize the data:
dds <- DESeqDataSetFromMatrix(fc$counts,
colData = metadata,
design=~Treatment+Replicates+Timepoints)
dds <- DESeq(dds)
- The following sections on visualizing and extracting the data are the same for both STAR and kallisto results.
- All the R code for extracting and visualizing data from either kallisto or STAR is included in DESeq2_visualize.R.
- The data is now normalized and we can have a closer look at specific time points, p-values, genes etc.
- We can start by comparing different time points, and we therefore list the available comparisons based on the design from the metadata:
resultsNames(dds)
- Now we pick two timepoints we want to compare based on the above list, e.g. "Timepoints_13_vs_0". This will generate a list with all genes and show whether or not these are differentially expressed based on the p-values:
res <- results(dds, name = "Timepoints_13_vs_0")
res
- Next we can shrink the log2 fold changes to remove any extreme values (both high and low):
resLFC <- lfcShrink(dds, coef="Timepoints_13_vs_01", type="apeglm")
resLFC
- Then we can reorder these results to show the smallest p-values at the top and compare resLFC with res. Note that these are the most differentially expressed genes between time points 0 and 13:
resOrdered <- res[order(res$pvalue),]
resOrdered_2 <- resLFC[order(resLFC$pvalue),]
- We can also see how many adjusted p-values are less than 0.01:
sum(res$padj < 0.01, na.rm=TRUE)
sum(resLFC$padj < 0.01, na.rm=TRUE)
- Additionally it is possible to apply custom functions for performing independent filtering and p-value adjustment using results.
- Below we add Independent Hypothesis Weighting (IHW) as the filter:
resIHW <- results(dds, filterFun=ihw)
summary(resIHW)
sum(resIHW$padj < 0.05, na.rm=TRUE)
metadata(resIHW)$ihwResult
- The next step is to visualize the data and visually evaluate how the normalized data looks.
- We can use a simple helper function that makes a so-called "MA-plot", i.e. a scatter plot of log2 fold changes (on the y-axis) versus the mean of normalized counts (on the x-axis).
- Below we plot both the res data and the shrunken data resLFC:
plotMA(res, ylim=c(-2,2)) # Skewed because we are only looking at viral data; should filter out the control
plotMA(resLFC, ylim=c(-2,2))
- If the data looks skewed it is likely because you are only looking at part of the data. E.g. if you look at the viral transcriptome it will also include the control data, which should not contain much viral signal. Filtering out your controls (see the subsetting section further down, where dds_vir keeps only the infected cultures) should help make the plot appear evenly spread.
- Next we can look at specific genes and how the expression changes over time or compare treatments or replicates.
- We start by plotting the gene that changes the most over time (lowest p-value).
plotCounts(dds, gene=which.min(res$padj), intgroup="Timepoints")
- We can also input a specific gene number to see how its normalized expression changes over time. Below we looked at genes 892, 940 and 637.
plotCounts(dds, 892, intgroup="Timepoints")
plotCounts(dds, 940, intgroup="Timepoints")
plotCounts(dds, 637, intgroup="Timepoints")
- To compare treatments for a specific gene you just change "Timepoints" to "Treatment". Below we compare the control with the infected culture using gene 637.
plotCounts(dds, 637, intgroup="Treatment")
- You can also include multiple parameters, such as both Treatment and Replicates:
plotCounts(dds, 637, intgroup=c("Replicates", "Treatment"))
- To make your plots look better visually you can use a package called ggplot2.
- We start by saving the previous plot to an object and pass this to ggplot. Below we show an example for the most differentially expressed gene and for a specific gene (gene_637):
p1 <- plotCounts(dds, gene=which.min(res$padj), intgroup="Timepoints",
returnData=TRUE)
p2 <- plotCounts(dds, 637, intgroup=c("Timepoints","Treatment","Replicates"),
returnData=TRUE)
ggplot(p1, aes(x=Timepoints, y=count)) + #swap p1 with p2 to see other plot
geom_point(position=position_jitter(w=0.1,h=0.0)) +
scale_y_log10(breaks=c(25,100,400))
- We can build further on these plots by adding colour by treatment.
- We need to load two packages, ggeasy and ggpubr:
library("ggeasy")
library("ggpubr")
ggplot(p1 , aes(x=Timepoints, y=count, color = factor(Treatment))) + # Can change Treatment to Replicate
geom_point(position=position_jitter(w=0.1,h=0.0)) +
easy_add_legend_title("Treatment") +
ggtitle(paste("Gene xxx"))
### Can also add logarithmic scale if necessary
# +
#scale_y_log10(breaks=c(25,100,400))
- You can use a for-loop to display multiple plots. Below we exemplify this by displaying a 3 x 3 matrix of plots for 9 different genes.
- First we must make a list of the genes of interest. We show examples using the top 9 most differentially expressed genes and a custom selection of genes.
- We use the ordered genes (resOrdered) and extract all the gene numbers in order from most to least differentially expressed:
ordered_genes <- rownames(resOrdered) %>%
str_split_fixed("_",2) %>% .[,2] %>% as.numeric()
genes_of_interest <- ordered_genes[1:9] # Change to e.g. 10:19 to get the next 9 genes in the ordered list
- To select specific genes you simply make a vector with these genes. Here we picked the four genes 32, 556, 25 and 50:
genes_of_interest_2 <- c(32,556,25,50)
- Now we can create a list of our plots. We do this by first creating an empty list, followed by a for-loop that generates the desired plots and adds them in order to this list:
list_of_plots <-list()
for(i in 1:length(genes_of_interest))
{
list_of_plots[[i]] <- plotCounts(dds, genes_of_interest[i],
intgroup=c("Timepoints","Treatment","Replicates"),
returnData=TRUE) %>%
ggplot(aes(x=Timepoints, y=count, color = factor(Treatment))) +
geom_point(position=position_jitter(w=0.1,h=0.0)) +
easy_add_legend_title("Treatment") +
ggtitle(paste("Gene", genes_of_interest[i]))
}
### This line will arrange the plot 3 x 3
### Change the ncol and nrow if you have more/less plots
ggarrange(plotlist=list_of_plots, ncol = 3, nrow = 3)
- You can save this plot by using ggsave
ggsave("top9genes.pdf", width = 15, height = 15)
- We can calculate the mean of the replicates and display the standard error as error bars in the plot. For this we need to load Rmisc.
- We use a function called summarySE on our data and define the column name of the variable (gene count) and which columns to group by (Timepoints and Treatment):
library(Rmisc)
p1.1 <- p1 %>% summarySE(measurevar="count", groupvars=c("Timepoints","Treatment"))
- Now we can plot the data with SE; we only need to add one function to the plot, geom_errorbar():
ggplot(p1.1 , aes(x=Timepoints, y=count, color = factor(Treatment))) + # Can change Treatment to Replicate
geom_point() +
geom_errorbar(aes(ymin=count-se, ymax=count+se, color = Treatment), width=1) +
easy_add_legend_title("Treatment") +
ggtitle(paste("Gene xxx")) +
theme_classic() #Optional to make the plot look nicer
- We can add SE to the matrix of plots using the for-loop as previously shown:
list_of_plots <-list()
for(i in 1:length(genes_of_interest))
{
list_of_plots[[i]] <- plotCounts(dds, genes_of_interest[i],
intgroup=c("Timepoints","Treatment","Replicates"),
returnData=TRUE) %>% summarySE(measurevar="count", groupvars=c("Timepoints","Treatment")) %>%
ggplot(aes(x=Timepoints, y=count, color = factor(Treatment))) +
geom_point() +
geom_errorbar(aes(ymin=count-se, ymax=count+se, color = Treatment), width=.5) +
easy_add_legend_title("Treatment") +
ggtitle(paste("Gene", genes_of_interest[i])) +
theme_classic()
}
ggarrange(plotlist=list_of_plots, ncol = 3, nrow = 3)
- It is possible to display the log2 fold change in gene expression against its adjusted p-value. Note that the y-axis (-log10(padj)) is inverted, meaning the higher the value, the more differentially expressed the gene is.
- The plot should display a nice correlation between log2 fold change and adjusted p-value.
df <- resOrdered %>% as.data.frame()
ggplot(df, (aes(x = log2FoldChange, y = -log10(padj)))) +
geom_point(alpha=0.4, size=0.5) +
geom_vline(xintercept=c(-1,1),lty=4,col="black",lwd=0.2) +
geom_hline(yintercept = 1.301,lty=4,col="black",lwd=0.2) +
labs(x="log2(fold change)",
y="-log10 (adj.p-value)",
title="Volcanoplot") +
theme_bw()+
theme(plot.title = element_text(hjust = 0.5),
legend.position="right",
legend.title = element_blank())
- When reading the data you can include an additional argument that specifies which comparison to extract from the object to build a results table:
colData(dds)
resMFType <- results(dds, contrast=c("Timepoints","02","ar"))
- You can evaluate the data by plotting the row standard deviations versus the row means.
- You have to start by transforming the DESeq2 matrix into a format that can be plotted:
ntd <- normTransform(dds) #Transforming data
meanSdPlot(assay(ntd)) #plotting sd against means
- We can use pheatmap to visualize gene expression, where the function will cluster and aggregate rows.
- The function also allows aggregating the rows using k-means clustering. Click here for more information about k-means clustering.
- This is advisable if the number of rows is so big that R cannot handle their hierarchical clustering anymore, roughly more than 1000.
- We start by visualizing the top 20 most expressed genes by:
- Extracting the count data
- Ordering gene expression in decreasing order and picking the top 20 most expressed genes
- Alternatively using all genes, but clustering them with k-means
- and finally making a metadata overview for pheatmap
select <- order(rowMeans(counts(dds,normalized=TRUE)), # 1
decreasing=TRUE)[1:20] # 2
select.all <- order(rowMeans(counts(dds,normalized=TRUE)), # 3
decreasing=TRUE)
df <- as.data.frame(colData(dds)[,c("Replicates","Timepoints")]) # 4
- Now we can display the heatmap and adjust the following:
- Reads in the transformed data from the previous section (ntd) and extracts only the top 20 genes via the object select. Here we can swap select with select.all, but then we must also add k-means (see point 5). We can also swap select with genes_of_interest, which we created earlier; this will show the most differentially expressed genes instead of the most expressed genes.
- Clusters the rows, which will group expressed genes; you can potentially include hundreds of genes in this type of cluster.
- Displays/hides gene/cluster names; change to FALSE if you want to hide them.
- Clusters the columns, which will group the time points. If the time scale order is the same as before grouping, then you can have good confidence in your data ;)
- If you use the object select.all there are too many genes for R to handle and we must cluster the genes using kmeans_k. The number can be set to the desired number of clusters.
- Annotation names from the recently created metadata df
pheatmap(assay(ntd)[select,], # 1
cluster_rows=FALSE, # 2
show_rownames=TRUE, # 3
cluster_cols=FALSE, # 4
#kmeans_k = 20, # 5
annotation_col=df) # 6
- We can run a Principal Component Analysis (PCA) to visualize the data in a simpler format. You must define the dataset ntd, the intgroup (the grouping, which can be Timepoints, Replicates, Treatment etc.), and ntop, the number of top genes (selected by highest row variance) to use for the principal components.
- You should hopefully see a clustering of early, mid and late expressed genes. If so you can draw an ellipse around the different time phases.
plotPCA(ntd, intgroup="Timepoints",ntop = 500)
plotPCA(ntd, intgroup="Replicate",ntop = 1100) #Replicate and all genes
- We can remove the control from our samples and subset to only the infected cultures. This is done by looking for "v" in our metadata:
dds_vir <- dds[, dds$Treatment %in% "v"]
- It is also possible to extract specific timepoints using the same approach:
dds_01_and_14 <- dds[, dds$Timepoints %in% c("01", "14")]
- We can also exclude the resistant cultures (ar) from our dataset, so that dds_vir now only contains infected cultures in the active infection cycle:
`%notin%` <- Negate(`%in%`)
dds_vir<-dds_vir[, dds_vir$Timepoints %notin% "ar"]
- Now we can run all of the above analyses and plots, but without the data from controls and resistant cultures.
resultsNames(dds_vir)
res_vir <- results(dds_vir, name = "Timepoints_13_vs_01")
resLFC_vir <- lfcShrink(dds_vir, coef="Timepoints_13_vs_01", type="apeglm")
resLFC_vir
plotCounts(dds_vir, gene=which.min(res_vir$padj), intgroup="Timepoints")
d <- plotCounts(dds_vir, gene=which.min(res_vir$padj), intgroup="Timepoints",
returnData=TRUE)
ggplot(d, aes(x=Timepoints, y=count)) +
geom_point(position=position_jitter(w=0.1,h=0.0)) +
scale_y_log10(breaks=c(25,100,400))
ntd_vir <- normTransform(dds_vir)
meanSdPlot(assay(ntd_vir))
pheatmap(assay(ntd_vir)[select,],
cluster_rows=FALSE,
show_rownames=TRUE,
cluster_cols=FALSE,
annotation_col=df)
plotPCA(ntd_vir, intgroup="Timepoints",ntop = 1100)
-
Introduction to stringr: https://cran.r-project.org/web/packages/stringr/vignettes/stringr.html
- If you only have the sequence positions of the CDS regions, but not the sequences themselves, you can extract them yourself.
- First we must load tidyverse:
library(tidyverse)
- Then you must load the full genome sequence and the data sheet containing your CDS sequence positions. You might have an excel sheet containing a column with gene number, sequence position, pos/neg strand etc., like this: gene_1|GeneMark.hmm|629_aa|-|2|1888. Here the sequence position starts at 2 and ends at 1888. You need to save this excel sheet as a csv file in order for R to read it properly. Below we show this, together with selecting only the column that contains the necessary info about the CDS regions.
annotation.df <- read.csv("PkVRF02.csv", sep=";") %>% select(CDS_regions)
genome_seq.df <- read_lines("PkV-RF02_v2.fasta")
- The next step is to isolate the numbers inside each row containing the CDS regions. Below we use the function separate to do this; since we do not specify a separator symbol it will separate on any common separator (- / | . , _ etc.). If we take the first row as an example, it will split gene_1|GeneMark.hmm|629_aa|-|2|1888 into eight new columns: "Gene", "Nr", "Gmark", "Hmm", "aa", "pos_neg", "from", "to". Then we select only the from and to columns, which contain the position data.
- Gene - gene
- Nr - 1
- Gmark - GeneMark
- Hmm - hmm
- aa - aa
- pos_neg - -
- from - 2
- to - 1888
position.df <- annotation.df %>%
separate(col=CDS_regions, into=c( "Gene","Nr", "Gmark", "Hmm", "aa", "pos_neg", "from", "to")) %>%
select(c("from", "to"))
- Now we have isolated the CDS regions into one table and we can look at the genome sequence. To start with you should check that the whole genome sequence is listed as one continuous line or string. Often the sequence will be listed over multiple lines with line shifts, and then you will need to remove these line shifts in order to easily access the positions on the whole sequence. We can remove all line shifts with a for-loop that goes through every line and concatenates it onto one continuous string/line.
#nchar()
#str_length()
#str_sub(x, from, to)
genome_clean = "" #1
for (i in 2:length(genome_seq.df) ) { #2
#if (i < 10) { #3
temp <- paste(genome_clean, genome_seq.df[i], sep = "") #4
genome_clean <- temp #5
#} #end of if
}
- Empty string, so we can add strings to it
- For every number "i" from 2 (excludes the FASTA header line) to the total number of lines in genome_seq.df
- You can test with only the first 10 lines by uncommenting the "if"
- Concatenate the string at position "i", without a separator, onto the object genome_clean
- Save this concatenated string back to genome_clean so it can be used in the next iteration (step 4)
- Now we have a continuous genome sequence and the table with CDS regions/positions. In this step we will make an fnn file that contains the name of each sequence followed by the nucleotide sequence itself.
PkV_cds = "" #1
for (p in 1:(nrow(position.df) )) { #2
temp.2 <- paste(PkV_cds, ">", annotation.df[p,1], " >PkV-RF02", "\n", sep = "") #3
temp.3 <- str_sub(genome_clean, position.df[p,1], position.df[p,2]) #4
PkV_cds <- paste(temp.2, temp.3, "\n", "\n", sep = "") #5
}
cat(PkV_cds) #6
#View file with line shift cat()
write(PkV_cds, file = "PkV-RF02.fnn") #7
- Empty string
- For every number "p" from 1 to the number of rows in position.df, do the following
- Create a heading that starts with ">", add the gene name, add " >PkV-RF02" and end with a line shift
- Add the sequence from column "from" (1) to column "to" (2) for row "p"
- Concatenate steps 3 and 4 into the object PkV_cds (heading and sequence, separated by line shifts)
- View the output
- Write to file
- Go to GitHub and find the repository you want to download/clone. Click on the green Code button, copy the HTTPS link and paste it as below:
git clone https://github.com/krabberod/VirVar_transcriptomes.git
Start by creating the wiki page in the repository on github.com.
In the terminal, type the following to clone the wiki locally into your pwd:
git clone https://github.com/YOUR_USERNAME/YOUR_REPOSITORY.wiki.git
Make the necessary modifications, then commit and push the changes as follows:
cd /path/to/local/cloned/wiki
git add .
git commit -m "new files"
git push
- Clone the git repository in the GitHub app.
- In the terminal window, move or cd to the git folder, e.g. Documents/Github/Transcriptomics.
- Remove the two last commits. Change the number if you want to remove more or fewer commits:
git reset --soft HEAD~2
- Then force-push this to the relevant branch (in our example the main branch):
git push origin +main --force
- OR check out https://sethrobertson.github.io/GitFixUm/fixup.html
- Removing an entire commit: I call this operation "cherry-pit" since it is the inverse of a "cherry-pick". You must first identify the SHA of the commit you wish to remove. You can do this using gitk --date-order or git log --graph --decorate --oneline. You are looking for the 40-character SHA-1 hash ID (or the 7-character abbreviation). Yes, if you know the "^" or "~" shortcuts you may use those.
git rebase -p --onto SHA^ SHA
- Obviously replace "SHA" with the reference you want to get rid of. The "^" in that command is literal.
- However, please be warned: if some of the commits between SHA and the tip of your branch are merge commits, it is possible that git rebase -p will be unable to properly recreate them. Please inspect the resulting merge topology (gitk --date-order HEAD ORIG_HEAD) and contents to ensure that git did what you wanted. If it did not, there is no real automated recourse. You can reset back to the commit before the SHA you want to get rid of, and then cherry-pick the normal commits and manually re-merge the "bad" merges. Or you can just live with the inappropriate topology (perhaps creating fake merges with git merge --ours otherbranch so that subsequent development work on those branches will be properly merged in with the correct merge-base).
- The final step prior to analysis in R is to unzip the kallisto files.
- DESeq2 does the normalization of the samples.
- Subtract the virus-mapping reads from the complete reads to get all host reads. See differences in host reads between infected vs control.
- samtools: check the manual for how to change the mapping cutoff; you do not need to rerun the analysis.
- sam file: text file (larger file)
- bam file: compressed sam file, only readable by machine
- RDA to explain the variation over time, or night vs day etc.
- Combat-seq: an R package to normalize data when, for instance, the data are not from the same experiment or the data between replicates is "bulky". For instance, we should use this package if we want to compare data from two different infection experiments. https://github.com/zhangyuqing/ComBat-seq
- WGCNA: weighted gene co-expression network analysis. Can be used to correlate genes.
- Google WGCNA with DESeq2, to use the normalized data from DESeq2.
- WGCNA can also normalize data, but this should be avoided as it is better to use the already normalized data from DESeq2.
- Possibly generate a table that can be used in network analysis in cytoscape.
- Look into exportNetworkToCytoscape in WGCNA.