diff --git a/2024_edition/01_sherlockConnection.md b/2024_edition/01_sherlockConnection.md
new file mode 100644
index 0000000..891b72e
--- /dev/null
+++ b/2024_edition/01_sherlockConnection.md
@@ -0,0 +1,154 @@
# Connection to Sherlock and familiarization with the Linux shell

We will do all of our bioinformatics analysis on the Stanford High Performance Computing cluster, Sherlock (www.sherlock.stanford.edu). In this part, we will illustrate the basic workflow for connecting to the cluster and getting everything ready for a computational project. The steps are:
- ssh into the cluster
- set up our working directory
- create a persistent session on the cluster to come back to
- request computational resources
- load the necessary packages


At the end of this notebook you will know how to:
- navigate the Linux filesystem with `ls`, `cd`, `mkdir`, `pwd`
- set up a nicely organized folder for the project
- use a GNU Screen session
- use SLURM to access computational resources


## Connection to Sherlock
In your browser, go to https://login.sherlock.stanford.edu, then click on `>Shell`.
Alternatively, you can ssh into Sherlock using a terminal app on your computer. On a Mac, you can use the native Terminal app. Open Terminal (Command-Space, then type "terminal"), then run
```
ssh <sunetid>@login.sherlock.stanford.edu
```


## Setting up our workspace for the project
Go to the group minicourse directory
```
cd $GROUP_SCRATCH/biochem_minicourse_2024
```
Note that `$GROUP_SCRATCH` is a bash variable, which contains the path to a default "scratch" folder for our lab. You can see the content of this variable with
```
echo $GROUP_SCRATCH
```

Make a new folder for yourself, and move to that folder. For example, for team Straight:
```
mkdir straightlab
cd straightlab
```

This folder is currently empty. You can list the content of the current folder with
```
ls
```

You can always see your current location in the filesystem with
```
pwd
```

Now let's create a few subfolders to organize our work. We want our project directory (the team directory in this case) to look like this
```text
.
├── data
│   ├── samples
│   └── external
└── tmp
```

- The `tmp` folder is a scratch folder which we will use to hold temporary files (some programs create temporary files and require a dedicated folder for these files).
- The `data` folder will contain the data and analyses for specific datasets, arranged into subfolders.

Make these directories with
```bash
mkdir data

mkdir data/samples data/external

mkdir tmp
```
Verify the tree structure with
```bash
tree .
```

## Running persistent / recoverable sessions with GNU Screen

What happens if I am in the middle of some task on Sherlock and I lose my internet connection or close my computer? To avoid having to go back to square one, we need to set up a persistent bash session. The standard way to do this is using a window manager such as GNU Screen.

```bash
screen -S genomics
```

This creates a new session called `genomics`. Let's demonstrate what it does by putting a mark in our terminal, leaving, and then coming back.

```bash
# leave a mark
echo "I am here"
# keep track of the node number in the prompt (for example sh01-ln03);
# the screen session will only be accessible from this login node

# close your terminal, then log back into Sherlock
ssh sunetid@login.sherlock.stanford.edu
```
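As an aside, a few standard GNU Screen commands are handy at this point. These are stock Screen features (not specific to Sherlock), shown here as a quick reference:

```bash
# list your existing screen sessions on this login node
screen -ls

# detach from the current session without killing it: press Ctrl-a, then d

# reattach to a detached session by name
screen -r genomics

# kill a session you no longer need
screen -X -S genomics quit
```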
Now reconnect to your session:

```bash
# if you're not assigned the same login node as before, connect to the
# original one (if it's the same, skip this step)
ssh sh01-ln03

# get back to your persistent session
screen -x
```

## Getting computational resources on Sherlock

When we ssh into Sherlock, we automatically obtain a shell on a "login node". So far, we've run all of our commands on a login node. You can see that from the look of the prompt `[username@sh01-ln03 login ~]`. Login nodes are not meant for any heavy computation. You can think of them as entry points into Sherlock, and places to do very simple operations, like browsing or creating directories, viewing files, etc.

For anything more computationally heavy than that, you need to request dedicated resources (in fact, if you run a command on a login node that takes too long to complete or requests too much memory, it will automatically get aborted). The way to ask for resources on Sherlock is through the resource manager and job scheduler Slurm (https://slurm.schedmd.com/documentation.html). There are several ways to ask for resources with Slurm, depending on whether you want to run an interactive job (where you keep entering commands) or want to submit a job that will run without your input when resources become available. This is explained in "Running jobs on Sherlock" (linked under Additional resources below).

For this bootcamp, we are going to request 2 CPUs each for 3 hours:
```
salloc -p astraigh --time=03:00:00 --cpus-per-task=2
```

You should quickly get a message that looks like `salloc: Pending job allocation <jobid>`. Your command line will then read `[username@sh02-09n13 /scratch/groups/astraigh/biochem_minicourse_2024/]$`. This shell is running on a dedicated compute node (here sh02-09n13). Within this shell, you'll have access to 2 CPUs and 32 GB of RAM.

This compute node is part of the partition `astraigh` (specified by the `-p astraigh` flag), which is reserved for our lab. There are other partitions you can use; refer to the Sherlock documentation.

## Loading packages for our subsequent analysis

We preinstalled some bioinformatics tools we're going to use during this bootcamp inside a conda environment. Conda is a package manager that helps keep your software packages in order, and an environment is an isolated set of software packages that are required for a task or computational pipeline.
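For context, here is how an environment like ours could be created from the bioconda channel. This is only an illustrative sketch (the package list is an assumption); you do not need to run it, since the `minicourse` environment is already set up:

```bash
# hypothetical example: build an environment with the tools used in this bootcamp
conda create -n minicourse -c bioconda -c conda-forge fastqc sra-tools

# environments are then activated by name
conda activate minicourse
```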
First add the conda installation to your PATH in your bash profile with these two commands (note the single quotes in the first command, so that `$PATH` is written literally to `~/.bashrc` rather than expanded on the spot):

```
echo 'export PATH=/home/groups/astraigh/miniconda3_py38_4.9.2/bin:$PATH' >> ~/.bashrc
export PATH=/home/groups/astraigh/miniconda3_py38_4.9.2/bin:$PATH
```

Load this environment with
```
conda activate minicourse
# or you can try:
source activate minicourse
```
(alternatively, `conda activate /home/groups/astraigh/miniconda3_py38_4.9.2/envs/minicourse`)

## Managing your SLURM jobs

You can see the jobs you currently have running on the cluster with
```
squeue -u $USER
```

You can cancel a running job with
```
scancel <jobid>
```
where the job ID is obtained from `squeue`.

## Additional resources

- Running jobs on Sherlock: https://www.sherlock.stanford.edu/docs/user-guide/running-jobs/
- Advanced connection options: https://www.sherlock.stanford.edu/docs/advanced-topics/connection/
- GNU Screen: https://www.howtoforge.com/linux_screen

diff --git a/2024_edition/02_externalData_downloadAndBasicsQC.md b/2024_edition/02_externalData_downloadAndBasicsQC.md
new file mode 100644
index 0000000..7aad034
--- /dev/null
+++ b/2024_edition/02_externalData_downloadAndBasicsQC.md
@@ -0,0 +1,145 @@
# Intro to long read data

## Goals
In this part, we're going to download some sequencing data from the Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo/), a public genomics data repository. To familiarize ourselves with accessing, downloading, and analyzing data from other groups, we will use a dataset derived from mouse sperm cells: https://www.ncbi.nlm.nih.gov/sra?term=SRX9822381


Tools and concepts covered in this notebook:
- fastq files
- use basic commands to view files: `cat`, `head` and `less`
- use `wc` to count lines and very simple `awk` scripts to perform line-by-line operations on a file
- use `sort` and `uniq` to create histograms
- chain linux commands with `|` and redirect output to a file with `>`
- run a command line program such as `fastqc` to process the data
- download data from Sherlock to your local machine with `rsync`

## GEO data download
Sequencing data linked from GEO is stored in the Sequence Read Archive (SRA) and can be downloaded using the `fasterq-dump` tool. We'll download the data into the subfolder `external` within the `data` folder we created earlier. Always keep things organized to avoid headaches down the road!

```bash
# recall your folder name & set an environment variable 'me' so you can refer
# to this folder later (replace <teamname> with your folder name)
export me="$GROUP_SCRATCH/biochem_minicourse_2024/<teamname>"

# move to the folder for this dataset
cd $me/data/external

# download the data (note that we use the tmp folder created earlier for the
# temporary files generated by this program)
fasterq-dump SRR13403380 --progress --threads 2 --temp $me/tmp --outdir $me/data/external
```

Note: the download takes a bit of time, so we've already downloaded the file into our directory. Use the `cp` command to copy the subset from our directory to yours (this is the first 100000 lines from the full fastq file):

```bash
cp $GROUP_SCRATCH/biochem_minicourse_2024/straightlab/data/external/SRR13403380_subset.fastq ./
```

Let's do some quick inspection of the data we downloaded.
```
ls -lh
```
We can see that the previous command downloaded a file called `SRR13403380.fastq`, which is 897 MB in size. We can also see that the subset file we copied from the straightlab directory, `SRR13403380_subset.fastq`, is 16 MB.
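For reference, the subset file was described above as simply the first 100000 lines of the full file. A subset like that can be produced with `head`, as in this sketch (using the file names above):

```bash
# how a subset like SRR13403380_subset.fastq can be made from the full file
head -n 100000 SRR13403380.fastq > SRR13403380_subset.fastq
```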
This is a fastq-formatted file, which is the standard format for sequencing data. We can figure out how the file is organized by looking at the first 10 lines.

```
head SRR13403380_subset.fastq
```

This is of the form
```text
<@READID (first read) and some other read info>
<SEQUENCE>
<+ repeat of the READID line>
<QUALITY SCORES>
<@READID (next read) and some other read info>
etc..
```
So each read takes up 4 lines, with the sequence for read 1 at line 2, the sequence for read 2 at line 4+2=6, etc.

## Basic QC
### The quick manual way (optional)
A lot of basic analysis of sequencing data can be done without relying on external software. It is extremely useful to know a few built-in linux commands that come in handy in many contexts, especially for parsing raw sequencing data. Four commands that are used often are `wc`, `sort`, `awk` and `uniq` (if you want to learn more commands, you should also look at `cut`, `join` and `paste`). Let's demonstrate a real-world application of these commands by doing the following QC analysis:

- how many reads are present in our dataset?
- what is the read length distribution?

We can see how many reads are in this file by counting the number of lines and dividing by 4

```
wc -l SRR13403380_subset.fastq
# 100000 SRR13403380_subset.fastq
# 100000/4 = 25000 reads

# we can do the same for the full file
wc -l /scratch/groups/astraigh/biochem_minicourse_2024/straightlab/data/external/SRR13403380.fastq
```

What is the length of each read? We need to count the number of characters on each sequence line. First let's extract the sequences from the file. `awk` is a great unix program for manipulating text files line by line. Here we just keep lines whose number modulo 4 equals 2 (line numbers that leave a remainder of 2 when divided by 4). Let's extract the first 3 sequences (3 reads times 4 lines = `head -n12`). The pipe operator `|` allows us to chain commands. The parenthesized expression in `awk` serves as an if statement: the action runs only on lines where it is true. `$0` is a variable containing the whole line.

```
head -n12 SRR13403380_subset.fastq | awk '(NR%4==2){print $0}'
```

Now, to get the length of these first three sequences, we just print the length of each line instead of the line itself

```
head -n12 SRR13403380_subset.fastq | awk '(NR%4==2){print length($0)}'
```

Finally, we want to build a histogram of how many times each read length is represented. For that, we need two useful commands: `sort` and `uniq`. Sort will just sort the lines, and uniq (which requires sorted input) will count (with flag `-c`) how many times each unique line occurs. We're ready to do that for the whole file rather than the first three sequences, so let's replace `head` with `cat`, which just reads through the whole file

```
cat SRR13403380_subset.fastq | awk '(NR%4==2){print length($0)}' | sort | uniq -c
```

We can redirect the output to a file with the `>` operator. Let's call this file `readlength.hist.txt`. One last command we may want to add to the chain is another `sort`, so that the histogram is sorted with the mode of the distribution first (the length that shows up most often). This is achieved with

```bash
# create histogram and save it into a file
cat SRR13403380_subset.fastq | awk '(NR%4==2){print length($0)}' | sort | uniq -c | sort -k1,1nr > readlength.hist.txt
```
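As a bonus, the same `awk` filtering idea can produce summary statistics in a single pass, by accumulating values across lines and printing them in an `END` block (a sketch; `n` and `total` are just variable names chosen here):

```bash
# count the reads and compute the mean read length in one pass
cat SRR13403380_subset.fastq | awk '(NR%4==2){n++; total+=length($0)} END{print n" reads, mean length "total/n}'
```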
We can look at this file in a scrollable way with `less`.
```
less readlength.hist.txt
```

### QC with a dedicated tool, fastqc, and downloading data
The `fastqc` tool (preinstalled in the `minicourse` conda environment) can be used to get some other QC metrics, in particular about sequencing quality and overrepresented sequences.

```
fastqc SRR13403380_subset.fastq -t 2
```

This produces an HTML file, which we need to download to our computer to look at. File transfer operations to and from Sherlock are best done using `rsync`.

Open a new terminal tab on your computer, and then run
```
# replace <sunetid> and <teamname> with your SUNet ID and folder name, and replace ~/Downloads if necessary
rsync -ah --progress <sunetid>@dtn.sherlock.stanford.edu:/scratch/groups/astraigh/biochem_minicourse_2024/<teamname>/data/external/SRR13403380_subset_fastqc.html ~/Downloads

```
The command has the form `rsync <options> <source> <destination>`.
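Transfers work in both directions: to upload from your laptop to Sherlock, just swap the source and destination. A hypothetical example (the local file name here is made up):

```bash
# upload a local file to your Sherlock project folder
rsync -ah --progress ~/Downloads/notes.txt <sunetid>@dtn.sherlock.stanford.edu:/scratch/groups/astraigh/biochem_minicourse_2024/<teamname>/
```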