`2024_edition/01_sherlockConnection.md`
# Connection to Sherlock and familiarization with the Linux shell

We will do all of our bioinformatics analysis on Sherlock, Stanford's high-performance computing cluster (www.sherlock.stanford.edu). In this part, we will illustrate the basic workflow for connecting to the cluster and getting everything ready for a computational project. The steps are:
- ssh into the cluster
- set up our working directory
- create a persistent session on the cluster to come back to
- request computational resources
- load necessary packages


At the end of this notebook you will know how to:
- navigate the Linux filesystem with `ls`, `cd`, `mkdir`, `pwd`
- set up a nicely organized folder for the project
- use a GNU Screen session
- use SLURM to access computational resources


## Connection to Sherlock
In your browser, go to <login.sherlock.stanford.edu>, then click on `>Shell`.
Alternatively, you can ssh into Sherlock using a terminal app on your computer. On a Mac, you can use the built-in Terminal app: open Terminal (Command-Space, then type "Terminal"), then
```
ssh <username>@login.sherlock.stanford.edu
```


## Setting up our workspace for the project
Go to the group minicourse directory
```
cd $GROUP_SCRATCH/biochem_minicourse_2024
```
Note that `$GROUP_SCRATCH` is a bash variable, which contains the path to a default "scratch" folder for our lab. You can see the content of this variable with
```
echo $GROUP_SCRATCH
```

Make a new folder for yourself and move into it (for example, team Straight might use `straightlab`):
```
mkdir <your_dir_name>
cd <your_dir_name>
```

This folder is currently empty. You can list the content of the current folder with
```
ls
```

You can always see your current location in the filesystem with
```
pwd
```

Now let's create a few subfolders to organize our work. We want our project directory (the team directory in this case) to look like this
```text
<your_dir_name>
├── data
│   ├── samples
│   └── external
└── tmp
```

- The `tmp` folder is a scratch folder for temporary files (some programs create temporary files and require a dedicated folder for them).
- The `data` folder will contain the analyses for specific datasets, arranged into subfolders.

Make these directories with
```bash
mkdir data

mkdir data/samples data/external

mkdir tmp
```
Verify the tree structure with
```bash
tree .
```
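As a shortcut, `mkdir -p` creates any missing parent directories, so the same layout can be made in a single command (a small sketch equivalent to the three `mkdir` calls above):

```bash
# create data/, data/samples, data/external and tmp in one go
mkdir -p data/samples data/external tmp
```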

## Running persistent / recoverable sessions with Gnu Screen

What happens if I am in the middle of some task on Sherlock and I lose internet connection or close my computer? To avoid having to go back to square one, we need to set up a persistent bash session. The standard way to do this is using a window manager such as GNU Screen.

```bash
screen -S genomics
```

This creates a new session called `genomics`. Let's demonstrate what it does by putting a mark in our terminal, leaving, and then coming back.

```bash
# leave a mark
echo "I am here"
# keep track of the node number in the prompt (for example sh01-ln03)
# the screen session will only be accessible from this login node

# close your terminal, then relogin into sherlock
ssh sunetid@login.sherlock.stanford.edu

# if you're not assigned the same login node as before, connect to the original one. If it's the same skip this step
ssh sh01-ln03

# get back to your persistent session
screen -x
```
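A few additional `screen` commands are worth knowing; a short sketch (assuming the session name `genomics` created above):

```bash
# detach from the current session without closing it: press Ctrl-a, then d

# list your existing screen sessions on this login node
screen -ls

# reattach to a session by name
screen -r genomics
```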

## Getting computational resources on Sherlock

When we ssh into Sherlock, we automatically obtain a shell on a "login node". So far, we've run all of our commands on a login node. You can see that from the look of the prompt `[username@sh01-ln03 login ~]`. Login nodes are not meant for heavy computations. You can think of them as entry points into Sherlock, and places to do very simple operations, like browsing/creating directories, viewing files, etc.

For anything more computationally heavy than that, you need to request dedicated resources (in fact, if you run a command on a login node that takes too long to complete or uses too much memory, it will automatically be killed). The way to ask for resources on Sherlock is through the resource manager and job scheduler Slurm (https://slurm.schedmd.com/documentation.html). There are several ways to ask for resources with Slurm, depending on whether you want to run an interactive job (where you keep entering commands) or want to submit a job that will run without your input when resources become available. This is explained in "Running jobs on Sherlock" <https://www.sherlock.stanford.edu/docs/user-guide/running-jobs/>.
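In this bootcamp we'll take the interactive route below, but for completeness, non-interactive jobs are submitted with `sbatch`. Here is a minimal batch-script sketch (the script name `myjob.sh`, job name, and resource values are placeholders for illustration, not course-provided settings):

```bash
#!/bin/bash
#SBATCH --job-name=example          # a label shown by squeue
#SBATCH --partition=astraigh        # same lab partition as used with salloc below
#SBATCH --time=01:00:00             # wall-clock limit (HH:MM:SS)
#SBATCH --cpus-per-task=2           # number of CPUs
#SBATCH --mem=8G                    # memory request

# the commands below run once resources are allocated
echo "running on $(hostname)"
```

You would submit such a script with `sbatch myjob.sh` and monitor it with `squeue -u $USER` (see "Managing your SLURM jobs" below).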

For this bootcamp, we are going to request 2 CPUs each, for 3 hours.
`salloc -p astraigh --time=03:00:00 --cpus-per-task=2`

You should quickly get a message that looks like this: `salloc: Pending job allocation <ID>`.
Your prompt will then read `[username@sh02-09n13 /scratch/groups/astraigh/biochem_minicourse_2024/<your_dir_name>]$`.
This shell is running on a dedicated compute node (here sh02-09n13). Within this shell, you'll have access to 2 CPUs and 32 GB of RAM.

This compute node is part of the partition `astraigh` (specified by the `-p astraigh` flag), which is reserved for our lab. There are other partitions you can use; refer to the Sherlock documentation.
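If you're curious which partitions exist and whether their nodes are busy, `sinfo` gives a quick overview (a small sketch; the exact columns shown may vary):

```bash
# list all partitions visible to you, with node states
sinfo

# show only the lab partition
sinfo -p astraigh
```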

## Loading packages for our subsequent analysis

We preinstalled some bioinformatics tools we're going to use during this bootcamp inside a conda environment. Conda is a package manager that helps keep your software packages in order, and an environment is an isolated set of packages required for a particular task or computational pipeline.

First add the conda path to your bash profile with these two commands:

```
echo 'export PATH=/home/groups/astraigh/miniconda3_py38_4.9.2/bin:$PATH' >> ~/.bashrc
export PATH=/home/groups/astraigh/miniconda3_py38_4.9.2/bin:$PATH
```

Load this environment with
```
conda activate minicourse
#or you can try:
source activate minicourse
```
(alternatively, `conda activate /home/groups/astraigh/miniconda3_py38_4.9.2/envs/minicourse`)
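As a quick sanity check that the environment is active, you can list the environments conda knows about and confirm that a course tool resolves to the environment's path (a sketch; `fastqc` is one of the tools used later):

```bash
# list environments; the active one is marked with an asterisk
conda env list

# confirm a tool from the environment is on your PATH
which fastqc
```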

## Managing your SLURM jobs

You can see the jobs currently running on the cluster with
```
squeue -u $USER
```

You can cancel a running job with
```
scancel <JOBID>
```
where `<JOBID>` is the job ID reported by `squeue`.
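If you have several jobs queued and want to cancel all of them at once, `scancel` also accepts a user filter (use with care; this cancels every one of your jobs):

```bash
scancel -u $USER
```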

## Additional resources

- Running jobs on Sherlock: https://www.sherlock.stanford.edu/docs/user-guide/running-jobs/
- Advanced connection options: https://www.sherlock.stanford.edu/docs/advanced-topics/connection/
- GNU Screen: https://www.howtoforge.com/linux_screen
`2024_edition/02_externalData_downloadAndBasicsQC.md`
# Intro to long read data

## Goals
In this part, we're going to download some sequencing data from the Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo/), a public genomics data repository. To familiarize ourselves with accessing, downloading, and analyzing data from other groups, we will use a dataset derived from mouse sperm cells: https://www.ncbi.nlm.nih.gov/sra?term=SRX9822381



Tools and concepts covered in this notebook:
- fastq files
- use basic commands to view files `cat`,`head` and `less`
- use `wc` to count lines and very simple `awk` scripts to perform line-by-line operations on a file
- use `sort` and `uniq` to create histograms
- chain linux commands with `|` and redirect output to a file with `>`
- run a command line program such as `fastqc` to process the data in some manner
- download data from Sherlock to your local machine with `rsync`

## GEO data download
Raw sequencing data linked from GEO is stored in the Sequence Read Archive (SRA) and can be downloaded with the `fasterq-dump` tool.
We'll download the data within the subfolder `external` within the folder `<your_dir>/data` we created earlier. Always keep things organized to avoid headaches down the road!

```bash
#recall your folder name & set an environment variable 'me' so you can refer to this folder later
export me="$GROUP_SCRATCH/biochem_minicourse_2024/<your_dir>"

#move to the folder for this dataset
cd $me/data/external

# download the data (note that we use the tmp folder created earlier for the temporary files generated by this program)
# NOTE: the following command takes a while, so we've already downloaded the file for you;
# instead of running it, use cp to copy the subset (the first 100000 lines of the full fastq) from our directory to yours
# fasterq-dump SRR13403380 --progress --threads 2 --temp $me/tmp --outdir $me/data/external
cp $GROUP_SCRATCH/biochem_minicourse_2024/straightlab/data/external/SRR13403380_subset.fastq ./
```
Let's do some quick inspection of the data we downloaded.
```
ls -lh
```
If you had run `fasterq-dump` yourself, it would have produced a file called `SRR13403380.fastq`, which is 897 MB in size.
The subset file we copied from the straightlab directory, `SRR13403380_subset.fastq`, is 16 MB.
These are fastq-formatted files; fastq is the standard format for sequencing data. We can figure out how the file is organized by looking at the first 10 lines.
```
head SRR13403380_subset.fastq
```
This is of the form
```text
<@READID (first read) and some other read info>
<DNA sequence>
<+repeat of the READID line>
<Base call quality>
<@READID (next read) and some other read info>
etc..
```
So each read takes up 4 lines, with the sequence for read 1 at line 2, the sequence for read 2 at line 4+2=6, and so on.

## Basic qc
### The quick manual way (optional)
A lot of basic analysis of sequencing data can be done without relying on external software. It is extremely useful to know a few built-in Linux commands that come in handy in many contexts, especially for parsing raw sequencing data. Four frequently used commands are `wc`, `sort`, `awk`, and `uniq` (if you want to learn more, also look at `cut`, `join`, and `paste`). Let's demonstrate a real-world application of these commands by doing the following QC analysis:

- how many reads are present in our dataset
- what is the read length distribution

We can see how many reads are in this file by counting the number of lines and dividing by 4

```
wc -l SRR13403380_subset.fastq
# 100000 SRR13403380_subset.fastq
# 100000/4 = 25000 reads
#we can do the same for the full file
wc -l /scratch/groups/astraigh/biochem_minicourse_2024/straightlab/data/external/SRR13403380.fastq
```
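If you'd rather let the shell do the division, you can combine `wc -l` with bash arithmetic (a small convenience sketch for the subset file):

```bash
# number of reads = number of lines / 4
echo $(( $(wc -l < SRR13403380_subset.fastq) / 4 ))
```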

What is the length of each read? We need to count the number of characters for each sequence line. First let's extract the sequences from the file. `awk` is a great unix program to manipulate text files line by line. Here we just filter lines with a number modulo 4 = 2 (line numbers divided by 4 that give a remainder of 2).
Let's extract the first 3 sequences (3 reads × 4 lines each = `head -n12`). The pipe operator `|` allows us to chain commands. The parenthesized expression in `awk` acts as a condition (like an if statement), and `$0` is a variable containing the whole line.

```
head -n12 SRR13403380_subset.fastq | awk '(NR%4==2){print $0}'
```

Now, to get the length of these first three sequences, we just print the length of the line instead of the line itself

```
head -n12 SRR13403380_subset.fastq | awk '(NR%4==2){print length($0)}'
```

Finally, we want to build a histogram of how many times each read length is represented. For that, we need two useful commands: `sort` and `uniq`. `sort` will just sort the lines, and `uniq` (which requires sorted input) will count (with the `-c` flag) how many times each unique line occurs. We're ready to do that for the whole file rather than the first three sequences, so let's replace `head` with `cat`, which just reads through the whole file

```
cat SRR13403380_subset.fastq | awk '(NR%4==2){print length($0)}' | sort | uniq -c
```

We can redirect the output to a file with the `>` operator. Let's call this file `readlength.hist.txt`. One last command we may want to add to the chain is another `sort`, so that the histogram is ordered with the mode of the distribution first (the length that shows up most often). This is achieved with

```bash
#create histogram and save it into a file
cat SRR13403380_subset.fastq | awk '(NR%4==2){print length($0)}' | sort | uniq -c | sort -k1,1nr > readlength.hist.txt
```

We can look at this file in a scrollable way with `less`.
```
less readlength.hist.txt
```

### QC with a dedicated tool fastqc and downloading data
The `fastqc` tool (preinstalled for this course) can be used to get some other QC metrics, in particular about sequencing quality and overrepresented sequences.

```
fastqc SRR13403380_subset.fastq -t 2
```

This produces an HTML file, which we need to download to our own computer to view. File transfers to and from Sherlock are best done with `rsync`.

Open a new terminal tab on your computer, and then run
```
# replace <straightlab> with your folder name, and replace ~/Downloads if necessary
rsync -ah --progress <username>@dtn.sherlock.stanford.edu:/scratch/groups/astraigh/biochem_minicourse_2024/<straightlab>/data/external/SRR13403380_subset_fastqc.html ~/Downloads
```
The command has the form `rsync <option flags> [source path] [destination path]`. The `-h` and `--progress` flags tune rsync's behavior so it displays progress in a human-readable way, and `-a` (archive mode) preserves timestamps and permissions on files.

The source is on Sherlock; for file transfers we specifically use Sherlock's dedicated data transfer node, `dtn`. The destination path in this example is your local Downloads directory.
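The same pattern works in the other direction: to upload a file from your laptop to Sherlock, swap the source and destination (a sketch with a hypothetical local file name; run it on your local machine):

```bash
rsync -ah --progress ~/Downloads/myfile.txt <username>@dtn.sherlock.stanford.edu:/scratch/groups/astraigh/biochem_minicourse_2024/<your_dir>/data/
```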

Now let's take a look. In a terminal window on your local computer (not on Sherlock), run:
```
open ~/Downloads/SRR13403380_subset_fastqc.html
```

### NanoStat
Another simple tool that produces QC metrics more relevant to long-read data is `NanoStat`.

```bash
NanoStat --fastq SRR13403380_subset.fastq -n nanostat.summary

head nanostat.summary
```

Questions:
- What is the N50 for this dataset?
- What is the median Q score?
- What is the median error rate?

`2024_edition/03_blastExample.md`
# Taxonomic analysis

We start with a fastq file. We want to identify where the reads are coming from. We will start with a simple BLAST search.


## Blast

BLAST takes a FASTA file as input, so we need to convert our FASTQ into FASTA.

```bash
#recall your folder name
export me=$GROUP_SCRATCH/biochem_minicourse_2024/<your_dir>

#move to the folder for the dataset
cd $me/data/external

awk '(NR%4==1){print ">"$0; getline; print $0}' SRR13403380_subset.fastq > SRR13403380_subset.fasta
```
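As an aside, if `seqtk` is available in the minicourse environment (an assumption — check with `which seqtk`), it offers a one-line alternative to the `awk` conversion above:

```bash
# fastq -> fasta
seqtk seq -a SRR13403380_subset.fastq > SRR13403380_subset.fasta
```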

Let's start by looking at the first 10 reads (each read takes 2 lines in FASTA format, hence `head -n 20`):
```
head -n 20 SRR13403380_subset.fasta > SRR13403380.head.fasta
```

We have multiple databases installed on Sherlock, which correspond to some of those available on NCBI BLAST online. Here, because we know what type of samples we are dealing with, we have downloaded just a few databases and combined them into one BLAST database. The path of this database is
```
/scratch/groups/astraigh/biochem_minicourse_2024/straightlab/blastdb/combined_db
```

Let's run BLAST using this database. We will put the output in a `blast` folder.

```bash
#create folder for output
mkdir -p blast

# we can set a variable for this blast database so we don't have to type out the full path every time
myblastdb=/scratch/groups/astraigh/biochem_minicourse_2024/straightlab/blastdb/combined_db

#run blast
blastn -query SRR13403380.head.fasta -db $myblastdb -num_threads 2 > blast/SRR13403380_rawoutput.first10.txt
```

Taking a look at this file with `less` shows that this command returns the BLAST hits in a format similar to what we obtain when running BLAST online.

Note that we can use "process substitution" with `<()` to feed the first 10 reads to BLAST on the fly, without having to create an intermediate file:

```
blastn -query <(head -n 20 SRR13403380_subset.fasta) -db $myblastdb -num_threads 2 > blast/SRR13403380_raw.output.first10.txt
```


We can instead ask BLAST to produce a tabular output, which is more computationally friendly, using the `-outfmt 6` flag.

```
blastn -query <(head -n 20 SRR13403380_subset.fasta) -db $myblastdb -num_threads 2 -subject_besthit -outfmt "6 stitle score" > blast/SRR13403380_tabulated.output.first10.txt
```
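Because we asked for `stitle score`, the subject title is the first tab-separated column, so the `sort | uniq -c` pattern from the QC notebook can summarize which database entries the reads hit; a sketch:

```bash
# count hits per subject title, most frequent first
cut -f1 blast/SRR13403380_tabulated.output.first10.txt | sort | uniq -c | sort -k1,1nr
```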
