NGS/HTS Data Practical

Derek W. Wright, MRC-University of Glasgow Centre for Virus Research Derek.Wright@glasgow.ac.uk

Overview

In this practical, we will be exploring the FASTA, multi-FASTA and FASTQ formats.

Linux Commands

Commands that you need to enter into the terminal window (command line) are presented in a box with a fixed-width font, like this:

ls

A few tips to remember:

  • Use the tab key to automatically complete filenames – especially long ones.
  • Use the up arrow to scroll through your previous commands; this lets you easily re-run or adapt old commands.
  • Case matters, as do spaces and underscores - the following file names are all different:
Myfile.txt
MyFile.txt
MYFILE.txt
myfile.txt
my file.txt
my_file.txt

Watch out for number 1 being confused with lowercase letter L, and capital letter O being confused with zero 0.

  • l = lower case L
  • 1 = number one
  • O = capital letter O
  • 0 = zero

Shorthand/wildcard symbols help to save typing:

  • ~ is shorthand for your home directory
  • . is shorthand for the current working directory
  • .. is shorthand for the directory above
  • * may be used as a wildcard to match file names
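The shorthand symbols above can be tried safely in a scratch directory. The following is a minimal sketch; the file names (sample_A.fasta, sample_B.fasta, notes.txt) are made up for illustration and are not part of the practical dataset.

```shell
# Work in a throwaway directory so none of your real files are touched.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Create three empty example files (hypothetical names).
touch sample_A.fasta sample_B.fasta notes.txt

# * matches any run of characters, so this lists only the two .fasta files.
ls *.fasta

# . is the current working directory; count the .fasta files inside it.
n_fasta=$(ls ./*.fasta | wc -l)
echo "FASTA files here: $n_fasta"

# .. is the directory above; ~ expands to your home directory.
cd ..
echo "Home directory:" ~

# Tidy up the scratch directory.
rm -r "$demo_dir"
```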

Setup

  • Log in to the Windows PC
  • Log in to the alpha2 server via MobaXterm
ssh username@alpha2.cvr.gla.ac.uk
cd ~ 

~ is a shortcut for your home directory

File Formats

Dataset

cp -r /home3/dw73x/Formats .

. is a shortcut for the current working directory

cd Formats

View the FASTA Format

Nucleotide Sequence

less single_seq.fasta
  • Press space or f to scroll down through the file page by page
  • Press b to scroll back up a page
  • Press q to quit

Amino Acid Sequence

less protein.faa
  • SARS-CoV-2 spike protein amino acid sequence from NCBI

Multi-FASTA Format

less BabayanEtAl_sequences.fasta 
  • Press space or f to scroll down through the file page by page
  • Press b to scroll back up a page
  • Press q to quit
grep '>' BabayanEtAl_sequences.fasta
  • Search (grep) for lines with the > symbol in the file
grep '>' BabayanEtAl_sequences.fasta | wc -l
  • Search (grep) for lines with the > symbol in the file
  • Pipe (|) the results into the next command
  • Word count (wc) the number of lines (-l)
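As a small self-contained illustration of the counting step above, the sketch below builds a toy multi-FASTA file (three made-up sequences, not the BabayanEtAl data) and counts its header lines in two ways. `grep -c` is a standard grep option that counts matching lines directly, so it should give the same answer as piping into `wc -l`.

```shell
# Build a tiny multi-FASTA file with three made-up sequences.
toy_fasta=$(mktemp)
printf '>seq1\nACGT\n>seq2\nGGCC\nTTAA\n>seq3\nAACC\n' > "$toy_fasta"

# Same approach as in the practical: list the header lines, then count them.
n_piped=$(grep '>' "$toy_fasta" | wc -l)

# grep -c counts matching lines itself; ^ anchors the match to the line start.
n_counted=$(grep -c '^>' "$toy_fasta")

echo "Sequences (grep | wc -l): $n_piped"
echo "Sequences (grep -c):      $n_counted"

# Remove the temporary file.
rm "$toy_fasta"
```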

View the FASTQ Format

less reads_R1.fastq
  • Press space or f to scroll through the file page by page
  • Press b to scroll back up a page
  • Press q to quit
grep '@SRR1553467.279000' reads_R1.fastq
  • Search (grep) for lines containing the string “@SRR1553467.279000” (i.e. search for the read named SRR1553467.279000)
grep '@SRR1553467.279000' -A 3 reads_R1.fastq
  • As a FASTQ read consists of 4 lines, also return the 3 lines after (-A 3)
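One thing to watch with this kind of search is that grep matches anywhere in a line, so a read name that is a prefix of another read name can match both. The sketch below shows this on a toy file with invented read names (@toy.10 and @toy.100), and one possible way around it using the standard grep options `-F` (treat the pattern as a fixed string, so `.` is a literal dot) and `-w` (match whole words only). Whether this matters for reads_R1.fastq depends on which read names it contains.

```shell
# Toy FASTQ with two reads whose names share a prefix (invented names).
toy_fastq=$(mktemp)
printf '@toy.10\nACGT\n+\nIIII\n@toy.100\nGGCC\n+\nJJJJ\n' > "$toy_fastq"

# A plain search for '@toy.10' also matches the header of '@toy.100'.
n_loose=$(grep -c '@toy.10' "$toy_fastq")
echo "Header lines matched by the plain pattern: $n_loose"

# -F: fixed string, -w: whole-word match, -A 3: also print the next 3 lines.
record=$(grep -F -w -A 3 '@toy.10' "$toy_fastq")
printf '%s\n' "$record"

# A complete FASTQ record is 4 lines long.
n_record_lines=$(printf '%s\n' "$record" | wc -l)

rm "$toy_fastq"
```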
wc -l reads_R1.fastq
  • Word count (wc) the number of lines (-l) in the file (lines not reads)
  • Divide by 4 to get the number of reads
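The division by 4 can be done in the shell itself using arithmetic expansion, `$(( ... ))`. This is a minimal sketch on a toy FASTQ file with two invented reads; on the server you would point it at reads_R1.fastq instead.

```shell
# Toy FASTQ file containing two invented reads (8 lines in total).
toy_fastq=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n' > "$toy_fastq"

# Reading the file via < makes wc print only the number, without the file name.
n_lines=$(wc -l < "$toy_fastq")

# Each FASTQ record is 4 lines, so reads = lines / 4.
n_reads=$(( n_lines / 4 ))

echo "Lines: $n_lines"
echo "Reads: $n_reads"

rm "$toy_fastq"
```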
grep '^@SRR1553467' reads_R1.fastq | wc -l
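  • Search (grep) for lines beginning (^) with the string ‘@SRR1553467’ in the file reads_R1.fastq
  • The run accession is included in the pattern because @ is also a valid quality character, so a quality line can begin with @ as well
  • Pipe (|) the results on to the next command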
  • Word count (wc) the number of lines (-l)
  • Number of reads in the file
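To see why the pattern includes the run accession rather than just `@`, the sketch below uses a toy FASTQ file (invented accession SRR0000001) in which one quality line happens to start with `@`. Counting lines that begin with `@` alone overcounts, while anchoring on the accession, or taking every fourth line with awk, gives the expected number of reads. The awk variant is an alternative, not something used in the practical.

```shell
# Toy FASTQ: two reads; the first read's quality line starts with '@'.
toy_fastq=$(mktemp)
printf '@SRR0000001.1\nACGT\n+\n@III\n@SRR0000001.2\nGGCC\n+\nIIII\n' > "$toy_fastq"

# Counting every line that starts with @ also picks up the quality line.
n_at=$(grep -c '^@' "$toy_fastq")

# Anchoring on the accession matches only the header lines.
n_headers=$(grep -c '^@SRR0000001' "$toy_fastq")

# Alternative: header lines are lines 1, 5, 9, ... (line number mod 4 is 1).
n_awk=$(awk 'NR % 4 == 1' "$toy_fastq" | wc -l)

echo "Lines starting with @:     $n_at"
echo "Lines starting with @SRR…: $n_headers"
echo "Every fourth line (awk):   $n_awk"

rm "$toy_fastq"
```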

Compressed Files

FASTQ files are often gzipped (compressed) and have the .fastq.gz extension. Use the commands zcat, zmore, zless and zgrep to access these compressed files.

zless 00013_OS_L_NA_S1_R1_001.fastq.gz
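The same counting ideas carry over to compressed files without decompressing them on disk. The sketch below creates a small gzipped FASTQ file (the file name toy_R1.fastq.gz and the read names are invented) and counts its reads with zcat and zgrep; it assumes the GNU versions of these tools, as found on most Linux servers.

```shell
# Create a small gzipped FASTQ file in a scratch directory (invented name).
demo_dir=$(mktemp -d)
toy_gz="$demo_dir/toy_R1.fastq.gz"
printf '@toyread.1\nACGT\n+\nIIII\n@toyread.2\nGGCC\n+\nIIII\n@toyread.3\nTTAA\n+\nIIII\n' \
  | gzip -c > "$toy_gz"

# zcat decompresses to standard output, so the result can be piped into wc.
n_lines=$(zcat "$toy_gz" | wc -l)
n_reads=$(( n_lines / 4 ))

# zgrep searches inside the compressed file; -c counts the matching lines.
n_headers=$(zgrep -c '^@toyread' "$toy_gz")

echo "Reads (zcat | wc -l, divided by 4): $n_reads"
echo "Reads (zgrep -c on header lines):   $n_headers"

rm -r "$demo_dir"
```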