Command-line tools, commands, and code snippets for performing routine data processing and bioinformatics tasks.
Table of Contents
- helpful_commands
- awk
- Add a header line to a file
- Add up the values in a column
- Convert a CSV file to a FASTA file
- Create a new column from two existing columns
- Extract sequences from a multi-FASTA file based on title
- For each category in one column, add up the values in another column
- Print column names and numbers
- Print lines in file when a certain column contains a specific value
- Print lines where certain fields contain values of interest
- Print only specific columns, identified by name in the first row
- Print only the first non-commented line
- Print only the lines coming after a certain starting line and before a certain ending line
- Print the average read length for a FASTQ file
- Print the number of lines exhibiting each distinct number of fields
- Print the values observed in a specific column, along with the number of times each value is observed
- Replace certain values in specific columns
- Reverse the order of lines in a file using awk
- Skip footer lines
- Skip header lines
- Sort lines based on order of IDs in another file
- Write each row to a separate file named after the value in a specific column
- Bash
- Change Bash prompt temporarily
- Check Bash scripts for potential issues
- Extract a file
- Extract part of filename
- Perform a calculation on the command line
- Redirect STDERR to STDOUT and view both and append both to a file
- Run groups of commands in parallel using a Bash function and xargs
- Save the output of a command in a variable
- Set variables using values from a CSV file
- View STDOUT and append it to a file
- brew
- Conda
- Activate an environment
- Add additional packages to an environment
- Create a conda environment from a yaml file
- Create an environment and install some packages
- Deactivate an environment
- Export an environment to a yaml file
- Install Miniconda
- List available packages
- List environments
- List packages installed in the active environment
- Remove an environment
- Search for a specific package
- csvkit
- cut
- datamash
- Docker
- Annotate a bacterial genome using Bakta
- Annotate a bacterial genome using Prokka
- Compare sequence reads to a bacterial genome to find SNPs using Snippy
- Convert documents using Pandoc
- Delete all containers that are not running
- Delete all images
- Get an interactive bash shell in a Docker container
- Kill all running containers
- List images
- List running containers
- Override the entrypoint
- Perform a sequence comparison using legacy BLAST
- Set an environment variable inside a Docker container
- Stop a container
- FASTA and FASTQ files
- Add a FASTA title to the start of a sequence in RAW format
- Combine FASTA files into a single file replacing each FASTA record name with the name of the input file
- Convert a FASTA file to a CSV file with column names
- Count reads in compressed FASTQ files using Slurm and parallel
- Count the bases in a FASTQ file
- Count the reads in a FASTQ file
- Download a reference genome FASTA file from Ensembl
- Download FASTQ files based on a list of SRA accessions using Kingfisher
- Download FASTQ files based on a list of SRA accessions using the SRA Toolkit
- Download reference genome FASTA file and related files from NCBI
- Extract FASTA sequences from a file based on a file of sequence names of interest
- Merge compressed FASTQ files
- Obtain the flanking sequence of a site of interest from a FASTA file
- Process FASTQ files in pairs
- Process FASTQ files in pairs using parallel
- Reorder the sequences in a FASTA file based on a VCF file header
- Run FastQC on compressed FASTQ files using Slurm and a job array
- Split a multi-FASTA file into separate files named according to the sequence title
- File conversion
- Convert a website to PDF
- Convert between audiovisual file formats
- Convert between sequence file formats
- Convert CSV to Excel
- Convert CSV to HTML
- Convert CSV to Markdown
- Convert CSV to TSV
- Convert EPUB to PDF
- Convert DOCX to PDF
- Convert Excel to CSV using csvkit
- Convert GenBank to FASTA
- Convert GenBank to GFF
- Convert HTML to PDF
- Convert HTML to PNG
- Convert Markdown to HTML
- Convert Markdown to PDF
- Convert Markdown to PPTX
- Convert PDF to Markdown
- Convert PDF to PNG
- Convert PNG to PDF
- Convert PPTX to PDF
- Convert PPTX to PNG
- Convert SVG to PNG
- Convert TSV to CSV
- Convert TSV to Excel
- File downloads
- Download a complete web page as a single HTML file
- Download a GenBank file with bio
- Download a GenBank file with curl
- Download a reference genome GTF file from Ensembl
- Download an entire website
- Download files from a Globus endpoint using Globus CLI
- Download from an FTP server
- Download from Google Drive
- Download reference genomes and related files from NCBI
- Download sequences from the NCBI Assembly database
- find
- Copy the files returned by find
- Copy the files returned by find, naming the copies after a directory in the path
- Determine the total size of certain files using find
- Find large files
- Perform a series of commands on files returned by find
- Redirect output to one file
- Redirect output to separate files
- Remove files using find
- Run a command on multiple files at once
- Sort files by name before processing
- Sort the files by creation time before processing
- Switch to the directory containing each file and execute a command
- Use files as the argument list for a command
- Git
- Add or edit a remote repository
- Check the status of a working directory
- Clone all your repositories
- Create a new Git repository
- Create and merge Git branches
- List the files being tracked in the repository
- Mark changed files to be included in the next commit
- Move or rename a file or directory
- Pull a change from a remote repository to your local branch
- Push a commit on your local branch to a remote repository
- Remove files from the repository
- Save the marked files to the local Git repository
- Specify files to ignore
- Sync a repository to your local machine
- Tag a release
- Undo a Git add before a commit
- View the changes you haven't staged for commit
- View the commit history
- View the status of files in your working directory
- grep
- Image files
- join
- Mamba
- Activate an environment with mamba
- Add additional packages to an environment with mamba
- Create an environment and install some packages with mamba
- Create an environment from a yaml file with mamba
- Deactivate an environment with mamba
- Export an environment to a yaml file with mamba
- Install Mamba
- List available packages with mamba
- List environments with mamba
- List packages installed in the active environment with mamba
- Remove an environment with mamba
- Search for a specific package with mamba
- md5sum
- Miller
- Other
- Add a header to all files with a certain extension, getting the header from another file
- Add text or a header to the beginning of all files with a particular file extension
- Add text to the beginning of a file
- Browse, search, and edit a large CSV file
- Calculate coverage statistics for a BAM file
- Change the extension of multiple files
- Copy an ssh public key to another system
- Create a collection of MP3 files from a YouTube playlist
- Edit a PDF file
- Find common lines between files
- Format a CSV file into columns and examine its content
- Format code
- Obtain your public IP address and network information
- Perform a calculation using bc
- Perform a calculation using expr
- Perform a remote BLAST search
- Perform a calculation using qalc
- Prevent a command from stopping when you log out or exit the shell
- Remove duplicate files
- Reverse the order of lines in a file
- Run commands at scheduled times using cron
- Use SQL-like queries to work with a CSV or TSV file
- parallel
- paste
- Perl
- Count the number of lines that match a regular expression
- Format Perl code
- Get a random sample of lines from a text file while excluding the header line
- Print lines that match a pattern
- Print matches after additional editing
- Print matches that may span multiple lines
- Remove commas located within quoted fields in a CSV file and create a tab-delimited file
- Remove lines that match a pattern
- Replace ^M
- Replace commas with tabs
- Replace tabs with commas and remove quotes
- Search and replace text on each line
- Sort sections in a Markdown file based on headings
- Split a Markdown file into separate files based on a heading
- R
- Add columns from one tibble to another
- Add comment lines to output
- Change values in a column based on values in another column
- Cluster gene lists based on overlap and identify shared genes
- Combine multiple input files
- Compare two data sets to find differences
- Filter and sort rows
- Split multi-locus genotype strings into multiple columns and decode genotypes
- Split two-allele genotypes into two columns
- Transpose a data frame
- Understand an object
- Visualize the degree of overlap among gene sets
- rsync
- sed
- Share data with project group members
- Singularity
- Slurm
- Cancel a job
- Cancel all jobs
- Run a Snakemake workflow
- Run an nf-core Nextflow workflow
- Start an interactive session
- View accounting information for completed jobs
- View detailed information for a specific job
- View jobs
- View pending jobs
- View running jobs
- View statistics related to the efficiency of resource usage of a completed job
- sort
- tmux
- tr
- VCF files
- Add all INFO tags
- Add predicted consequences
- Add variant IDs
- Assess sex by calculating X-chromosome heterozygosity
- Assess sex using plink
- Change sample order
- Check relatedness between samples
- Combine rows / concatenate files from the same set of samples
- Convert a VCF file to an Excel file
- Count genotypes
- Count Mendelian errors using plink
- Count sites
- Determine genotype concordance
- Determine the proportion of missing genotypes in each sample
- Extract biallelic SNPs
- Extract indels
- Extract sites found in all VCF files
- Extract sites found in either but not both VCFs
- Extract sites from the first file that are found in the second
- Extract sites not found in a second VCF
- Extract SNPs
- Extract variants from a region of interest
- Extract variants from multiple regions of interest
- Extract variants from multiple regions of interest described in a file
- Extract variants where FILTER is PASS
- Extract variants with a missing ID
- Extract variants with an assigned ID
- Filter sites based on genotypes and other criteria
- Filter variants based on predicted consequences
- Find runs of homozygosity using plink
- Interpreting INFO tags
- Keep a subset of samples
- Keep sites from chromosomes and reheader based on reference genome
- Keep sites that are homozygous reference in all samples
- Merge files from non-overlapping sample sets
- Merge VCF files in batches using Slurm
- Merge VCF files in batches using Slurm and a job array
- Perform case-control analysis
- Print samples
- Remove a sample
- Remove all genotypes
- Remove annotations
- Remove sites that are homozygous reference in all samples
- Remove sites with any missing genotypes
- Rename samples
- Transfer annotations from another VCF
- vim
- awk
awk 'BEGIN{print "my header text"}1' input.txt
In this example values in column 5
are summed in a file with columns separated by one or more blank spaces:
awk -F' {1,}' '{sum+=$5} END {print sum}' input.txt
In this example column 1
contains the sequence title and column 3
contains the sequence:
awk -F, '{print ">"$1"\n"$3"\n"}' input.csv
In this example the values in columns 3
and 4
are added to create a new column:
awk -F, '{print $0,$3+$4}' input.txt
In this example the sequence titles to extract (X
and Y
) are listed in the file ids.txt
:
X
Y
The sequences are extracted from the file ARS-UCD1.2_Btau5.0.1Y.fa
:
awk -F'>' 'NR==FNR{ids[$0]; next} NF>1{f=($2 in ids)} f' ids.txt ARS-UCD1.2_Btau5.0.1Y.fa
In this example values in column 2
are summed up for each category in column 1
:
awk -F, '{a[$1]+=$2}END{for(i in a) print i,a[i]}' input.csv
In this example the first row of the input file contains the column names:
awk -F $'\t' 'NR>1{exit};{for (i = 1; i <= NF; i++) print "column " i,"is " $i}' input.tab
In this example lines are printed when the value in column 1
equals 9913
:
awk -F, '{if ($1 == 9913) print $0}' input.csv
To also print all lines starting with #
(header lines for example), use:
awk -F, '$1 ~ /^#/ {print; next} {if ($1 == 9913) print $0}' input.csv
Print the first line and lines when the value in column 1
equals 9913
:
awk -F, 'NR==1; NR > 1 {if ($1 == 9913) print $0}' input.csv
In this example the value in column 2
is printed when column 1
contains text that matches comp
:
awk -F, '$1 ~ /comp/ { print $2 }' input.csv
In this example lines where column 2
equals 7
and column 3
is between 60240145
and 60255062
are printed:
awk -F, '{ if ($2 == 7 && $3 >= 60240145 && $3 <= 60255062) print $0 }' input.csv
In this example the header line is printed, followed by lines where column 1 and column 2 both start with MT-:
awk -F $'\t' 'NR==1{print; next} $1~/^MT-/ && $2~/^MT-/' input.tab
In this example the columns named Affy SNP ID
and Flank
are printed:
awk -F, 'NR==1 { for (i=1; i<=NF; i++) { ix[$i] = i } } NR>1 { print $ix["Affy SNP ID"]","$ix["Flank"] }' input.csv
awk '/^[^#]/ { print $0;exit; }' input.txt
In this example the lines coming after a line starting with IlmnID
and before a line starting with [Controls]
are printed:
awk -F, '/^IlmnID/{flag=1;print;next}/^\[Controls\]/{flag=0}flag' input.csv
awk 'NR%4==2{sum+=length($0)}END{print sum/(NR/4)}' input.fastq
awk -F $'\t' '{count[NF]++}END{for(j in count) print j" fields ("count[j]" lines)"}' input.tab
Print the values observed in a specific column, along with the number of times each value is observed
In this example the counts for each distinct value in column 9
are printed:
awk -F $'\t' '{count[$9]++}END{for(j in count) print j"\t"count[j]}' input.tab
Same as above but skipping the first line and formatting the output to increase readability:
awk -F $'\t' 'NR>1{count[$9]++}END{for(j in count) printf "%-20s %s\n",j,count[j]}' input.tab
In this example the counts for each distinct pair of values in column 1
and column 2
are printed:
awk -F $'\t' '{count[$1"\t"$2]++}END{for(j in count) print j"\t"count[j]}' input.tab
In this example 1
and -1
in column 23
are replaced with forward
and reverse
, respectively:
awk -F\\t 'BEGIN {OFS = "\t"} {sub(/^1/, "forward", $23); sub(/^-1/, "reverse", $23); print}' input.tab
awk '{a[i++]=$0} END {for (j=i-1; j>=0;) print a[j--] }' input.txt
In this example the contents of the file are printed except for the last 15
lines:
awk 'NR>c{print A[NR%c]} {A[NR%c]=$0}' c=15 input.txt
The following skips lines starting with #
:
awk '/^[^#]/ { print $0 }' input.txt
The following skips the first 5
lines:
awk 'FNR > 5 { print $0 }' input.txt
In this example the records in file_to_sort.csv
have an identifier in column 1
that is present in sorted_ids.txt
, which has a single column:
awk -F, 'NR==FNR {a[$1]=$0; next} ($0 in a) {print a[$0]}' file_to_sort.csv sorted_ids.txt
If both files have a single header line the following can be used to generate the sorted output with the restored header line:
(head -n 1 file_to_sort.csv && awk -F, 'NR==FNR {a[$1]=$0; next} ($0 in a) {print a[$0]}' <(tail -n +2 file_to_sort.csv) <(tail -n +2 sorted_ids.txt)) > sorted.csv
In this example each file is named after the value in column 1
:
awk -F $'\t' '{ fname = $1 ".txt"; print >>fname; close(fname) }' input.tab
PS1="$ "
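The change lasts only for the current shell session; starting a new shell restores the usual prompt. A small variation (not part of the original snippet) keeps the current working directory visible using Bash's \w prompt escape:
PS1="\w $ "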
Use shellcheck:
shellcheck some_script.sh
The following Bash function can be used to extract a variety of file types:
extract() {
if [ -f "$1" ]; then
case "$1" in
*.tar.bz2) tar xjf "$1" ;;
*.tar.gz) tar xzf "$1" ;;
*.bz2) bunzip2 "$1" ;;
*.rar) unrar e "$1" ;;
*.gz) gunzip "$1" ;;
*.tar) tar xf "$1" ;;
*.tbz2) tar xjf "$1" ;;
*.tgz) tar xzf "$1" ;;
*.zip) unzip "$1" ;;
*.Z) uncompress "$1" ;;
*.7z) 7z x "$1" ;;
*) echo "'$1' cannot be extracted via extract()" ;;
esac
else
echo "'$1' is not a valid file"
fi
}
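A usage sketch, assuming the function above has been defined in the current shell (for example by adding it to ~/.bashrc and running source ~/.bashrc); archive.tar.gz is a placeholder filename:
extract archive.tar.gz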
In this example the filename NS.2035.002.IDT_i7_9---IDT_i5_9.2032929-45-1_R1.fastq.gz
contains the sample name 9.2032929-45-1
.
To obtain the sample name:
file=NS.2035.002.IDT_i7_9---IDT_i5_9.2032929-45-1_R1.fastq.gz
# remove from the start up to and including the first i5_
without_upstream=${file#*i5_}
# remove from the end up to and including the first _
without_upstream_and_downstream=${without_upstream%_*}
# check the result
# 9.2032929-45-1
echo $without_upstream_and_downstream
Use Bash arithmetic expansion:
n=6
echo "$(( n - 1 * 2 ))"
answer="$(( n - 1 * 2 ))"
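Standard operator precedence applies, so 1 * 2 is evaluated before the subtraction and the result above is 4. Parentheses can be added to force a different grouping:
echo "$(( (n - 1) * 2 ))"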
some_command 2>&1 | tee -a log
The following code processes sequence files using the filtlong tool. Files are processed in parallel using xargs and a Bash function called process_file:
# Define the directory paths
input_dir="fastq-input"
output_dir="filtlong-output"
# Define the number of processes
num_processes=10
# Define a function to process each file
process_file() {
input="$1"
# Create a unique log file for this process
log_file="${output_dir}/$(basename "${input}").log"
# Extract the directory part from the log file path and create the directory
log_dir=$(dirname "${log_file}")
mkdir -p "${log_dir}" # Ensure the log file's directory exists
# Get the relative path of the input file
relative_path=${input#${input_dir}/}
# Get the directory and filename of the relative path
relative_dir=$(dirname "${relative_path}")
relative_file=$(basename "${relative_path}")
echo "Processing ${relative_file}..." >> "${log_file}"
# Define the output file
output="${output_dir}/${relative_dir}/${relative_file}"
# If the output file already exists, inform and skip
if [[ -f "${output}" ]]; then
echo "Output file ${output} already exists, skipping ${relative_file}..." >> "${log_file}"
return 0 # Skip this file
fi
# Create the corresponding directory in the output directory if it doesn't exist
mkdir -p "${output_dir}/${relative_dir}"
# Run filtlong and pipe the results into bgzip, then write the results to the output file
# Direct process output to individual log file
if filtlong --min_length 200 "${input}" 2>> "${log_file}" | bgzip > "${output}"; then
echo "Writing output to ${output}..." >> "${log_file}"
else
echo "Error processing ${input}..." >> "${log_file}"
fi
}
export -f process_file
# Export the directory variables so they are available in the subshells started by xargs
export input_dir output_dir
# Find each .fq.gz file in the input directory and its subdirectories, then use xargs to process them in parallel
find "${input_dir}" -name "*.fq.gz" -print0 | xargs -0 -n 1 -P "${num_processes}" -I {} bash -c 'process_file "$@"' _ {}
The above reads compressed FASTQ files from an input directory and uses the filtlong
tool to filter reads based on sequence length.
The output generated for each input file is compressed using bgzip
and written to a separate file. The script creates an output directory structure to match the input directory structure. For example, if the input file is fastq-input/sample1/sample1.fq.gz
, the output file will be written to filtlong-output/sample1/sample1.fq.gz
.
For every file processed, a unique log file is created within the output directory.
Input files are skipped if their output file already exists in the output directory.
Use a subshell:
result=$(echo "sqrt(16)" | bc -l)
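The captured value can then be used like any other shell variable, for example:
echo "$result" # 4.00000000000000000000 (bc -l sets the scale to 20)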
The first read
in the example below is used to skip a header line.
{
read
while IFS="," read -r column1 column2 column3
do
echo "$column1, $column2, $column3"
done
} < "input.csv"
In this example gene names and accessions are read from a file and stored in parallel arrays that are then used to construct esearch
and efetch
commands that download the sequences.
Sample input, stored in queries.csv
:
narG,NP_415742.1
nirK,WP_010967660.1
nirS,NP_249210.1
nosZ,NP_252082.1
Read gene names and accessions from queries.csv
and store in parallel arrays:
gene=()
accession=()
while IFS=',' read -ra array; do
gene+=("${array[0]}")
accession+=("${array[1]}")
done < "queries.csv"
Check the contents of the arrays:
printf '%s\n' "${gene[@]}" "${accession[@]}"
Use the values stored in the arrays to construct esearch
and efetch
commands that download each sequence using the accession and save the sequence to a file named after the gene:
for i in "${!gene[@]}"; do
printf "processing gene %s and accession %s\n" "${gene[i]}" "${accession[i]}"
esearch -db protein -query "${accession[i]}[ACCESSION]" | efetch -format fasta > ${gene[i]}.faa
sleep 1
done
A simpler approach is to use parallel.
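For example, a minimal sketch using GNU parallel to read queries.csv directly; --colsep splits each line on commas so that {1} is the gene name and {2} is the accession (the same file layout as above is assumed):
parallel --delay 1 -j 1 --colsep ',' 'esearch -db protein -query "{2}[ACCESSION]" | efetch -format fasta > {1}.faa' :::: queries.csv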
some_command | tee -a output.txt
In this example the brewsci/bio tap, which provides bioinformatics software, is added:
brew tap brewsci/bio
In this example the Firefox browser is installed:
brew install firefox --cask
In this example parallel is installed:
brew install parallel
In this example clustal-w is installed from brewsci/bio:
brew install brewsci/bio/clustal-w
brew list --cask
brew list
To view graphical applications available from the cask tap via the Homebrew package manager for macOS:
brew search --cask
To view packages available from the core tap via the Homebrew package manager for macOS:
brew search
conda activate ngs
conda activate ngs
conda install -y -c bioconda -c conda-forge picard
conda env create --file env-environment.yaml
In this example an environment called ngs
is created:
conda create -y --name ngs
conda activate ngs
conda install -y -c bioconda -c conda-forge multiqc fastqc trimmomatic bowtie2 subread samtools
conda deactivate
Use the export
command while the environment is active:
conda env export > env-environment.yaml
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b
source ~/miniconda3/bin/activate
conda init
source ~/.bashrc
conda update -y -n base -c defaults conda
To install Conda with included support for Mamba, use Miniforge.
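A minimal installation sketch, mirroring the Miniconda steps above; the installer name follows Miniforge's published naming scheme and may need to be adjusted for your platform:
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh -b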
conda search -c bioconda -c conda-forge
conda info --envs
conda list
In this example the environment to remove is called my-env
:
conda deactivate
conda env remove --name my-env
conda search -c bioconda -c conda-forge blast
The csvkit examples below are taken from the csvkit documentation.
in2csv data.xls > data.csv
in2csv data.json > data.csv
csvjson data.csv > data.json
csvgrep -c phone_number -r "555-555-\d{4}" data.csv > new.csv
csvstat data.csv
csvjoin -c ID file1.csv file2.csv > inner_join_on_ID.csv
By default csvjoin
performs an inner join. Other joins can be performed using --outer
, --left
, and --right
.
csvcut -n data.csv
csvsql --query "select name from data where age > 30" data.csv > new.csv
csvcut -c column_c,column_a data.csv > new.csv
csvcut -c column_a,column_c data.csv > new.csv
cut
can also be used to change the field separator. In this example tab-delimited values are changed to comma-separated values:
cut -d $'\t' -f 1- --output-delimiter=',' bovine_genotypes.vcf
To switch from comma-separated to tab-separated use:
cut -d, -f 1- --output-delimiter=$'\t' sequenced_samples.csv
On platforms lacking --output-delimiter
perl
can be used to switch tabs to commas:
perl -p -e 's/\t/,/g' sequenced_samples.csv
To switch commas to tabs when --output-delimiter
isn't available the command below can be used. This script handles cases when commas are inside of quoted fields:
perl -nle 'my @new = (); push( @new, $+ )
while $_ =~ m{"([^\"\\]*(?:\\.[^\"\\]*)*)",?
| ([^,]+),? | ,}gx; push( @new, undef )
if substr( $_, -1, 1 ) eq '\'','\'';
for(@new){s/,/ /g} print join "\t", @new' sequenced_samples.csv
In this example columns 3
, 10
, 11
, and 12
are extracted from a tab-delimited file:
cut -d $'\t' -f 3,10-12 bovine_genotypes.vcf
Here cut
is used to extract the first three characters from each line:
cut -c 1-3 sequenced_samples.csv
In this example columns 1
and 2
are extracted from a CSV file:
cut -d, -f 1,2 sequenced_samples.csv
In this example all columns except column 1
are returned (the -f 2-
is used to mean "from field 2 to the end"):
cut -d $'\t' -f 2- bovine_genotypes.vcf
GNU datamash is a command-line program which performs basic numeric, textual and statistical operations on input textual data files.
In the following example the input CSV file has a header line. Records are grouped based on the value in column 2
, and for each group the mean value of column 5
is printed:
datamash -H -t, -g 2 mean 5 < example.csv
In the following example all the values in column 5
are printed for each group:
datamash -H -t, -g 2 collapse 5 < example.csv
In the following example a variety of statistics are generated for column 5
:
datamash -H -t, min 5 q1 5 median 5 q3 5 max 5 count 5 mean 5 sstdev 5 < example.csv
The following uses transpose to swap rows and columns in a CSV file with a header row:
datamash -H -t, transpose < example.csv
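Because these operations assume every line has the same number of fields, it can help to validate the file first; a quick sketch using datamash's check operation:
datamash -t, check < example.csv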
Download a Bakta Docker image:
docker pull oschwengers/bakta
Prepare the Bakta database (the --type
argument can be light
or full
):
docker run --rm -v "$(pwd)":/dir -u "$(id -u)":"$(id -g)" -w /dir -e PATH=/opt/conda/bin/ --entrypoint=/opt/conda/bin/bakta_db oschwengers/bakta download --output my_database --type light
Annotate the genome. In this example the genome sequence to be annotated is in a file called UAMS_ABS94.fna
, located in the current directory:
docker run --rm -v "$(pwd)":/dir -u "$(id -u)":"$(id -g)" -w /dir oschwengers/bakta --db my_database/db-light --output UAMS_ABS94 UAMS_ABS94.fna
Download a Prokka Docker image:
docker pull staphb/prokka:latest
Create a container from the image and run prokka
to annotate the sequence. In this example the genome sequence to be annotated is in a file called sequence.fasta
, located in the current directory, and four CPUs are used:
docker run --rm -v "$(pwd)":/dir -u "$(id -u)":"$(id -g)" -w /dir staphb/prokka:latest prokka sequence.fasta --cpus 4
Download a Snippy Docker image:
docker pull quay.io/biocontainers/snippy:4.6.0--hdfd78af_1
Create a container from the image and run snippy
to find SNPs. In this example the reference sequence is in a file called sequence.gbk
, and the reads are in J10_S210_R1_001.fastq
and J10_S210_R2_001.fastq
:
docker run -it --rm \
-u "$(id -u)":"$(id -g)" \
-v "$(pwd)":/directory \
-w /directory \
quay.io/biocontainers/snippy:4.6.0--hdfd78af_1 \
snippy --cpus 4 --outdir output --reference sequence.gbk --R1 J10_S210_R1_001.fastq --R2 J10_S210_R2_001.fastq
Download a Pandoc Docker image:
docker pull pandoc/extra
Create a container from the image and run pandoc
to convert a Markdown file to a PDF file. In this example the Markdown file is called example.md
and the PDF file is called example.pdf
:
docker run --rm -v "$(pwd)":/dir -u "$(id -u)":"$(id -g)" -w /dir \
pandoc/extra:latest \
example.md -o example.pdf \
--pdf-engine xelatex \
--template eisvogel.tex \
-H head.tex \
--highlight-style zenburn \
-M "colorlinks: TRUE" \
-M "code-block-font-size: \scriptsize" \
-M "date: `date +%Y-%m-%d`" \
-M "author=$AUTHOR" \
-M "title=$TITLE"
docker container rm $(docker ps -a -q)
docker system prune --all
docker run -it quay.io/biocontainers/snippy:4.6.0--hdfd78af_1 /bin/bash
Or
docker run --rm -it --entrypoint bash nidhaloff/igel
docker container kill $(docker ps -q)
Or to continually kill containers:
while true; do docker container kill $(docker ps -q); sleep 2; done
docker image ls
docker container ls
Use the --entrypoint
option to override the entrypoint. In this example the entrypoint is revised to /opt/conda/bin/bakta_db
and the list
option is passed to the bakta_db
command:
docker run --rm --entrypoint='/opt/conda/bin/bakta_db' oschwengers/bakta list
Download a legacy BLAST Docker image:
docker pull quay.io/biocontainers/blast-legacy:2.2.26--2
Create a container from the image and run formatdb
to create a formatted database. In this example the database is created from a DNA sequence file called database.fasta
, located in the current directory:
docker run -it --rm -v "$(pwd)":/directory -u "$(id -u)":"$(id -g)" -w /directory quay.io/biocontainers/blast-legacy:2.2.26--2 formatdb -i database.fasta -p F
To perform a blastn search using the formatted database and a query called query.fasta
when the file is also located in the current directory:
docker run -it --rm -v "$(pwd)":/directory -u "$(id -u)":"$(id -g)" -w /directory quay.io/biocontainers/blast-legacy:2.2.26--2 blastall -p blastn -d database.fasta -i query.fasta
To perform a blastn search using the formatted database and a query called query.fasta
when the query is located in a different directory (in this example your home directory):
docker run -it --rm -v "$(pwd)":/directory/database -v "${HOME}":/directory/query -u "$(id -u)":"$(id -g)" -w /directory quay.io/biocontainers/blast-legacy:2.2.26--2 blastall -p blastn -d database/database.fasta -i query/query.fasta
Use the -e
option to set an environment variable. In this example the PATH
environment variable is set to /opt/conda/bin/
:
docker run --rm -v "$(pwd)":/dir -u "$(id -u)":"$(id -g)" -w /dir -e PATH=/opt/conda/bin/ --entrypoint=/opt/conda/bin/bakta_db oschwengers/bakta download --output my_database --type light
docker container stop some_container
In this example the title >KL1
is added to the beginning of the sequence in KL1sequence.txt
:
perl -pi -e 'print ">KL1\n" if $. == 1' KL1sequence.txt
Combine FASTA files into a single file replacing each FASTA record name with the name of the input file
for file in *.fasta
do
echo ">$file" >> out.fasta
tail -n +2 "$file" >> out.fasta
done
Or:
for file in *.fasta
do
{
echo ">$file"
tail -n +2 "$file"
} >> out.fasta
done
cat input.fasta | perl -n -0777 -e 'BEGIN{print "SNP_Name,Sequence\n"}' -e 'while ($_ =~ m/^>([^\n]+)\n([^>]+)/gm) {$name = $1; $seq = $2; $seq =~s/\s//g; print $name . "," . $seq . "\n"}' > output.csv
In this example the number of reads in several .fastq.gz
files is determined by submitting jobs to Slurm using sbatch
. The advantage of this approach is that the counts for each file are generated in parallel, so that results can be obtained more quickly.
The naming scheme of the .fastq.gz
files is as follows (the sample name is in the file name, for example ctrl1
):
ctrl1_R1.fastq.gz ctrl2_R2.fastq.gz trtA1_R1.fastq.gz trtA2_R2.fastq.gz
ctrl1_R2.fastq.gz ctrl3_R1.fastq.gz trtA1_R2.fastq.gz trtA3_R1.fastq.gz
ctrl2_R1.fastq.gz ctrl3_R2.fastq.gz trtA2_R1.fastq.gz trtA3_R2.fastq.gz
First create an sbatch
script called count-reads.sbatch
(replace someuser
on the second line with your username):
#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --time=0:30:0
#SBATCH --ntasks=1
#SBATCH --mem=1000M
READS=$(expr $(zcat $1 | wc -l) / 4)
echo "$1 $READS"
The above uses zcat
to output the file contents to wc
which counts the lines. expr
then takes the count from wc
and divides by 4
to give the number of sequence reads.
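On systems with Bash available, the same calculation can be written with arithmetic expansion instead of expr (a sketch equivalent to the line in the script above):
READS=$(( $(zcat "$1" | wc -l) / 4 ))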
parallel
can then be used to apply count-reads.sbatch
to each .fastq.gz
file. The --dryrun
option causes parallel
to print out the commands instead of running them. The --delay 1
inserts a one second delay between printing or running jobs. Use the following to print the sbatch
commands that will be run later on:
parallel --dryrun --delay 1 -j 1 \
"sbatch -o {1}.out -e {1}.err count-reads.sbatch {1}" ::: *.fastq.gz
The ::: *.fastq.gz
leads to one sbatch
command being constructed per .fastq.gz
file in the current directory.
Each instance of {1}
gets replaced with the full name of the .fastq.gz
file, such that each input file gets a unique and traceable output filename (so that results aren't overwritten and the relationships among files are clear).
To submit the jobs, run the parallel
command again, but without the --dryrun
option:
parallel --delay 1 -j 1 \
"sbatch -o {1}.out -e {1}.err count-reads.sbatch {1}" ::: *.fastq.gz
Once the jobs are submitted their statuses can be checked using squeue
(replace username
with your username):
squeue -u username
Each job creates two output files, for example:
ctrl1_R1.fastq.gz.out
ctrl1_R1.fastq.gz.err
The .out
files will contain the name of the input file and the number of reads, for example:
ctrl1_R1.fastq.gz 77283
To quickly check the .err
files, which contain any errors or warnings generated by the jobs, use:
cat *.err | more
Once the jobs are complete the output files can be analyzed, e.g.:
cat *R1.fastq.gz.out | sort > R1.reads
cat *R2.fastq.gz.out | sort > R2.reads
paste R1.reads R2.reads
zcat < SRR13388732_1.fastq.gz | paste - - - - | cut -f 2 | tr -d '\n' | wc -c
Or:
cat SRR13388732_1.fastq | paste - - - - | cut -f 2 | tr -d '\n' | wc -c
file=SRR13388732_1.fastq
READS=$(expr $(cat $file | wc -l) / 4)
echo $READS
file=SRR13388732_1.fastq.gz
READS=$(expr $(zcat < $file | wc -l) / 4)
echo $READS
To obtain a list of files available for a particular species:
rsync --list-only rsync://ftp.ensembl.org/ensembl/pub/current_fasta/oncorhynchus_mykiss/dna/
To download one of the files:
rsync -av --progress rsync://ftp.ensembl.org/ensembl/pub/current_fasta/oncorhynchus_mykiss/dna/Oncorhynchus_mykiss.USDA_OmykA_1.1.dna.toplevel.fa.gz .
The following example uses Kingfisher, which can download data from the ENA, NCBI, AWS, and GCP.
The accessions are in a file named SRR_Acc_List.txt
and are passed to Kingfisher using parallel
. The --resume
and --joblog
options allow the command to be re-run without repeating previously completed jobs.
cat SRR_Acc_List.txt | parallel --resume --joblog log.txt --verbose --progress -j 1 'kingfisher get -r {} -m ena-ascp aws-http prefetch'
The following uses the SRA Toolkit.
Paired-end data:
cat SRR_Acc_List.txt | xargs -I{} fastq-dump -I --split-files --gzip {}
Single-end data:
cat SRR_Acc_List.txt | xargs -I{} fastq-dump --gzip {}
For better performance use fasterq-dump
(the following works for single-end and paired-end data):
cat SRR_Acc_List.txt | xargs -I{} fasterq-dump {}
gzip *.fastq
Note that fasterq-dump
will generate large cache files in ~/ncbi/public/sra
. If this directory does not have sufficient storage, create a symbolic link to another directory that does, for example /scratch
:
mv ~/ncbi/public/sra /scratch/
ln -s /scratch/sra ~/ncbi/public/sra
SRA Explorer is an online resource that takes a list of accessions and returns a selectable list of ENA download URLs and sequencing run metadata.
Use the NCBI Datasets command-line tools.
The following downloads the mouse reference genome sequence:
datasets download genome taxon mouse --reference --include genome
To download the mouse reference genome and associated amino acid sequences, nucleotide coding sequences, gff3 file, and gtf file:
datasets download genome taxon mouse --reference --include genome,protein,cds,gff3,gtf
For more commands and options see the how-to guides.
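The download arrives as a zip archive (typically named ncbi_dataset.zip); a minimal sketch for unpacking it, where mouse_reference is just an example directory name:
unzip ncbi_dataset.zip -d mouse_reference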
In this example the sequence names of interest are in the file names.txt
and the FASTA sequences are in the file input.fasta
:
cat names.txt | xargs -I{} perl -w -076 -e '$count = 0; open(SEQ, "<" . $ARGV[0]); while (<SEQ>) {if ($_ =~ m/\Q$ARGV[1]\E/) {$record = $_; $record =~ s/[\s>]+$//g; print ">$record\n"; $count = $count + 1;}} if ($count == 0) {print STDERR "No matches found for $ARGV[1]\n"} elsif ($count > 1) {print STDERR "Multiple matches found for $ARGV[1]\n"} close(SEQ);' input.fasta {} > output.fasta
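If the entries in names.txt exactly match the sequence identifiers rather than partial titles, an alternative sketch uses samtools faidx (an assumption: the names must match the FASTA identifiers exactly):
xargs samtools faidx input.fasta < names.txt > output.fasta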
In the following example the sequences
folder contains paired-end data, in files like S00EC-0001_S30_1.fastq.gz
and S00EC-0001_S30_2.fastq.gz
. The goal is to merge all the forward reads into one file and all the reverse reads into another file.
cd sequences
rm -f merged_1.fastq.gz
rm -f merged_2.fastq.gz
find . \( -name "*_1.*" -name "*.fastq.gz" \) -type f \
| while IFS= read -r file1; do
# forward and reverse filename examples:
# S00EC-0001_S30_1.fastq.gz
# S00EC-0001_S30_2.fastq.gz
# construct name of other file
file2="${file1/_1./_2.}"
echo "Processing file '$file1' and '$file2'"
# check that number of lines match
lines_R1=$(zcat < "$file1" | wc -l)
lines_R2=$(zcat < "$file2" | wc -l)
if [ "$lines_R1" == "$lines_R2" ] ; then
echo "Both files contain $lines_R1 lines, proceeding with merge"
cat $file1 >> merged_1.fastq.gz
cat $file2 >> merged_2.fastq.gz
else
echo "WARNING:"
echo "$file1 contains $lines_R1 lines"
echo "$file2 contains $lines_R2 lines"
exit 1
fi
done
In this example samtools
is used to obtain the upstream and downstream 100
bases of flanking sequence of a SNP at position 42234774
on sequence Y
from the FASTA file ARS-UCD1.2_Btau5.0.1Y.fa
:
flank=100
pos=42234774
sequence=Y
fasta=ARS-UCD1.2_Btau5.0.1Y.fa
ustart=$(expr $pos - $flank)
uend=$(expr $pos - 1)
dstart=$(expr $pos + 1)
dend=$(expr $pos + $flank)
echo "Upstream flank:" && \
samtools faidx "$fasta" "$sequence":$ustart-$uend | perl -0777 -p -e 's/(?<=[A-Z])\n(?=[A-Z])//gm' && \
echo "SNP site" && \
samtools faidx "$fasta" "$sequence":$pos-$pos | perl -0777 -p -e 's/(?<=[A-Z])\n(?=[A-Z])//gm' && \
echo "Downstream flank:" && \
samtools faidx "$fasta" "$sequence":$dstart-$dend | perl -0777 -p -e 's/(?<=[A-Z])\n(?=[A-Z])//gm'
The perl
command is used to remove line breaks within the sequence.
Sample output:
Upstream flank:
>Y:42234674-42234773
GGTGTTCAGGTTTGGGAACATGTGTACACCATGGCAGATTCATGTTGATGTATGGCAAAAGCAATACAATATTGTAAAGTAATTAACCTCCAATTAAAAT
SNP site
>Y:42234774-42234774
A
Downstream flank:
>Y:42234775-42234874
AATAAATTTATATTAAAAATGCATCAATTCTTTGGCACTCAGCTTTCTTCACATTCCAATACTCACATCCATACATGACTACTGGAAAATCATAGCCTTG
High-throughput sequencing data is often distributed as pairs of files corresponding to the two different read sets generated for each sample, e.g.:
6613_S82_L001_R1_001.fastq.gz
6613_S82_L001_R2_001.fastq.gz
70532_S37_L001_R1_001.fastq.gz
70532_S37_L001_R2_001.fastq.gz
k2712_S5_L001_R1_001.fastq.gz
k2712_S5_L001_R2_001.fastq.gz
To analyze data from multiple samples, the following while
loop code can be used. It iterates through the R1
files and from each filename constructs the matching R2
filename. Two useful variables called fnx
and fn
are also created for each file, storing the filename without the path to the file, and the filename without the path and without the file extension, respectively:
find . \( -name "*_R1_*" -name "*.fastq.gz" \) -type f \
| while IFS= read -r file; do
fnx=$(basename -- "$file")
fn="${fnx%%.*}"
# Construct name of other file
file2="${file/_R1_/_R2_}"
fnx2=$(basename -- "$file2")
fn2="${fnx2%%.*}"
# $file: ./6613_S82_L001_R1_001.fastq.gz
# $fnx: 6613_S82_L001_R1_001.fastq.gz
# $fn: 6613_S82_L001_R1_001
# $file2: ./6613_S82_L001_R2_001.fastq.gz
# $fnx2: 6613_S82_L001_R2_001.fastq.gz
# $fn2: 6613_S82_L001_R2_001
echo "Processing file '$fnx' and '$fnx2'"
# place commands here that work on file pairs
lines_R1=$(zcat < "$file" | wc -l)
lines_R2=$(zcat < "$file2" | wc -l)
if [ "$lines_R1" == "$lines_R2" ] ; then
echo "Both files contain $lines_R1 lines"
else
echo "WARNING:"
echo "$fnx contains $lines_R1 lines"
echo "$fnx2 contains $lines_R2 lines"
fi
done
Another option is to use parallel.
In the following example paired-end reads with names like sampleA_1.fastq.gz
and sampleA_2.fastq.gz
in a directory called data
are mapped to a reference called ref
using bowtie2
:
parallel -j2 "bowtie2 --threads 4 -x ref -k1 -q -1 {1} -2 {2} -S {1/.}.sam >& {1/.}.log" ::: data/*_1.fastq.gz :::+ data/*_2.fastq.gz
The {1/.}
removes the path and the extension from the first filename, leaving the sample name.
Here the VCF used for reordering is input.vcf.gz
and the FASTA file to be reordered is reference.fa
. The new file is called reference.reordered.fa
.
bcftools
, tabix
and samtools
are used:
# create index files for the VCF and FASTA files in case they don't already exist
tabix -p vcf input.vcf.gz
samtools faidx reference.fa
# extract the sequence names from the VCF header
bcftools view -h input.vcf.gz | grep "##contig" > sequence.dict
# reorder the FASTA file
SEQUENCES=$(cat sequence.dict | sed 's/.*ID=//' | sed 's/,.*//')
for SEQ in $SEQUENCES; do
samtools faidx reference.fa "$SEQ" >> reference.reordered.fa
done
Check the output:
grep ">" reference.reordered.fa
An alternative to running the sbatch
script for each input file using a loop or parallel
is to run it once using the --array=1-N
option where you replace N
with the number of input files you want to process.
The --array
option causes Slurm to create what is called a job array. The advantage of using this approach is that all the jobs are grouped together when viewed using squeue
, and job arrays can allow the scheduler to run jobs more efficiently.
When the --array
option is used, the contents of the batch script are run once for each input file. Within the batch script the input file is accessed using the variable $SLURM_ARRAY_TASK_ID
, which is set by Slurm to the current array index.
This example deals with .fastq.gz
files, named as follows (the sample name is in the file name, for example ctrl1
):
ctrl1_R1.fastq.gz ctrl2_R2.fastq.gz trtA1_R1.fastq.gz trtA2_R2.fastq.gz
ctrl1_R2.fastq.gz ctrl3_R1.fastq.gz trtA1_R2.fastq.gz trtA3_R1.fastq.gz
ctrl2_R1.fastq.gz ctrl3_R2.fastq.gz trtA2_R1.fastq.gz trtA3_R2.fastq.gz
Here is a batch script called fastqc.sbatch
, which is designed to process .fastq.gz
files using the fastqc
program.
#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --time=0:30:0
#SBATCH --ntasks=1
#SBATCH --mem=1000M
# Load required modules
module load fastqc
# This is to access the list of R1 files passed as an argument
R1_LIST="$1"
# Get the name of the R1 file to be processed from the list using
# the current array index
# e.g. ctrl1_R1.fastq.gz
R1_FILE="$(sed -n "${SLURM_ARRAY_TASK_ID}p" "$R1_LIST")"
# Construct the name of the R2 file to be processed
# e.g. ctrl1_R2.fastq.gz
R2_FILE="${R1_FILE/_R1/_R2}"
# Remove both extensions and the R1 to get the sample name
# e.g. ctrl1
SAMPLE_NAME="${R1_FILE/_R1.fastq.gz/}"
# Create the output folder for fastqc results for the sample
mkdir -p "fqc_$SAMPLE_NAME"
# Run fastqc on the R1 and R2 files and output to a folder
# named after the sample name
fastqc "$R1_FILE" "$R2_FILE" -o "fqc_$SAMPLE_NAME"
Replace someuser
with your username prior to using.
Next, generate a file of the R1 files to be processed and store the length of the list in a variable:
ls *R1.fastq.gz > R1.list
R1_COUNT=$(wc -l R1.list | cut -d' ' -f1)
echo $R1_COUNT
Finally, submit the job array:
sbatch --array=1-$R1_COUNT -o %A-%a.fqc.out -e %A-%a.fqc.err \
fastqc.sbatch R1.list
The --array
option is used to specify the range of array indices to be used. In this case the range is 1
to the number of R1 files. These values will be used to set the $SLURM_ARRAY_TASK_ID
variable in the batch script and to extract a line (using sed
) from the R1.list
file.
The -o
and -e
options are used to specify the files that will receive output and error messages. The %A
and %a
are replaced by the job array ID (assigned by Slurm) and the array index, respectively. This naming scheme ensures that each job has a unique output and error file.
The R1.list
file is passed as an argument to the batch script so that it can be accessed within the script as $1
.
To view a list of your submitted jobs (replace username
with your username):
squeue -u username
The job array will be listed as a single job, initially with a status of PD
(pending). Once the job starts running the status will change to R
(running). Once the job array is complete the status will change to CG
(completing).
The fastqc
results will be written to folders named after the sample names, for example fqc_ctrl1
. Output and error messages generated by the script will be written to files with names like 15268018-1.fqc.err
and 15268018-1.fqc.out
.
The 1
in the file names corresponds to the array index and the 15268018
is the job array ID. You can match up the array index to a specific file by looking at the R1.list
file. The first line of R1.list
corresponds to array index 1
, the second line corresponds to array index 2
, etc.
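For example, to see which R1 file corresponds to array index 3 (an arbitrary index chosen for illustration), run the same sed command used in the batch script:
sed -n "3p" R1.list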
With fastqc
, the *.err
and *.out
files will normally contain progress messages generated by the program when it was running. These files contain what is called "standard error" and "standard output" respectively.
Suppose that a single job didn't complete because not enough time was requested. This can be determined by looking at the *.err
file for that job or by an absence of expected output, or by using sacct
with the job array ID (which is the prefix of the *.err
and *.out
files):
sacct -j 15268018 --format=JobID,JobName,State
The above command will show the status of the job array with ID 15268018
. States that indicate a job didn't complete include CANCELLED
, FAILED
, and TIMEOUT
.
The job can be resubmitted using the same command as before, but with a different range of array indices. For example, to resubmit the first job:
sbatch --array=1 --time=1:00:00 \
-o %A-%a.fqc.out -e %A-%a.fqc.err \
fastqc.sbatch R1.list
The --time
option is used to specify the new time limit and overrides the value in the script.
To resubmit jobs 3 and 5:
sbatch --array=3,5 --time=1:00:00 \
-o %A-%a.fqc.out -e %A-%a.fqc.err \
fastqc.sbatch R1.list
In this example the sequences are written to a directory called out
:
outputdir=out/
mkdir -p "$outputdir"
awk '/^>/ {OUT=substr($0,2); split(OUT, a, " "); sub(/[^A-Za-z_0-9\.\-]/, "", a[1]); OUT = "'"$outputdir"'" a[1] ".fa"}; OUT {print >>OUT; close(OUT)}' input.fasta
The following uses wkhtmltopdf and gs:
url=https://sites.ualberta.ca/~stothard/; depth=1
wget --spider --force-html -r -l${depth} ${url} 2>&1 | grep '^--' | awk '{ print $3 }' | grep -i '\.\(html\|htm\)$' | uniq > url-list.txt
while read i; do wkhtmltopdf "$i" "$(echo "$i" | sed -e 's/https\?:\/\///' -e 's/\//-/g' ).pdf"; done < url-list.txt
gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress -sOutputFile=merged-output.pdf $(ls -lrt -1 *.pdf)
See ffmprovisr for commands to modify and convert audiovisual files.
There are numerous programs available for this task.
See bio and the example commands.
Some example commands taken from the bio documentation:
# fetch GenBank data
bio fetch NC_045512 MN996532 > genomes.gb
# convert the first ten bases of the genomes to FASTA.
bio fasta genomes.gb --end 10
# align the coding sequences for the S protein
bio fasta genomes.gb --gene S --protein | bio align | head
# print the GFF record that corresponds to the coding sequence for gene S
bio gff genomes.gb --gene S
# show the descendants of taxid 117565
bio taxon 117565 | head
# show the lineage of a taxonomic rank.
bio taxon 117565 --lineage | head
# get metadata on a viral sample
bio meta 11138 -H | head
# define a sequence ontology term
bio define exon
# define a gene ontology term
bio define food vacuole
The following uses ssconvert
, which is distributed with Gnumeric:
ssconvert input.csv output.xlsx
Or use VisiData:
vd input.csv -b -o output.xlsx
Use VisiData:
vd input.csv -b -o output.html
The following uses csv2md. The awk
command can be used to change missing values to .
:
awk 'BEGIN { FS = OFS = "," } { for(i=1; i<=NF; i++) if($i ~ /^ *$/) $i = "." }; 1' input.csv > temp.csv
csv2md -p < temp.csv | sed 's/_/\\_/g' > output.md
Or use VisiData:
vd input.csv -b -o output.md
perl -nle 'my @new = (); push( @new, $+ ) while $_ =~ m{"([^\"\\]*(?:\\.[^\"\\]*)*)",? | ([^,]+),? | ,}gx; push( @new, undef ) if substr( $text, -1, 1 ) eq '\'','\''; for(@new){s/,/ /g} print join "\t", @new' input.csv > output.tab
Or use VisiData:
vd input.csv -b -o output.tab
The following uses calibre:
ebook-convert input.epub output.pdf
The following uses LibreOffice:
soffice --headless --convert-to pdf --outdir . word_file.docx
The following uses pandoc and on macOS also requires basictex:
pandoc word_file.docx --output word_file.pdf
The following uses csvkit:
in2csv data.xls > data.csv
Or use VisiData. Change Sheet1
in the command below to match the name of the sheet to be converted:
vd input.xls +:Sheet1:1:1 -b -o output.csv
Use any2fasta:
any2fasta input.gb > output.fa
Install perl-bioperl and then use bp_genbank2gff3:
bp_genbank2gff3 input.gb --outdir output_directory
The following uses wkhtmltopdf:
wkhtmltopdf http://google.com google.pdf
The following uses wkhtmltoimage:
wkhtmltoimage -f png http://google.com google.png
Another approach, which may work better for complex web sites, is to use pageres-cli. The following creates an image that is 4485 pixels wide (5 * 897):
pageres http://google.com 897x1090 --crop --scale=5 --filename='google'
The command below uses pandoc and the pandoc.css
file available here.
pandoc -f markdown -t html -o output.html input.md --css=pandoc.css --self-contained
The command below uses pandoc and the eisvogel.tex template.
The head.tex
file consists of the following:
\definecolor{bgcolor}{HTML}{E0E0E0}
\let\oldtexttt\texttt
\renewcommand{\texttt}[1]{
\colorbox{bgcolor}{\oldtexttt{#1}}
}
pandoc input.md -o output.pdf --pdf-engine=xelatex --from markdown --template=eisvogel.tex --highlight-style zenburn -H head.tex
The command below uses pandoc. See the pandoc documentation on slide shows.
The --reference-doc
parameter is optional and can be used to supply a PowerPoint template.
pandoc input.md -o output.pptx --reference-doc template.potx
Use marker:
python convert_single.py input.pdf output.md
The following uses find
and the pdftoppm
command from the poppler package to generate a PNG image of the first page of every PDF file in the working directory:
find . -name "*.pdf" -exec pdftoppm -f 1 -l 1 -png -r 600 {} {} \;
The -r 600
option sets the resolution to 600 dpi. The -f 1
and -l 1
options specify the first and last pages to convert. The {}
is used to specify the input and output file names (they are passed twice to pdftoppm
from find
). pdftoppm
automatically appends the page number to the output file name.
The following uses ImageMagick:
convert *.png output.pdf
The following uses LibreOffice:
soffice --headless --convert-to pdf input.pptx
The following uses LibreOffice and ImageMagick to create a separate file for each slide (e.g. slide_1.png
, slide_2.png
, etc.):
soffice --headless --convert-to pdf input.pptx
convert -density 400 -resize 3000^ -scene 1 input.pdf slide_%d.png
The following uses svgexport:
SVGEXPORT_TIMEOUT=60 svgexport input.svg output.png 4000:
The SVGEXPORT_TIMEOUT=60
option sets the timeout to 60 seconds. The 4000:
option sets the width to 4000 pixels and the height is automatically calculated to maintain the aspect ratio.
awk 'BEGIN { FS="\t"; OFS="," } {
rebuilt=0
for(i=1; i<=NF; ++i) {
if ($i ~ /,/ && $i !~ /^".*"$/) {
gsub("\"", "\"\"", $i)
$i = "\"" $i "\""
rebuilt=1
}
}
if (!rebuilt) { $1=$1 }
print
}' input.tsv > output.csv
Or use VisiData:
vd input.tsv -b -o output.csv
Use VisiData:
vd input.tsv -b -o output.xlsx
Use SingleFile or single-file-cli.
The following uses single-file-cli to save https://www.wikipedia.org as wikipedia.html
in the current folder:
single-file https://www.wikipedia.org wikipedia.html
See the bio repository and examples.
To download two GenBank files and save as a single file:
bio fetch NC_045512 MN996532 > genomes.gb
The command below downloads a GenBank file from the Nucleotide database at NCBI:
i=NC_045512.2
curl -s "https://eutils.ncbi.nlm.nih.gov\
/entrez/eutils/efetch.fcgi?db=nucleotide\
&id=${i}&rettype=gbwithparts&retmode=txt" \
> $i.gbk
To obtain a list of files available for a particular species:
rsync --list-only rsync://ftp.ensembl.org/ensembl/pub/current_gtf/oncorhynchus_mykiss/
To download one of the files:
rsync -av --progress rsync://ftp.ensembl.org/ensembl/pub/current_gtf/oncorhynchus_mykiss/Oncorhynchus_mykiss.USDA_OmykA_1.1.106.gtf.gz .
wget --mirror --convert-links --page-requisites --no-parent https://www.somesite.com/course/material/
Note that the destination must be another Globus endpoint.
Install Globus CLI with pip and log into your Globus account
pip install globus-cli
globus login
Access the Globus Web App and note the UUIDs for the source and destination endpoints on the respective overview pages.
To transfer a single file:
globus transfer {source UUID}:/path/to/source/file {dest UUID}:/path/to/dest/file
To transfer all files in a directory:
globus transfer {source UUID}:/path/to/source/dir/ {dest UUID}:/path/to/dest/dir/ --recursive
Replace host
, account
, password
, and port
with their corresponding values in the following command:
wget -S -d -c -t 45 -v -r ftp://account:password@host:port/*
Rclone can be used to download data from many cloud storage providers.
First, follow the configuration instructions. The commands below assume that the remote system was named my_google_drive
during the configuration.
Note that you can use the Add shortcut to Drive
option in Google Drive to make folders and files in Shared with me
easier to access using rclone.
To list remote drives:
rclone listremotes
To list directories:
rclone lsd my_google_drive:
To list the contents of a directory
rclone ls my_google_drive:some_directory
To copy the drive to local storage:
rclone copy -P my_google_drive:some_directory ./some_directory
Alternatively, use Transmit or Google Drive for desktop.
Use the NCBI Datasets command-line tools.
The following downloads the mouse reference genome sequence:
datasets download genome taxon mouse --reference --include genome
To download the mouse reference genome and associated amino acid sequences, nucleotide coding sequences, gff3 file, and gtf file:
datasets download genome taxon mouse --reference --include genome,protein,cds,gff3,gtf
For more commands and options see the how-to guides.
The commands below download protein sequences derived from Psychrobacter assemblies in the NCBI Assembly database.
esearch
, efetch
, and xtract
can be installed using conda, e.g. conda install -c bioconda entrez-direct
.
esearch -db assembly -query '"Psychrobacter"[Organism] AND "latest refseq"[filter]' \
| efetch -format docsum \
| xtract -pattern DocumentSummary -element FtpPath_RefSeq \
| awk -F"/" '{print "curl -o "$NF"_protein.faa.gz " $0"/"$NF"_protein.faa.gz"}' \
| xargs -0 bash -c
Other file types can be downloaded by changing the _protein.faa.gz
portions of the awk
command. For example, use _genomic.fna.gz
to download genomic sequences, or _genomic.gff.gz
to download annotations of the genomic sequences in Generic Feature Format Version 3.
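For example, the following is the same pipeline with the protein file suffix swapped for _genomic.fna.gz so that the genomic sequences are downloaded instead:
esearch -db assembly -query '"Psychrobacter"[Organism] AND "latest refseq"[filter]' \
| efetch -format docsum \
| xtract -pattern DocumentSummary -element FtpPath_RefSeq \
| awk -F"/" '{print "curl -o "$NF"_genomic.fna.gz " $0"/"$NF"_genomic.fna.gz"}' \
| xargs -0 bash -c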
The example below is similar but includes the species name in the name of the output files (e.g. Psychrobacter-arcticus_GCF_000012305.1.faa.gz
instead of GCF_000012305.1_ASM1230v1_protein.faa.gz
):
esearch -db assembly -query '"Psychrobacter"[Organism] AND "latest refseq"[filter]' \
| efetch -format docsum \
| xtract -pattern DocumentSummary -element SpeciesName -element AssemblyAccession -element FtpPath_RefSeq \
| perl -ne 'if (m/([^\t]+)\t([^\t]+)\t([^\t\n]+)/) {$out="$1_$2"; $url=$3; $out=~s/ /-/g; if ($url=~m/([^\/]+)$/) {print "curl -o $out.faa.gz $url/$1_protein.faa.gz\n"}}' \
| xargs -0 bash -c
The following finds files in the directory output
ending in .vcf.gz
or .vcf.gz.tbi
and copies them to the vcfs
directory:
mkdir vcfs
find output -type f \( -name "*.vcf.gz" -o -name "*.vcf.gz.tbi" \) -exec cp {} vcfs \;
The command below finds files named star-fusion.fusion_candidates.preliminary
and parses the sample name from a directory name in the path to the file. The sample name is then used to construct a name for the copy. For example, ./231_S12_R1_001/star-fusion.fusion_candidates.preliminary
is copied to ./fusion-candidates/231_S12_R1_001.fusion_candidates.preliminary
.
find . -name star-fusion.fusion_candidates.preliminary -exec sh -c $'sample=$(perl -e \'if($ARGV[0] =~ m/^\.\/([^\/]+)/){print "$1\n"}\' $1); cp "$1" "./fusion-candidates/${sample}.fusion_candidates.preliminary"' -- {} \;
The command below reports the total size of files ending with .bam
(ignoring case). If more than one invocation of du
is required because the file list is very long, multiple totals will be reported and need to be summed.
find . -type f -iname '*.bam' -exec du -ch {} + | grep total$
The following will work on fewer platforms but will provide a single total:
find . -iname '*.bam' -exec du -cb {} + | grep -e "total$" | cut -f1 | paste -sd+ - | bc | numfmt --to=iec --suffix=B --round=towards-zero
The following reports the 10 largest files in the current directory or its subdirectories, sorted by size:
find . -type f -print0 | xargs -0 du -h | sort -hr | head -10
The command below finds .gff
files and then each file is processed as follows:
- tail is used to skip a header line.
- awk is used to count the number of occurrences of each category in column 3 and print the category and counts.
- sort is used to sort the categories by count from largest to smallest, with ties broken by sorting on category name.
- The results are redirected to a file named after the input file but with .cog_counts.txt appended.
find . -type f -name "*.gff" -print0 | xargs -0 -I{} sh -c $'tail -n +2 "$1" | awk -F $\'\t\' \'{count[$3]++}END{for(j in count) print j,count[j]}\' | sort -k 2,2nr -k 1,1> "$1.cog_counts.txt"' -- {}
The command below finds .stats
files and then each file is processed as follows:
- count is used to store the value in a field called "number of records", which is obtained using perl.
- echo is used to print the filename and count.
- perl is used to remove ./ from the filename.
- The filenames and counts for all the files are then sorted numerically by the count value, using sort.
- awk is used to add a header row and column is used to format the output.
find . -name "*.stats" -exec sh -c $'count=$(perl -0777 -ne \'while (m/number of records:\s+?(\d+)/gm) {$out = $1; $out =~ s/\s//g; print "$out"}\' "$1"); echo "$1 $count" | perl -p -e \'s/\.\///g\'' -- {} \; | sort -k 2,2rn | awk 'BEGIN{print "file variants"}1' | column -t
Often it is simpler to use a loop to iterate through the files returned by find
. The following uses awk
to filter .maf
files:
wd=$(pwd)
find . -name "*.maf" -type f | while IFS= read -r file; do
dir=$(dirname -- "$file")
fnx=$(basename -- "$file")
fn="${fnx%.*}"
# $file: dir/filename.extension
# $dir: dir
# $fnx: filename.extension
# $fn: filename
echo "Processing file '$fnx' in directory '$dir'"
cd "$dir"
# always print the first two lines
# print lines where value in column 101 is PASS
awk -F $'\t' 'NR < 3; NR > 2 {if ($101 == "PASS") print $0}' "$fnx" > "${fn}.new"
mv "$fnx" "${fnx}.bak"
mv "${fn}.new" "$fnx"
cd "$wd"
done
To remove the .bak
files:
find . -name "*.bak" -type f -exec rm -rf {} \;
Print the number of lines in every .csv
or .tab
file in or below current directory and redirect the results to a single file:
find . -type f \( -name "*.csv" -o -name "*.tab" \) | while read f; do wc -l "$f" >> output.txt; done
Or:
find . -type f \( -name "*.csv" -o -name "*.tab" \) -exec wc -l {} \; > output.txt
Or:
find . -type f \( -name "*.csv" -o -name "*.tab" \) -print0 | xargs -0 -I{} wc -l {} > output.txt
Print the number of lines in every .csv
or .tab
file in or below current directory and redirect the results to separate files:
find . -type f \( -name "*.csv" -o -name "*.tab" \) | while read f; do wc -l "$f" > "${f}.output.txt"; done
Or:
find . -type f \( -name "*.csv" -o -name "*.tab" \) -exec sh -c 'wc -l "$1" > "$1.output.txt"' -- {} \;
Or:
find . -type f \( -name "*.csv" -o -name "*.tab" \) -print0 | xargs -0 -I{} sh -c 'wc -l "$1" > "$1.output.txt"' -- {}
The following removes files with a bam
extension:
find . -type f -name '*.bam' -exec rm {} \;
Use +
instead of \;
to run a command on multiple files at once. The following runs cat
on all .csv
and .tab
files in or below the current directory:
find . -type f \( -name "*.csv" -o -name "*.tab" \) -exec cat {} +
The following uses find
to generate a list of .vcf.gz
files, which is then sorted based on sample number using sort
.
The files in this example are named DNA-1.vcf.gz
, DNA-2.vcf.gz
, etc. and the sort
is used to sort based on the number appearing after the first -
in the filename.
find . -name "*.vcf.gz" -type f -print0 | sort -z -k2,2n -t- | \
while IFS="" read -r -d "" file; do
fnx=$(basename -- "$file")
echo "'$fnx' contains these samples:"
bcftools query -l $file
done
Use ls
as in the following example:
find . -name "*.png" -type f -exec ls -rt "{}" + | while read file; do
echo "$file"
done
The -execdir option instructs find
to switch to the directory containing each matching file before executing the specified command.
In this example the command creates a .zip
file for each .vcf
file that is found:
find . -name "*.vcf" -type f -execdir zip '{}.zip' '{}' \;
In this example the command reads MD5 sums from MD5.txt
files, checks the MD5 sums, and writes the results to a file named checksum-results
:
find . -name "MD5.txt" -type f -execdir md5sum --check '{}' \; 2>&1 | tee -a checksum-results
The following finds files ending with .vcf.gz
, sorts the files based on the number appearing after the first -
in the filename, and prints the filenames separated by spaces by supplying all to echo
:
find . -name "*.vcf.gz" -type f -print0 | sort -z -k2,2n -t- | xargs -r0 echo
To print the filenames on separate lines, use:
find . -name "*.vcf.gz" -type f -print0 | sort -z -k2,2n -t- | xargs -r0 -L 1 echo
See GitHub's Git documentation for more information.
To add a new remote:
git remote add origin <repo-url>
To edit an existing remote:
git remote set-url origin <new-repo-url>
To verify that the remote URL has changed:
git remote -v
git status
Use the official GitHub CLI tool gh
.
Change org_or_username
to your username to download all your repositories, or to your organization name to download all the repositories of your organization:
org_or_username=paulstothard
Authenticate with a GitHub host:
gh auth login
Clone up to 1000 repositories under the $org_or_username
folder:
gh repo list "$org_or_username" --limit 1000 | while read -r repo _; do
gh repo clone "$repo" "$repo" -- --recurse-submodules
done
Optionally, to clone new repositories and update existing ones:
gh repo list "$org_or_username" --limit 1000 | while read -r repo _; do
gh repo clone "$repo" "$repo" -- --recurse-submodules -q 2>/dev/null || (
cd "$repo"
git checkout -q main 2>/dev/null || true
git checkout -q master 2>/dev/null || true
git pull -q
)
done
git init
To view the branches in a repository:
git branch
To create a new branch and switch to it:
git checkout -b <new-branch>
To switch to a remote branch:
git fetch --all
git checkout <remote-branch>
After adding and committing some changes, to push this branch to remote:
git push -u origin <new-branch>
To merge a branch into main (local) and push the changes to remote:
git checkout main
git merge <new-branch>
git push -u origin main
Git merge conflicts can arise easily. For information on resolving a merge conflict, see Resolving a merge conflict using the command line.
The following will list all the files in the main branch:
git ls-tree -r main --name-only
To add one or more files:
git add <filename1> <filename2>
To add all current modifications in your project (including deletions and new files):
git add --all
git mv <filename-old> <filename-new>
git commit -m "Move <filename-old> to <filename-new>"
git pull <remote> <branch>
For example, to pull from the main branch:
git pull origin main
git push <remote> <branch>
For example, to push to the main branch:
git push -u origin main
Note that the following instructions will remove the file/directory from both the working tree and the index.
To remove one or more files:
git rm <filename>
git commit -m "Remove <filename>"
To remove a directory:
git rm -r <directory>
git commit -m "Remove <directory>"
To remove a file from the index (this untracks the file, it does not delete the file itself):
git rm --cached <filename>
git commit -m "Remove <filename>"
The commit should include a message using the -m option:
git commit -m "A concise description of the changes, e.g. 'Add tests for input file parsing'"
The following changes can be made to commits that have not been pushed to a remote repository. To rewrite the very last commit, with any currently staged changes:
git commit --amend -m "An updated message"
To commit any currently staged changes without rewriting the commit (this essentially adds the staged changes to the previous commit):
git commit --amend --no-edit
Create a .gitignore
file:
touch .gitignore
Add text or patterns to exclude:
echo sensitive_data.txt >> .gitignore
echo test/*.vcf >> .gitignore
git add .gitignore
git commit -m "add .gitignore file"
git push -u origin main
In this example, the following files will be ignored by Git: sensitive_data.txt
, and all files with a .vcf
extension in the directory test
.
Note that adding a .gitignore
file will not remove tracked files; this must be done with git rm
. See Remove files from the repository.
First, copy the clone URL on the GitHub repository page by clicking Clone or Download
. Then, enter the following command in a terminal window. The helpful_commands repository is used as an example:
git clone https://github.com/stothard-group/helpful_commands.git
A tag allows a specific release version of code to be identified, and creates a release that can be downloaded from GitHub. A tagged version serves as a snapshot that does not change.
git tag -a v2.0.0 -m "version 2.0.0"
git push origin v2.0.0
Information on how to choose version numbers is available here.
To unstage changes to one or more files:
git reset <filename1>
To unstage all changes:
git reset
git diff
You can see past changes, including who made them and when:
git log
View the commit history with a list of the files that were changed:
git log --stat
View the commit history along with the actual changes (patches) made in each commit:
git log -p
git status
lets you see which changes have been staged, which haven't, and which files aren't being tracked by Git.
git status
In this example the number of lines with a match to >
is returned:
grep -c ">" input.fasta
In this example the line numbers of lines with a match to 234829
are reported:
grep -n "234829" input.txt
Keep everything except lines starting with #
:
grep -v '^#' input.txt
In this example .fasta
files are removed that contain the text complete genome
on a single line:
grep -l "complete genome" *.fasta | xargs -I{} rm -f {}
In this example .fasta
files are removed that do not contain the text complete genome
on a single line:
grep -L "complete genome" *.fasta | xargs -I{} rm -f {}
This example uses patterns from the text file ensembl_ids.txt
, which contains one gene ID per line:
ENSCAFG00000018897
ENSCAFG00000006950
ENSCAFG00000013069
ENSCAFG00000013670
ENSCAFG00000003247
The first grep
command is used to add the VCF header to the output file:
grep '^#' annotated_variants.vcf > annotated_variants.candidates.vcf
grep -f ensembl_ids.txt annotated_variants.vcf >> annotated_variants.candidates.vcf
The following requires yt-dlp, mplayer, ImageMagick, and gifsicle:
mkdir gif; cd gif
url=https://youtu.be/_YUAu0aP4DA
start=00:37; length=10
yt-dlp -f mp4 -o video_for_gif.mp4 $url
mplayer video_for_gif.mp4 -ao null -ss $start -endpos $length -vo png -vf scale=400:225
mogrify -format gif *.png
gifsicle --threads=2 --colors=256 --delay=4 --loopcount=0 --dither -O3 *.gif > animation.gif
The following uses ImageMagick to remove any edges that are exactly the same color as the corner pixels. A 30-pixel white border is then added to the sides and the top and bottom:
convert input.png -trim -bordercolor White -border 30x30 output.png
Use LICEcap.
Use Terminalizer to record a session. The following creates a file called session.yml
:
terminalizer record session
To convert the session to an animated GIF:
terminalizer render session
For more control over rendering options create an editable configuration file at ~/.terminalizer/config.yml
:
terminalizer init
The following uses ImageMagick to scale the image so that its width is 4000 pixels:
convert input.png -resize 4000 output.png
- Press Command+Shift+4.
- Press the space bar.
- Hold Option and click on the window.
To keep the drop shadow perform the last step without holding Option
.
To copy the screenshot to the clipboard instead of saving it to a file, use Command+Control+Shift+4
.
- Press F12 to open the Firefox Developer Tools.
- Enter :screenshot into the Web Console to download the current view as a PNG file.
To save a high-DPI webpage screenshot use :screenshot --dpr 4
.
To save a high-DPI full-page webpage screenshot use :screenshot --dpr 4 --fullpage
.
To delay the screenshot, to capture a menu for example, use :screenshot --dpr 4 --fullpage --delay 5
.
To copy the screenshot to the clipboard instead of saving it to a file, use :screenshot --clipboard
.
The following combines rows from two files based on shared identifiers and prints select columns from each file (using the -o
option). awk
is used to add column names to the output file:
join -t, -1 1 -2 1 -o 2.2,1.2,2.2,1.4,1.5,1.6 <(sort -t, -k1 file1.csv) <(sort -t, -k1 file2.csv) | awk -F, 'BEGIN{print "patient,status,sample,lane,fastq_1,fastq_2"}{print}' OFS=, > final_samplesheet.csv
join
is used to join lines from different files when they have matching join fields. join
expects the input files to be sorted.
Suppose file1.txt
contains the following text:
name,counts in sample 1
gene a,100
gene b,2223
gene e,575
Suppose file2.txt
contains this text:
counts in sample 2,name
2223,gene e
803,gene a
0,gene f
To join their contents based on name
, first sort the files based on the name
column (note that the sort field differs between the two commands):
(head -n 1 file1.txt && sort -t, -k 1 <(tail -n +2 file1.txt)) > file1.txt.sorted
(head -n 1 file2.txt && sort -t, -k 2 <(tail -n +2 file2.txt)) > file2.txt.sorted
Then use the join
command to combine the rows based on field 1
in the first file and field 2
in the second file:
join -t, -1 1 -2 2 file1.txt.sorted file2.txt.sorted
The above command returns the following:
name,counts in sample 1,counts in sample 2
gene a,100,803
gene e,575,2223
To format more nicely use:
join -t, -1 1 -2 2 file1.txt.sorted file2.txt.sorted | column -t -s,
Which gives:
name counts in sample 1 counts in sample 2
gene a 100 803
gene e 575 2223
To include genes present in just one of the files and to add a missing value character .
for the corresponding values that could not be obtained use:
join -t, -1 1 -2 2 -a1 -a2 -e . -o auto file1.txt.sorted file2.txt.sorted | column -t -s,
Which gives:
name counts in sample 1 counts in sample 2
gene a 100 803
gene b 2223 .
gene e 575 2223
gene f . 0
Another option is to use csvjoin from csvkit.
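For example, a minimal csvjoin sketch for the files above (joining on the shared name column; csvjoin performs an inner join by default, or a full outer join with --outer) could be:
csvjoin -c name file1.txt file2.txt > joined.csv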
Mamba is a reimplementation of the conda package manager in C++.
mamba activate ngs
mamba activate ngs
mamba install -y -c bioconda -c conda-forge picard
In this example an environment called ngs
is created:
mamba create -y --name ngs
mamba activate ngs
mamba install -y -c bioconda -c conda-forge multiqc fastqc trimmomatic bowtie2 subread samtools
mamba env create --file env-environment.yaml
mamba deactivate
Use the export
command while the environment is active:
mamba env export > env-environment.yaml
conda install mamba -n base -c conda-forge
To install Conda with included support for Mamba, use Miniforge.
mamba search -c bioconda -c conda-forge
mamba env list
mamba list
In this example the environment to remove is called my-env
:
mamba deactivate
mamba env remove --name my-env
mamba search -c bioconda -c conda-forge blast
md5sum *.vcf.gz > md5sum.txt
md5sum --check md5sum.txt
Miller can be used to work with CSV, TSV, and JSON files.
Perform multiple actions sequentially using then
:
mlr --csv put '$New_coverage = ($Coverage / 100)' then sort -f Breed -nr Coverage then cut -f InterbullID,New_coverage example.csv
From CSV to JSON:
mlr --icsv --ojson cat example.csv
From CSV to TSV:
mlr --icsv --otsv cat example.csv
To view the put
documentation:
mlr put --help
The following converts Breed
to uppercase and divides Coverage
by 100:
mlr --csv put '$Breed = toupper($Breed); $Coverage = ($Coverage / 100)' example.csv
New columns can be created:
mlr --csv put '$New_coverage = ($Coverage / 100)' example.csv
To view the cut
documentation:
mlr cut --help
The following extracts the Breed
and Coverage
columns:
mlr --csv cut -f Breed,Coverage example.csv
Print the column names followed by the records:
mlr --csv head -n 10 example.csv
Print with added formatting for readability:
mlr --icsv --opprint head -n 10 example.csv
Print each value with the column name in the form column=value
:
mlr --icsv --odkvp head -n 10 example.csv
Print the column names followed by the records:
mlr --csv tail -n 10 example.csv
Print with added formatting for readability:
mlr --icsv --opprint tail -n 10 example.csv
Print each field with the column name in the form column=field
:
mlr --icsv --odkvp tail -n 10 example.csv
To view the filter
documentation:
mlr filter --help
The following filters records based on Breed
, Dataset
, and Coverage
:
mlr --csv filter '$Breed == "Charolais" && $Dataset == "A" && $Coverage > 6' example.csv
To view a complete list of Miller verbs use the following:
mlr -l
To view documentation for a particular verb use mlr _verb_ --help
.
To view the sort
documentation:
mlr sort --help
The following first sorts alphabetically by Breed
and then numerically by Coverage
(from largest to smallest):
mlr --icsv --opprint sort -f Breed -nr Coverage example.csv
To view the stats1
and stats2
documentation:
mlr stats1 --help
mlr stats2 --help
To view several stats for the Coverage
column:
mlr --icsv --opprint stats1 -a sum,count,min,max,mean,mode -f Coverage example.csv
In this example the header is added to .tab
files and comes from a file called header.txt
. The files with the header added are saved with a .new
extension added:
for f in *.tab; do new=`echo $f | sed 's/\(.*\)\.tab/\1.tab.new/'`; paste -sd'\n' header.txt "$f" > "$new"; done
To replace the .tab
files with the .new
files:
for f in *.new; do new=`echo $f | sed 's/\(.*\)\.new/\1/'`; mv "$f" "$new"; done
Add my header text
to the start of all .csv
files in the current directory (works on macOS):
find . -name "*.csv" -exec sed -i '.bak' '1s/^/my header text\'$'\n/g' {} \;
Add my header text
to the start of all .csv
files in the current directory (works on Linux):
find . -name "*.csv" -exec sed -i '1s/^/my header text\n/' {} \;
echo 'new header line' | cat - file.txt > temp && mv temp file.txt
Use DB Browser for SQLite (DB4S).
Use mosdepth. The following generates a variety of useful output files prefixed with 20079
:
mosdepth 20079 20079.markdup.sorted.bam
The following changes the .gbff
extension to .gbk
:
for f in *.gbff; do
mv -- "$f" "${f%.gbff}.gbk"
done
If rename
is available, this may work:
rename 's/\.gbff$/.gbk/' *.gbff
Or this, depending on which rename
is installed:
rename .gbff .gbk *.gbff
Generate the key pair:
ssh-keygen
Copy the public key to the .ssh/authorized_keys
file on the other system using ssh-copy-id
:
ssh-copy-id -i ~/.ssh/id_rsa.pub user@remote-host.com
The following requires yt-dlp and ffmpeg:
playlist_URL=https://www.youtube.com/playlist?list=PL92319EECC1754042
yt-dlp -x -i --audio-format mp3 --audio-quality 320K --embed-thumbnail --geo-bypass --rm-cache-dir --continue "$playlist_URL"
Use LibreOffice.
Between two files, named file1.txt
and file2.txt
:
comm -12 <( sort file1.txt ) <( sort file2.txt )
Among all .txt
files in the current directory:
number_of_files=$(find . -name "*.txt" -print | wc -l | sed 's/[^0-9]*//g')
cat *.txt | sort | uniq -c | sed -n -e "s/^ *$number_of_files \(.*\)/\1/p"
The perl
portion is used to handle empty fields:
cat data.csv | perl -pe 's/((?<=,)|(?<=^)),/ ,/g;' | column -t -s, | less -S
The Prettier program supports many languages. The following command uses the --write
option to reformat files in-place:
prettier --write "*html"
curl ifconfig.me/all
echo "2*(42+42)" | bc
expr 6 + 2 \* 5
The following uses BLAST+ to compare a protein sequence query against the nr
database:
in=ILFGCMKP_01075.faa
out=ILFGCMKP_01075_nr.tab
blastp -query "$in" -remote -db nr -out "$out" -outfmt '7 qseqid stitle sstart send qcovs qcovhsp pident evalue' -evalue 1e-10
qalc is a calculator with support for units. The following uses qalc to convert 1.5 hours to minutes:
qalc "1.5 hours to minutes"
See the qalc documentation for more examples.
Use nohup
:
nohup mycommand &
When using &
the bash job ID is shown in brackets and the PID (process ID), e.g.:
[1] 1963
The output of the command can be found in nohup.out
.
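To capture the output in a different file, redirect it explicitly, e.g.:
nohup mycommand > mycommand.log 2>&1 &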
The command can be stopped using kill
and the PID, e.g.:
kill -9 1963
Use fdupes to find and remove duplicate files.
The following command finds duplicate files in the current directory and its subdirectories:
fdupes -r /path/to/directory
Use the -d
option to delete duplicates (you will be prompted to confirm each deletion):
fdupes -d -r /path/to/directory
To delete duplicates without being prompted, use the -N
option:
fdupes -dN -r /path/to/directory
tail -r input.txt > output.txt
Or:
tac input.txt > output.txt
The following uses cron
to run a script to copy various files and directories to a directory backed up by Dropbox.
Create the copy_to_dropbox.sh
script, editing as needed:
#!/bin/bash
PATH=/bin:/usr/bin/
home=/Users/myhome
target=${home}/Dropbox/backup
if [[ ! -e $target ]]; then
mkdir -p $target
elif [[ ! -d $target ]]; then
echo "$target already exists but is not a directory" 1>&2
fi
#copy files and directories of interest to $target
rsync --update -razv ${home}/.bash_profile $target/bash_profile
rsync --update -razv ${home}/lib $target
rsync --update -razv ${home}/bin $target
Test the script as follows:
chmod u+x copy_to_dropbox.sh
sudo env -i ./copy_to_dropbox.sh
Use crontab -e
to edit the crontab (cron table) file. Add the following to the crontab to run the script every day at noon (changing the path to copy_to_dropbox.sh
):
0 12 * * * /path/to/copy_to_dropbox.sh >/dev/null 2>&1
Alternatively, use the following to run the script once per hour between 8 am and 5 pm on weekdays:
0 8-17 * * 1-5 /path/to/copy_to_dropbox.sh >/dev/null 2>&1
To display the crontab use crontab -l
.
The following uses q to count distinct values in the column named SNP
:
q -H -d, "SELECT COUNT(DISTINCT(SNP)) FROM ./input.csv"
Another option is to use csvsql from csvkit.
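For example, a roughly equivalent csvsql query might look like the following (csvsql derives the table name from the filename, so input.csv is queried as input):
csvsql --query "SELECT COUNT(DISTINCT SNP) FROM input" input.csv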
In the following example 5 jobs are run at the same time:
parallel -j5 "gzip {}" ::: *.csv
The following example uses Kingfisher, which can download data from the ENA, NCBI, AWS, and GCP.
The accessions are in a file named SRR_Acc_List.txt
and are passed to Kingfisher using parallel
. The --resume
and --joblog
options allows the command to be re-run without repeating previously completed jobs.
cat SRR_Acc_List.txt | parallel --resume --joblog log.txt --verbose --progress -j 1 'kingfisher get -r {} -m ena-ascp aws-http prefetch'
In the following command {.}
is used to get the basename and remove the last extension of each input file:
parallel 'zcat < {} > {.}.unpacked' ::: *.gz
Or:
parallel 'gunzip {}' ::: *.gz
In the following example there are protein sequence files ending with .faa
in a directory called queries
and BLAST databases ending with .faa
in a directory called databases
.
The parallel
command with --dryrun
is used to display the blastp
commands that will be run. The second parallel
command executes the commands.
The output from each BLAST search is written to a directory called output
and is named after the query and database files (the {1/.}
and {2/.}
remove the path and extension from the query and database filenames, respectively, when constructing the output filename).
querydir=queries
databasedir=databases
outdir=output
mkdir "$outdir"
parallel --dryrun -j 1 "blastp -evalue 0.0001 -query {1} -db {2} -out $outdir/{1/.}_{2/.}.tab" ::: "$querydir"/*.faa ::: "$databasedir"/*.faa
parallel --verbose -j 1 "blastp -evalue 0.0001 -query {1} -db {2} -out $outdir/{1/.}_{2/.}.tab" ::: "$querydir"/*.faa ::: "$databasedir"/*.faa
Using the local system:
cat multiple_sequences.fasta | parallel --block 100k --recstart '>' --pipe blastp -evalue 0.01 -outfmt 6 -db database.fa -query - > results
Using the local system (denoted as :
below) and a remote system called server1
(connection details provided in .ssh/config
):
cat multiple_sequences.fasta | parallel -S :,server1 --block 100k --recstart '>' --pipe blastp -evalue 0.01 -outfmt 6 -db database.fa -query - > results
In this example each line of queries.csv
consists of a gene name followed by a protein accession number, e.g. narG,NP_415742.1
. The accession is used to download a protein sequence, which is written to a file named after the gene name.
cat queries.csv | parallel -k -j 1 --colsep ',' 'esearch -db protein -query "{2}[ACCESSION]" | efetch -format fasta > "{1}.faa"'
In this example the columns of two files are joined. The first file is a CSV file the second is tab-delimited.
The -d ","
specifies that the lines are to be joined with commas.
paste -d "," genotype_conversion.csv SNP_Map.tab
Note that the content from SNP_Map.tab
still contains tab-delimited values.
To remove tabs from SNP_Map.tab
you could first create a CSV version of that input file:
cut -d $'\t' -f 1- --output-delimiter=',' SNP_Map.tab > SNP_Map.csv
paste -d "," genotype_conversion.csv SNP_Map.csv
It is possible to do it without first creating SNP_Map.csv
by using process substitution.
In the following the command between <(
and )
is first run and its output becomes input for paste
:
paste -d "," genotype_conversion.csv \
<(cut -d $'\t' -f 1- --output-delimiter=',' SNP_Map.tab)
perl -lne '$a++ if /\tyes\t/; END {print $a+0}' < input.txt
The following uses perltidy
to reformat the code in testfile.pl
and will create a file called testfile.pl.tdy
.
perltidy testfile.pl
In this example a random sample of 20 lines is obtained:
tail -n +2 input.txt | perl -MList::Util -e 'print List::Util::shuffle <>' | head -n 20 > output.txt
perl -ne '/ENSBTAG00000028061/ && print' input.txt
perl -0777 -ne 'while (m/^\s+\/translation="([^"]+)"/gm) {$out = $1; $out =~ s/\s//g; print "$out\n"}' NM_001271626.3.gbk
perl -0777 -ne 'while (m/^\s+\/translation="([^"]+)"/gm) {print "$1\n"}' NM_001271626.3.gbk
perl -nle 'my @new = (); push( @new, $+ ) while $_ =~ m{"([^\"\\]*(?:\\.[^\"\\]*)*)",? | ([^,]+),? | ,}gx; push( @new, undef ) if substr( $text, -1, 1 ) eq '\'','\''; for(@new){s/,/ /g} print join "\t", @new' input.csv > output.tab
In this example, lines from input.txt
are written to output.txt
unless they start with some_text
:
perl -n -e 'print unless m/^some_text/' input.txt > output.txt
perl -p -i -e 's/\r\n$/\n/g' file.txt
perl -p -e 's/,/\t/g;' input.csv > output.tab
perl -p -e 's/\t/,/g;' -e 's/"//g;' input.tab > output.csv
perl -p -e 's/Red Angus/Angus/g' sequenced_samples.csv
Note that multiple substitutions can be performed in succession, e.g.:
echo " test pattern" | perl -pe 's/^\s+//g;' -pe 's/ /,/g;'
The above command produces the following:
test,pattern
perl -0777 -ne '(undef,@paragraphs) = split /^#(?=[^#])/m; print map {"#$_"} sort { "\U$a" cmp "\U$b" } @paragraphs;' input.md
Given this input:
# Answer
some answer
# Answer
some more
text
# Answer
the
final
answer
The command below creates three files called output-1.md
, output-2.md
, and output-3.md
containing the text from the first, second, and third # Answer
sections, respectively.
For example, output-1.md
contains:
# Answer
some answer
perl -ne 'BEGIN {$section_name=shift; $output_file_prefix=shift; undef $/;} $section_count = 1; $file = $ARGV; while (m/(\Q$section_name\E.*?)((?=\Q$section_name\E)|$)/sg) {open(w, ">", "$output_file_prefix-$section_count.md"); print w "$1"; $section_count++; close(w)}' "# Answer" "output" input.md
library(purrr)
library(tibble)
# vcf and genotypes are tibbles
# add columns from genotypes to vcf
for (column in names(genotypes)) {
vcf %>%
add_column(!!(column) := genotypes[[column]]) ->
vcf
}
library(dplyr)
# vcf is a tibble
# add comment character to start of first column name
vcf <- rename(vcf, `#CHROM` = CHROM)
# write out comment line and then column names
writeLines(c("##fileformat=VCFv4.2", paste(names(vcf), collapse = "\t")), con = "genotypes.vcf")
write.table(vcf, file = "genotypes.vcf", row.names = FALSE, col.names = FALSE, quote = FALSE, sep = "\t", append=TRUE)
library(tidyverse)
tb <- tribble(
~chr, ~pos, ~sample1, ~sample2, ~A, ~B,
#----|-----|---------|---------|-----|-----
"1", 2, "AA", "AB", "REF","ALT",
"1", 12, "BB", "AA", "ALT","REF",
"1", 12, ".", "AA", "ALT","REF",
)
# get names of sample columns
names(tb) %>%
str_subset(pattern = "^sample") ->
columns_to_decode
# convert genotypes to values from A and B columns
tb %>%
mutate_at(
vars(one_of(columns_to_decode)),
list(~case_when(
. == "AA" & A == "REF" ~ "0/0",
. == "AA" & A == "ALT" ~ "1/1",
. == "BB" & B == "REF" ~ "0/0",
. == "BB" & B == "ALT" ~ "1/1",
. == "AB" ~ "0/1",
TRUE ~ './.'))) ->
tb_converted
print(tb_converted)
## A tibble: 3 x 6
# chr pos sample1 sample2 A B
# <chr> <dbl> <chr> <chr> <chr> <chr>
#1 1 2 0/0 0/1 REF ALT
#2 1 12 0/0 1/1 ALT REF
#3 1 12 ./. 1/1 ALT REF
A different approach, using nested ifelse():
library(tidyverse)
tb <- tribble(
~chr, ~pos, ~sample1, ~sample2, ~A, ~B,
#----|-----|---------|---------|-----|-----
"1", 2, "AA", "AB", "REF","ALT",
"1", 12, "BB", "AA", "ALT","REF",
"1", 12, ".", "AA", "ALT","REF",
)
# get names of sample columns
names(tb) %>%
str_subset(pattern = "^sample") ->
columns_to_decode
# function to convert genotypes to values from A and B columns
convert_genotypes <- function (df, col) {
df[[col]] <-
ifelse((df[[col]] == "AA") & (df[["A"]] == "REF"), "0/0",
ifelse((df[[col]] == "AA") & (df[["A"]] == "ALT"), "1/1",
ifelse((df[[col]] == "BB") & (df[["B"]] == "REF"), "0/0",
ifelse((df[[col]] == "BB") & (df[["B"]] == "ALT"), "1/1",
ifelse(df[[col]] == "AB", "0/1", "./.")))))
return(df)
}
for (sample in columns_to_decode) {
tb <- convert_genotypes(tb, sample)
}
print(tb)
## A tibble: 3 x 6
# chr pos sample1 sample2 A B
# <chr> <dbl> <chr> <chr> <chr> <chr>
#1 1 2 0/0 0/1 REF ALT
#2 1 12 0/0 1/1 ALT REF
#3 1 12 ./. 1/1 ALT REF
Yet another approach, by passing column names to a function that uses mutate() and case_when():
library(tidyverse)
tb <- tribble(
~chr, ~pos, ~sample1_1, ~sample1_2, ~FORWARD_A, ~FORWARD_B,
#----|-----|-----------|-----------|-----------|-----------
"1", 2, "G", "G", "G", "T",
"1", 12, "A", "A", "C", "A",
"1", 12, ".", "G", "T", "G",
)
# get names of sample columns
names(tb) %>%
str_subset(pattern = "^sample") ->
columns_to_decode
# function to convert genotypes to values from A and B columns
convert <- function(df, col, format) {
A <- paste(format, "A", sep = "_")
B <- paste(format, "B", sep = "_")
object <- df %>%
mutate(
!!col := case_when(
eval(parse(text = col)) == eval(parse(text = A)) ~ "A",
eval(parse(text = col)) == eval(parse(text = B)) ~ "B",
TRUE ~ "."
)
)
return(object)
}
for (sample in columns_to_decode) {
tb <- convert(tb, sample, "FORWARD")
}
print(tb)
## A tibble: 3 x 6
# chr pos sample1_1 sample1_2 FORWARD_A FORWARD_B
# <chr> <dbl> <chr> <chr> <chr> <chr>
#1 1 2 A A G T
#2 1 12 B B C A
#3 1 12 . B T G
In this example a heatmap is used to visualize gene presence and absence for all gene lists in the gene_lists
directory. In this directory each list is given as a separate .txt
file, with a single header row and one gene name or ID per row, for example:
Gene name or identifier
ENSG00000264954.2
ENSG00000224383.8
CCDS54157.1.1
The following uses the purrr
and RVenn
packages:
library(purrr)
library(RVenn)
setwd('/path/to/gene_lists')
filenames <- list.files(pattern = "*.txt", full.names = FALSE)
# create list of character vectors, each named after the source filename
# assumes each file has single header line (skip = 1)
gl <- sapply(filenames, scan, character(), sep="\n", skip = 1, USE.NAMES = TRUE)
# remove underscores from vector names
names(gl) <- gsub(x = names(gl), pattern = "_", replacement = " ")
# remove file extension from vector names
names(gl) <- gsub(x = names(gl), pattern = "\\..+?$", replacement = "")
venn = Venn(gl)
setmap(venn, element_fontsize = 4, set_fontsize = 4)
The resulting heatmap displays genes and gene lists as rows and columns, respectively. The columns and rows are arranged so that genes and gene lists with similar presence / absence patterns are grouped together.
In the following example the rows from multiple files are combined. Additional columns are used to track the source of each row. The resulting long data is converted into two different wide formats and written to a single Excel file as separate worksheets.
The input files are structured as follows, with additional columns not shown:
fusion_name, junction_read_count
Wfdc3--AABR07030443.2, 16
Wfdc3--AABR07030443.2, 2
Pdk4--Pdk2, 10
AABR07059159.1--Actr3b, 8
Each file is named after the source sample. For example 180_S1_R1_001.fusion_candidates.preliminary
is the data for sample 180_S1_R1_001
.
library(tidyverse)
library(janitor)
library(xlsx)
# directory containing the files
input_dir <- "data/input/"
# pattern to use when identifying input files
file_pattern <- "*.preliminary"
# create tibble of input files (full path and file name in separate columns)
input_files <-
tibble(
full_path = list.files(input_dir, pattern = file_pattern, full.names = TRUE),
file_name = list.files(input_dir, pattern = file_pattern, full.names = FALSE)
)
# function to add source info to each row
read_tsv_and_add_source <- function(file_name, full_path) {
read_tsv(full_path) %>%
clean_names %>%
mutate(file_name = file_name) %>%
mutate(full_path = full_path) %>%
# convert '180_S1_R1_001.fusion_candidates.preliminary' to '180_S1_R1_001'
mutate(sample = str_split(file_name, ".fusion", simplify = TRUE)[[1]])
}
# read all files into a single tibble
input_files %>%
rowwise() %>%
do(., read_tsv_and_add_source(.$file_name, .$full_path)) ->
combined_data_with_source
# group data by fusion_name and sample and for each group calculate the sum of
# junction_read_count
combined_data_with_source %>%
group_by(fusion_name, sample) %>%
summarise(fusion_count = sum(junction_read_count), .groups = NULL) ->
counts_per_fusion
# filter by fusion_name, keeping rows where fusion_name consists of two MT
# genes, for example Mt-co1--Mt-nd2
counts_per_fusion %>%
filter(str_detect(fusion_name, "^Mt-")) %>%
filter(str_detect(fusion_name, "--Mt-")) ->
counts_per_MT_fusion
# convert the data from long to wide, with values of sample becoming columns
counts_per_MT_fusion %>%
spread(sample, fusion_count) %>%
replace(is.na(.), 0) ->
samples_as_columns
# convert the data from long to wide, with values of fusion_name becoming
# columns
counts_per_MT_fusion %>%
spread(fusion_name, fusion_count) %>%
replace(is.na(.), 0) ->
fusions_as_columns
# write the data to an Excel file
write.xlsx(
as.data.frame(samples_as_columns),
"output.xlsx",
sheetName = "Samples as columns",
col.names = TRUE,
row.names = FALSE,
append = TRUE
)
write.xlsx(
as.data.frame(fusions_as_columns),
"output.xlsx",
sheetName = "Fusions as columns",
col.names = TRUE,
row.names = FALSE,
append = TRUE
)
In this example SNP location information assigned by two different algorithms is compared using the compareDF
package.
The two data sets (position information) generated by the differing algorithms are read from files. SNP name is used to match rows across the data sets, and then the 'chromosome' and 'position' are compared.
SNP records for which 'chromosome' or 'position' differ between the data sets are written to an Excel file.
library(compareDF)
algorithm1_results <- read.csv2("algorithm1_results.csv", comment.char = "#", sep = ",", header = TRUE)
algorithm1_results_snp_positions <- data.frame(algorithm1_results$marker_name, algorithm1_results$chromosome, algorithm1_results$position)
colnames(algorithm1_results_snp_positions) <- c('SNP_name', 'chromosome', 'position')
algorithm2_results <- read.csv2("algorithm2_results.csv", comment.char = "#", sep = ",", header = TRUE)
algorithm2_results_snp_positions <- data.frame(algorithm2_results$SNP_name, algorithm2_results$chromosome, algorithm2_results$position)
colnames(algorithm2_results_snp_positions) <- c('SNP_name', 'chromosome', 'position')
#compare positions between data sets, matching based on SNP_name
ctable = compare_df(algorithm1_results_snp_positions, algorithm2_results_snp_positions, c('SNP_name'))
output_file <- paste("positions", "algorithm1_results", "vs", paste("algorithm2_results", ".xlsx", sep=""), sep="_")
create_output_table(ctable, output_type="xlsx", file_name=output_file, limit=1000000)
library(tidyverse)
# vcf is tibble
# remove rows where POS is NA
vcf %>% drop_na(POS) ->
vcf
# keep rows with single base in REF and ALT
vcf %>%
filter(str_detect(REF, "^[GATCN]$")) %>%
filter(str_detect(ALT, "^[GATCN]$")) ->
vcf
# sort by chromosome then position
vcf %>%
arrange(CHROM, POS) ->
vcf
To convert this:
sample alleles
HOCANF12689774 1000112112
HOCANF12689787 2011112012
HOCANF12689790 1011002122
To this:
sample Hapmap43437-BTA-101873 ARS-BFGL-NGS-16466 Hapmap34944-BES1_Contig627_1906 ARS-BFGL-NGS-98142 Hapmap53946-rs29015852 ARS-BFGL-NGS-114208 ARS-BFGL-NGS-66449 ARS-BFGL-BAC-32770 ARS-BFGL-NGS-65067 ARS-BFGL-BAC-32722
HOCANF12689774 AB BB BB BB AB AB AA AB AB AA
HOCANF12689787 AA BB AB AB AB AB AA BB AB AA
HOCANF12689790 AB BB AB AB BB BB AA AB AA AA
Use this:
library(dplyr)
library(tidyr)
# prepare sample data frame
sample <- c('HOCANF12689774', 'HOCANF12689787', 'HOCANF12689790')
alleles <- c('1000112112', '2011112012', '1011002122')
genotypes <- data.frame(sample, alleles)
# vector of snp names
snps <- c('Hapmap43437-BTA-101873', 'ARS-BFGL-NGS-16466', 'Hapmap34944-BES1_Contig627_1906', 'ARS-BFGL-NGS-98142', 'Hapmap53946-rs29015852', 'ARS-BFGL-NGS-114208', 'ARS-BFGL-NGS-66449', 'ARS-BFGL-BAC-32770', 'ARS-BFGL-NGS-65067', 'ARS-BFGL-BAC-32722')
# create a new column for each digit in alleles
genotypes_one_column_per_snp <- separate(genotypes, col = alleles, sep = "(?<=\\d)", into = snps)
# convert each digit to textual representation
genotypes_one_column_per_snp_decoded = reshape2::dcast(
dplyr::mutate(
reshape2::melt(genotypes_one_column_per_snp, id.var = "sample"),
value=plyr::mapvalues(
value, c("0", "1", "2"), c("BB", "AB", "AA"))
),sample~variable)
write.table(genotypes_one_column_per_snp_decoded, file = "genotypes_one_column_per_snp_decoded.txt", row.names = FALSE, col.names = TRUE, quote = FALSE, sep = " ")
To convert this:
sample ABCA12 APAF1 ARS-BFGL-BAC-10172 ARS-BFGL-BAC-1020
sample1 AA CC GG AA
sample2 AA CC AG GG
To this:
sample ABCA12_1 ABCA12_2 APAF1_1 APAF1_2 ARS-BFGL-BAC-10172_1 ARS-BFGL-BAC-10172_2 ARS-BFGL-BAC-1020_1 ARS-BFGL-BAC-1020_2
sample1 A A C C G G A A
sample2 A A C C A G G G
Use this:
library(dplyr)
library(tidyr)
# prepare sample data frame
sample <- c('sample1', 'sample2')
ABCA12 <- c('AA', 'AA')
APAF1 <- c('CC', 'CC')
`ARS-BFGL-BAC-10172` <- c('GG', 'AG')
`ARS-BFGL-BAC-1020` <- c('AA', 'GG')
genotypes <- tibble(sample, ABCA12, APAF1, `ARS-BFGL-BAC-10172`, `ARS-BFGL-BAC-1020`)
# [-1] prevents first column from being split
for(column_name in names(genotypes)[-1]){
genotypes %>%
separate(column_name, c(paste(column_name, "1", sep = "_"), paste(column_name, "2", sep = "_")), sep = 1) ->
genotypes
}
write.table(genotypes, file = "genotypes_split.txt", row.names = FALSE, col.names = TRUE, quote = FALSE, sep = " ")
To convert this:
snp sample1 sample2
ABCA12 AA AA
APAF1 CC CC
ARS-BFGL-BAC-10172 GG AG
ARS-BFGL-BAC-1020 AA GG
To this:
snp ABCA12 APAF1 ARS-BFGL-BAC-10172 ARS-BFGL-BAC-1020
sample1 AA CC GG AA
sample2 AA CC AG GG
Use this:
library(dplyr)
library(tidyr)
library(janitor)
# prepare sample data frame
snp <- c('ABCA12', 'APAF1', 'ARS-BFGL-BAC-10172', 'ARS-BFGL-BAC-1020')
sample1 <- c('AA', 'CC', 'GG', 'AA')
sample2 <- c('AA', 'CC', 'AG', 'GG')
genotypes <- data.frame(snp, sample1, sample2)
genotypes %>%
tibble::rownames_to_column() %>%
pivot_longer(-rowname) %>%
pivot_wider(names_from=rowname, values_from=value) ->
genotypes_transposed
genotypes_transposed %>%
row_to_names(row_number = 1) ->
genotypes_transposed_with_column_names
write.table(genotypes_transposed_with_column_names, file = "genotypes_transposed_with_column_names.txt", row.names = FALSE, col.names = TRUE, quote = FALSE, sep = " ")
Or use this:
snp <- c('ABCA12', 'APAF1', 'ARS-BFGL-BAC-10172', 'ARS-BFGL-BAC-1020')
sample1 <- c('AA', 'CC', 'GG', 'AA')
sample2 <- c('AA', 'CC', 'AG', 'GG')
genotypes <- data.frame(snp, sample1, sample2)
genotypes_transposed <- data.frame(t(genotypes[-1]))
colnames(genotypes_transposed) <- genotypes[, 1]
write.table(genotypes_transposed, file = "genotypes_transposed_with_column_names.txt", row.names = FALSE, col.names = TRUE, quote = FALSE, sep = " ")
Or use datamash.
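For example, a minimal datamash sketch, assuming the genotype table above is in a whitespace-delimited file named genotypes.txt, could be:
datamash -W transpose < genotypes.txt > genotypes_transposed.txt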
When working with R, especially in exploratory data analysis or when inheriting someone else's code, you might encounter objects of unknown type or structure. Below are some functions to help you inspect and understand such objects:
# Assume 'object' is the unknown object you're trying to inspect
# Check the class of the object
class(object)
#> "data.frame", "numeric", "factor", etc.
# Check the underlying type or storage mode
typeof(object)
#> "double", "integer", "list", "character", etc.
# Get the dimensions (if applicable, e.g., for matrices and data frames)
dim(object)
#> e.g., c(100, 5) for a data frame with 100 rows and 5 columns
# Check the length (number of elements or components)
length(object)
#> e.g., 100 for a vector of length 100
# View a compact structure of the object
str(object)
#> Outputs structure directly to the console
# Inspect all attributes of the object
attributes(object)
#> e.g., $names, $class, $row.names, etc.
# Get a statistical summary (if applicable)
summary(object)
#> Summary statistics for each column of a data frame, for example
# Preview the first few elements or rows of the object
head(object)
#> The first 6 elements for vectors or the first 6 rows for data frames
# Check if the object is of certain types
is.vector(object)
is.list(object)
is.matrix(object)
#> Returns TRUE or FALSE
In this example, an UpSet plot is used to visualize the overlap among all combinations of gene lists in the gene_lists
directory. In this directory each list is given as a separate .txt
file, with a single header row and one gene name or ID per row, for example:
Gene name or identifier
ENSG00000264954.2
ENSG00000224383.8
CCDS54157.1.1
The UpSet plot is generated using the UpSetR
package:
library(UpSetR)
setwd('/path/to/gene_lists')
filenames <- list.files(pattern = "*.txt", full.names = FALSE)
# create list of character vectors, each named after the source filename
# assumes each file has single header line (skip = 1)
gl <- sapply(filenames, scan, character(), sep="\n", skip = 1, USE.NAMES = TRUE)
# remove underscores from vector names
names(gl) <- gsub(x = names(gl), pattern = "_", replacement = " ")
# remove file extension from vector names
names(gl) <- gsub(x = names(gl), pattern = "\\..+?$", replacement = "")
upset(fromList(gl), nsets = length(gl), order.by = "freq")
The resulting plot displays the number of items shared among all possible combinations of overlapping sets in an easy-to-interpret and parse manner (unlike a traditional Venn diagram).
rsync -avzh user@192.168.0.101:~/source_directory destination
rsync -avzh source_directory destination_directory
To sync only the contents of source_directory
and not the directory itself use:
rsync -avzh source_directory/ destination_directory
rsync -avzh source_directory user@192.168.0.101:~/destination
sed $'1s/^/my header text\\\n&/' input
In this example chr30
is replaced with chrX
:
for f in *.fasta; do new=`echo $f | sed 's/chr30/chrX/'`; mv $f $new; done
The following deletes the first line:
sed '1d' sequenced_samples.csv
The following deletes from line 500 to the end of the file (represented by $
):
sed '500,$d' sequenced_samples.csv
In this example pos,reads_pp
is changed to pos,reads
in the first line of the file. The -i
is used to edit the file in place:
sed -i '' -e "1,1s/pos,reads_pp/pos,reads/" input.csv
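The empty string after -i is required by the BSD sed shipped with macOS; with GNU sed (Linux) the same in-place edit is typically written without it:
sed -i -e "1,1s/pos,reads_pp/pos,reads/" input.csv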
In this example line 26404
is printed:
sed -n "26404p" input.txt
The sed
command can be used to find and replace text and supports its own regular expression syntax.
The following replaces all occurrences of Red Angus
with Angus
:
sed 's/Red Angus/Angus/g' sequenced_samples.csv
In the above the g
indicates that all matches on a line should be replaced.
The following restricts the processing to lines 302 to 305:
sed '302,305 s/Red Angus/Angus/g' sequenced_samples.csv
The following replaces the first occurrence of LIM
on each line with ---
:
sed 's/LIM/---/1' sequenced_samples.csv
The following adds underscores around the first three characters of each line:
sed 's/^\([a-zA-Z0-9]\{3\}\)/_\1_/g' sequenced_samples.csv
The above uses \(
and \)
to "remember" matching characters and then uses \1
to use the stored matching text in the replacement portion. Thus the replacement text becomes _
followed by the actual matched text, followed by _
.
The different parts of the search pattern have the following meanings:
- ^ matches the start of a line.
- [a-zA-Z0-9]\{3\} matches three letters or numbers.
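As a quick illustration of the capture and replacement, using echo instead of the sample file:
echo "ABC123,example" | sed 's/^\([a-zA-Z0-9]\{3\}\)/_\1_/g'
# prints _ABC_123,example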
cd projects/some_project
chmod g+x my_dir
cd my_dir
mkdir shared_dir
chmod g+x shared_dir
chmod +t shared_dir
In this example a container that can be downloaded from Docker Hub using docker pull pstothard/cgview
is used to generate a Singularity container:
singularity build cgview.sif docker://pstothard/cgview
To run a command in this container, use something like the following:
singularity exec -B /scratch cgview.sif java -jar /usr/bin/cgview.jar --help
The -B
is used to provide the container with access to directories.
scancel <jobid>
scancel -u <username>
The module load
command can be included within the shell
sections of a Snakemake workflow, to load the software needed for the commands that follow.
Sometimes different modules have conflicting software environments, so they are usually loaded only in the shell
sections where they are required, e.g.:
shell:
"""
module load emboss
# commands here that use emboss
"""
If there is a module that needs to be loaded for every rule, the shell.prefix()
command can be added to the top of the Snakefile. Note the semicolon and space before the final double quote in the value passed to shell.prefix()
:
shell.prefix("module load python/3.8; ")
Resource requests are specified for each rule where the jobs are submitted to the cluster, using the resources
parameter. In the following example, the cores
value specifies the number of cores for a multi-threaded application, mem_mb
is the amount of memory requested in megabytes, and runtime
is the walltime requested, in minutes:
resources:
cores = 1,
mem_mb = 100,
runtime = 30
Rules that run commands that require few resources to run (e.g. downloading a small file) can be marked as localrules
, which causes Snakemake to run them directly on the login node instead of submitting them to the job queue for execution on a compute node:
localrules: gc_content, some_other_rule
Here is a simple Snakemake workflow (saved in a file called Snakefile) for calculating the GC content of sequences:
accessions, = glob_wildcards("fasta_files/{accession}.fa")

rule final:
    input:
        "result/gc_table.txt"

rule gc_content:
    input:
        "fasta_files/{accession}.fa"
    output:
        "gc/{accession}_gc.txt"
    params:
        "{accession}"
    resources:
        cores = 1,
        mem_mb = 100,
        runtime = 30
    shell:
        """
        module load emboss
        GC=$(infoseq -noheading -auto -only -pgc {input})
        echo "{params}: $GC" > {output}
        """

rule collect_gc:
    input:
        expand("gc/{accession}_gc.txt", accession=accessions)
    resources:
        cores = 1,
        mem_mb = 100,
        runtime = 30
    output:
        "result/gc_table.txt"
    shell:
        "cat {input} > {output}"
Install Snakemake using pip:
module load python
pip install snakemake --user
Once installed, the snakemake command can be used to run the workflow. You don't need to reinstall Snakemake each time you want to run a workflow. The --user option used above installs Snakemake in your home directory.
To run the workflow, use the snakemake command with the --cluster parameter:
sbc="sbatch -N 1 -c {resources.cores}\
--mem={resources.mem_mb}\
--time={resources.runtime}\
--account=def-someuser"
snakemake --cluster "$sbc" --printshellcmds -j 10
When the above command is run, Snakemake will display progress to the screen as it submits jobs. Typically you would use tmux or screen to run Snakemake in a terminal session that you can disconnect from and reconnect to later; however, this workflow is simple enough that it should complete in a few minutes.
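For example, one way to do this with tmux (a sketch using the tmux commands covered elsewhere in this document; the session name is arbitrary):
tmux new -s snakemake
# inside the tmux session, start the workflow:
snakemake --cluster "$sbc" --printshellcmds -j 10
# detach with Ctrl-b then d; reattach later with:
tmux attach -t snakemake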
Note that the --dry-run option can be added to a snakemake command to see what jobs will be run, without actually executing them.
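For example, reusing the $sbc variable defined above:
snakemake --cluster "$sbc" --printshellcmds -j 10 --dry-run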
The --printshellcmds option used above is useful for troubleshooting, as it prints to the screen exactly what is submitted by the shell section of a rule. Here is an example of the rule gc_content shell commands printed to the screen:
module load emboss
GC=$(infoseq -noheading -only -pgc fasta_files/NC_005831.2.fa)
echo "NC_005831.2: $GC" > gc/NC_005831.2_gc.txt
Using the -j flag to limit the number of concurrently submitted or running jobs can help prevent reductions in your account's priority on the cluster. This option is especially important when running a workflow that has the potential to submit hundreds of jobs simultaneously.
Once the jobs are submitted their status can be checked:
squeue -u $USER
Once the workflow has completed, the results can be viewed:
more result/gc_table.txt
The nf-core project provides a collection of Nextflow workflows for bioinformatics.
In the following example the nf-core/sarek workflow for detecting germline or somatic variants is used to process a test data set.
See the DRAC Nextflow documentation for more information on running Nextflow workflows on Alliance clusters.
First, load some modules and install nf-core tools using a Python virtual environment and pip (this step is slow but only needs to be done once). The nf-core tools package can be used to download nf-core pipelines and dependencies:
module purge && module load python/3.8 && module load postgresql/15.3
python -m venv $HOME/nf-core-env # create a virtual environment
source $HOME/nf-core-env/bin/activate # activate the environment
python -m pip install nf_core==2.6 # install nf-core tools in the environment
In the future you can activate the environment using:
source $HOME/nf-core-env/bin/activate
To deactivate the environment use:
deactivate
To view a list of available nf-core pipelines use the following:
source $HOME/nf-core-env/bin/activate
nf-core list
Set environment variables to store the name of the pipeline we are going to run and its version:
export NFCORE_PL=sarek; export PL_VERSION=3.2.0
sarek is the name of the pipeline, and 3.2.0 is the version. The export command allows you to set environment variables that will be available to any commands you run in the current shell session, including jobs submitted to the cluster.
If you log out of the cluster you will need to set these variables again for some of the commands below to work.
Create a folder to hold Apptainer/Singularity containers. Apptainer/Singularity is a containerization platform like Docker. These containers will provide the programs that the pipeline will execute as it runs (e.g. bwa and samtools):
mkdir -p ~/scratch/singularity
Set an environment variable to tell nf-core tools and Nextflow where we want the containers stored (both programs use the $NXF_SINGULARITY_CACHEDIR environment variable for this purpose):
export NXF_SINGULARITY_CACHEDIR=~/scratch/singularity
Download the pipeline and the containers it uses:
cd ~/scratch
module load nextflow/22.10.6
module load apptainer/1.1.6
nf-core download --singularity-cache-only --container singularity \
--compress none -r ${PL_VERSION} -p 6 ${NFCORE_PL}
The pipeline code is downloaded to the working directory, whereas the containers are downloaded to the folder specified by the NXF_SINGULARITY_CACHEDIR environment variable.
With the pipeline and containers downloaded, we can now run the pipeline using a test data set provided by the pipeline developers.
Create a Nextflow configuration file called nextflow.config containing the following:
singularity{
autoMounts = true
}
process {
executor = 'slurm'
pollInterval = '60 sec'
submitRateLimit = '60/1min'
queueSize = 100
errorStrategy = 'retry'
maxRetries = 1
errorStrategy = { task.exitStatus in [125,139] ? 'retry' : 'finish' }
memory = { check_max( 4.GB * task.attempt, 'memory' ) }
cpus = 1
time = '3h'
}
profiles {
beluga{
max_memory='186G'
max_cpu=40
max_time='168h'
}
narval{
max_memory='249G'
max_cpu=64
max_time='168h'
}
cedar{
max_memory='124G'
max_cpu=32
max_time='168h'
}
}
To create this file using vim, enter vim nextflow.config and then in vim use :set paste to enter paste mode. Next, right-click to paste the above, then enter :set nopaste to exit paste mode, and enter :wq to save and exit.
Nextflow itself uses too many resources to be run on the login node, so we will run it as a job. Create a batch script called sarek.sbatch with the following contents:
#!/bin/bash
#SBATCH --job-name=sarek
#SBATCH --output=sarek_%j.out
#SBATCH --error=sarek_%j.err
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16000M
#SBATCH --time=3:00:00
module load nextflow/22.10.6
module load apptainer/1.1.6
nextflow run nf-core-${NFCORE_PL}-${PL_VERSION}/workflow/ \
--clusterOptions "--account=def-${USER}" \
-c nextflow.config \
-ansi-log false -resume \
-profile test,singularity,cedar --outdir sarek_output
To create this file in vim, enter vim sarek.sbatch and then in vim use :set paste to enter paste mode. Right-click to paste the above, enter :set nopaste to exit paste mode, and enter :wq to save and exit.
The script loads two modules that are needed to run Nextflow and the Apptainer/Singularity containers.
The nextflow run command is used to run the workflow we downloaded earlier.
The --clusterOptions option is used to pass options to the Slurm job scheduler. In this case we are requesting that the job be run using our default allocation.
The nextflow.config file provides information to help Nextflow run jobs on the cluster. This configuration file is passed to Nextflow using the -c option and can be reused for other Nextflow workflows.
The -ansi-log false option is used to disable the use of ANSI escape codes in the log file. This is useful when viewing the log file in a text editor.
The -resume option is used to tell Nextflow to resume a previous run if it was interrupted. This option is useful when running a workflow that takes a long time to complete, since any intermediate results that were generated will be used instead of being regenerated.
The -profile option is used to specify that test data will be used, that Apptainer/Singularity containers will be used, and that the pipeline will be run using information provided in the cedar section of the nextflow.config file.
The --outdir option specifies the folder where the pipeline results will be written.
Submit the batch job:
sbatch --account=def-${USER} sarek.sbatch
To view the status of the job:
squeue --format="%i %u %j %t" -u $USER | column -t
Once this job is running, Nextflow will start submitting its own jobs, for the various steps of the pipeline. These jobs will be included in the output of the squeue command. You can log out of the cluster and the pipeline will continue to run.
Once all the jobs are complete, examine the .out and .err files as well as the files in the sarek_output folder.
The .out file will contain progress and error messages and will indicate whether the pipeline completed successfully. The .err file may contain error messages depending on the types of errors encountered.
The work folder that is created by Nextflow contains the output of each step of the pipeline. The contents of this folder can be used to resume a pipeline run if it is interrupted.
If the pipeline doesn't complete within the requested time (3 hours) you can re-run it using the same command as before, and it will continue from where it left off. You can also double the time requested by adding --time=6:00:00 to the sbatch command.
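For example:
sbatch --time=6:00:00 --account=def-${USER} sarek.sbatch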
Note that if you log out of the cluster you will need to set the environment variables again before re-running the pipeline:
cd ~/scratch
source $HOME/nf-core-env/bin/activate
export NXF_SINGULARITY_CACHEDIR=~/scratch/singularity; \
export NFCORE_PL=sarek; export PL_VERSION=3.2.0
Running an nf-core pipeline on a full data set requires preparing the necessary input files and providing more information to the pipeline using command-line parameters. See the nf-core/sarek documentation for more information on how to do this with the sarek pipeline. You will also want to increase the time requested in the sarek.sbatch file.
It isn't unusual for some jobs to fail when running a pipeline on a full data set. These failures can be due to a variety of reasons, including insufficient resources requested, or a problem with the input data.
The error messages in the .out file serve as a starting point when trying to overcome failed runs. The -resume option helps to minimize the impact of these failures by allowing the pipeline to resume from the point of failure.
Note that the scratch folder is regularly cleaned out by administrators. If you want to keep the pipeline results, move them to your project or home folder, or download them to your local computer.
salloc --time=2:0:0 --ntasks=2 --account=def-someuser --mem-per-cpu=8000M --mail-type=ALL --mail-user=your.email@example.com
sacct -s CD --format=JobID,JobName,MaxRSS,ReqMem,Elapsed,End,State,NodeList
scontrol show job -dd <jobid>
squeue -u <username>
squeue -u <username> -t PENDING
squeue -u <username> -t RUNNING
seff <jobid>
sort -t, input.csv
In the following examples a file with a single header line is sorted:
(head -n 1 sequenced_samples.csv && sort -t, <(tail -n +2 sequenced_samples.csv))
(head -n 1 sequenced_samples.csv && tail -n +2 sequenced_samples.csv | sort -t,)
cat sequenced_samples.csv | awk 'NR<2{print $0; next}{print $0| "sort -t','"}'
The above command can be modified to sort by the second column, numerically from smallest to largest:
cat sequenced_samples.csv | awk 'NR<2{print $0; next}{print $0| "sort -t',' -k2,2n"}'
This example uses the following input from the text file input.txt:
chrX Other
chrY Information
MT !
chr1 Some
chr2 Data
chr3 Or
chr3 Annotation
chr10 Or
chr21 Any
To sort by chromosome use the following:
sort -k1,1 -V -f -s input.txt
The above generates the following output:
chr1 Some
chr2 Data
chr3 Or
chr3 Annotation
chr10 Or
chr21 Any
chrX Other
chrY Information
MT !
The -k option of sort is used to indicate that certain fields are to be used for sorting (the sorting is still applied to the entire line).
The following sorts only using column 5 (5,5 means "starting at column 5 and ending at column 5"), numerically, from smallest to largest:
(head -n 1 sequenced_samples.csv && \
tail -n +2 sequenced_samples.csv | sort -t, -k5,5n)
The following sorts using column 2 to the end of line, alphabetically:
(head -n 1 sequenced_samples.csv && sort -t, -k2 \
<(tail -n +2 sequenced_samples.csv))
The command below sorts using column 2, breaking ties using the value in column 5 (sorting numerically from largest to smallest). The input file has a single header row, which is excluded from the sort.
cat sequenced_samples.csv | awk \
'NR<2{print $0; next}{print $0| "sort -t, -k2,2 -k5,5nr"}'
The following creates three panes: one large one at the top and two smaller ones at the bottom:
tmux new-session -s multiple \; \
split-window -v -p 25 \; \
split-window -h -p 50 \; \
select-pane -t 0 \;
The following also creates a three-pane tmux session, and launches vim in the largest pane:
tmux new-session -s multiple \; \
send-keys 'vim' C-m \; \
split-window -v -p 25 \; \
split-window -h -p 50 \; \
select-pane -t 0 \;
The following creates six equally sized panes:
tmux new-session -s multiple \; \
split-window -h \; \
split-window -v -p 66 \; \
split-window -v \; \
select-pane -t 0 \; \
split-window -v -p 66 \; \
split-window -v \;
Ctrl-b then d.
tmux attach -t session_name
tmux kill-session -t session_name
tmux ls
Ctrl-b then an arrow key.
The following commands work with my configuration:
- Ctrl-j moves up.
- Ctrl-k moves down.
- Ctrl-h moves left.
- Ctrl-l moves right.
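These bindings are not tmux defaults. A ~/.tmux.conf along the following lines (a sketch, not necessarily the configuration used here) would produce this behaviour:
bind-key -n C-j select-pane -U    # Ctrl-j moves up
bind-key -n C-k select-pane -D    # Ctrl-k moves down
bind-key -n C-h select-pane -L    # Ctrl-h moves left
bind-key -n C-l select-pane -R    # Ctrl-l moves right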
Ctrl-b then PgUp. Press q to return to normal mode.
Alternatively you can use Ctrl-b then [ and then navigation keys like Up Arrow or PgDn. Press q to return to normal mode.
tmux new -s session_name
Use the -d option with tr to delete characters. The following tr command deletes all digits:
cat sequenced_samples.csv | tr -d "[:digit:]"
To delete everything except for digits, use the following tr command:
cat sequenced_samples.csv | tr -c -d "[:digit:]"
"Squeezing" is used here to mean "removing repeated instances". Typically this would be used to remove duplicated spaces or punctuation.
The following illustrates the removal of extra commas by using tr with the -s option:
echo "a,b,,c,,,d" | tr -s ","
The following changes all S and H characters to s and h, respectively:
cat sequenced_samples.csv | tr "SH" "sh"
The sets of characters provided to tr can be the actual characters (as in the above example) or they can be one of several special characters or character groups (use tr --help to learn more).
For example, the following tr command changes uppercase text to lowercase text:
To change all digits to X use tr as in the following example:
cat sequenced_samples.csv | tr "[:digit:]" "X"
The following tr command changes all characters that aren't G, A, T, or C (ignoring case) or whitespace to N:
echo "garsxnT" | tr -c "GATCgatc[:space:]" "N"
The above command uses the complement option (-c) to convert the first set to everything that isn't in the set.
To add all tags, e.g. AC_Hom, AC_Het, etc.:
bcftools +fill-tags input.vcf -o output.vcf
Use SnpEff to predict variant effects.
List available pre-built databases for annotation:
java -jar snpEff.jar databases
Download a database:
java -jar snpEff.jar download -v CanFam3.1.99
Annotate a VCF file:
java -Xmx8g -jar snpEff.jar CanFam3.1.99 input.vcf > input.ann.vcf
Alternatively, use Ensembl VEP to predict variant effects. VEP can be used to annotate structural variants.
SPECIES=canis_lupus
ASSEMBLY=CanFam3.1
INPUT=input.vcf
VEP_CACHE=vep
OUTDIR=vep_annotated
WD="$(pwd)"
export PERL5LIB="$PERL5LIB:$WD/$VEP_CACHE"
export DYLD_LIBRARY_PATH="$DYLD_LIBRARY_PATH:$WD/$VEP_CACHE/htslib"
# create cache directory and output directory
mkdir -p $OUTDIR
mkdir -p $VEP_CACHE
# build cache
vep_install -a cfp -s $SPECIES -y $ASSEMBLY \
-c $VEP_CACHE \
-d $VEP_CACHE \
--PLUGINS all --CONVERT
# run vep
SPECIES=canis_lupus_familiaris
vep --cache --format vcf --vcf \
--dir_cache $VEP_CACHE \
--dir_plugins $VEP_CACHE/Plugins \
--input_file $INPUT \
--output_file $OUTDIR/$INPUT \
--species $SPECIES --assembly $ASSEMBLY \
--max_sv_size 1000000000 \
--force_overwrite \
--plugin Blosum62 --plugin Downstream --plugin Phenotypes --plugin TSSDistance --plugin miRNA \
--variant_class --sift b --nearest gene --overlaps --gene_phenotype --regulatory --protein \
--symbol --ccds --uniprot --biotype --domains --check_existing --no_check_alleles --pubmed \
--verbose
Use SnpSift annotate to add variant IDs.
In this example the file canis_lupus_familiaris.sorted.vcf has variant IDs to be transferred to input.vcf.
bgzip canis_lupus_familiaris.sorted.vcf
tabix -p vcf canis_lupus_familiaris.sorted.vcf.gz
java -jar SnpSift.jar annotate \
canis_lupus_familiaris.sorted.vcf.gz \
input.vcf > input.rsID.vcf
# vcftools adds .het extension to output automatically, so output becomes output.vcf.het
vcftools --vcf input.vcf --chr X --het --out output.vcf
# add column to vcftools output reporting percent heterozygous genotypes
awk -F$'\t' 'BEGIN{OFS="\t"}; {if(NR==1){print $0,"Percent HET"} else {print $0, ($4 - $2) / $4 * 100}}' output.vcf.het > output.vcf.percent_het
Use --check-sex in plink.
First convert the VCF file to a plink binary fileset:
species=dog
pseudoX_start=340475
pseudoX_end=6642728
mkdir plink_bed
plink --vcf input.vcf --make-bed --allow-extra-chr \
--$species --split-x $pseudoX_start $pseudoX_end \
--out plink_bed/input
Next edit the .fam file in plink_bed to describe family relationships, sex, and case / control status.
The following is from the plink documentation:
A text file with no header line, and one line per sample with the following six fields:
- Family ID ('FID')
- Within-family ID ('IID'; cannot be '0')
- Within-family ID of father ('0' if father isn't in dataset)
- Within-family ID of mother ('0' if mother isn't in dataset)
- Sex code ('1' = male, '2' = female, '0' = unknown)
- Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control)
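For example, a .fam file describing two unrelated samples might look like the following (the family and sample IDs reuse names from this document; the sex and phenotype codes are hypothetical values chosen only to illustrate the six fields):
family1 M-2 0 0 2 2
family2 M-15 0 0 1 1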
Use --check-sex:
mkdir plink_check_sex
plink --bfile plink_bed/input --check-sex --allow-extra-chr \
--$species --out plink_check_sex/input
cat plink_check_sex/input.sexcheck
In the output 1 = male, 2 = female, other = unknown.
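Assuming the standard .sexcheck column layout (FID, IID, PEDSEX, SNPSEX, STATUS, F), samples whose reported and inferred sexes disagree can be listed with something like:
awk '$5 == "PROBLEM"' plink_check_sex/input.sexcheck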
First determine current order of samples:
bcftools query -l input.vcf > sample_order.txt
Next edit sample_order.txt to reflect the desired sample order.
For example, change the contents from this:
M-9
M-10
M-15
M-16
M-2
To this:
M-2
M-9
M-10
M-15
M-16
Generate a VCF file with the new sample order:
bgzip input.vcf
tabix -p vcf input.vcf.gz
bcftools view -S sample_order.txt \
input.vcf.gz > input.revised_sample_order.vcf
Use the KING algorithm, which is implemented in vcftools and is accessed using the --relatedness2 option:
mkdir relatedness
vcftools --vcf input.vcf --not-chr X --max-missing-count 0 --relatedness2 \
--out relatedness/input.vcf
cat relatedness/input.vcf.relatedness2 | column -t
Note that when using this approach the source files must have the same sample columns appearing in the same order.
In this example the contents of snps.vcf and indels.vcf are combined.
bgzip snps.vcf
bgzip indels.vcf
tabix -p vcf snps.vcf.gz
tabix -p vcf indels.vcf.gz
bcftools concat --allow-overlaps \
snps.vcf.gz \
indels.vcf.gz \
-Oz -o snps_and_indels.vcf.gz
First remove header content:
grep -v '^##' input.vcf > input.tsv
Then use VisiData to convert the TSV file to an Excel file:
vd input.tsv -b -o input.xlsx
Use --geno-counts in plink2:
plink2 --vcf snps.vcf --geno-counts 'cols=chrom,pos,ref,alt,homref,refalt,altxy,hapref,hapalt,missing' --allow-extra-chr --chr-set 95
Or use bcftools. First split multiallelic sites into biallelic records using bcftools norm:
bgzip snps.vcf
tabix -fp vcf snps.vcf.gz
bcftools norm -m-any snps.vcf.gz -Oz > snps.norm.vcf.gz
Then generate a CSV report providing genotype counts for each variant:
INPUT=snps.norm.vcf.gz
OUTPUT=snps.norm.vcf.genotype.counts.csv
paste \
<(bcftools view "$INPUT" | awk -F"\t" 'BEGIN {print "CHR\tPOS\tID\tREF\tALT"} !/^#/ {print $1"\t"$2"\t"$3"\t"$4"\t"$5}') \
<(bcftools query -f '[\t%SAMPLE=%GT]\n' "$INPUT" | awk 'BEGIN {print "nHomRef"} {print gsub(/0\|0|0\/0/, "")}') \
<(bcftools query -f '[\t%SAMPLE=%GT]\n' "$INPUT" | awk 'BEGIN {print "nHet"} {print gsub(/0\|1|1\|0|0\/1|1\/0/, "")}') \
<(bcftools query -f '[\t%SAMPLE=%GT]\n' "$INPUT" | awk 'BEGIN {print "nHomAlt"} {print gsub(/1\|1|1\/1/, "")}') \
| sed 's/,\t/\t/g' | sed 's/,$//g' > "$OUTPUT"
In some cases a good alternative is to use bcftools to calculate additional INFO tags from the genotypes. The following adds AC_Hom and AC_Het tags to the VCF file:
bgzip source.vcf
tabix -fp vcf source.vcf.gz
bcftools +fill-tags source.vcf.gz -Oz -o source.additional-fields.vcf.gz -- -t AC_Hom,AC_Het
Use --mendel in plink.
First convert the VCF file to a plink binary fileset:
species=dog
pseudoX_start=340475
pseudoX_end=6642728
mkdir plink_bed
plink --vcf input.vcf --make-bed --allow-extra-chr \
--$species --split-x $pseudoX_start $pseudoX_end \
--out plink_bed/input
Next edit the .fam file in plink_bed to describe family relationships, sex, and case / control status.
The following is from the plink documentation:
A text file with no header line, and one line per sample with the following six fields:
- Family ID ('FID')
- Within-family ID ('IID'; cannot be '0')
- Within-family ID of father ('0' if father isn't in dataset)
- Within-family ID of mother ('0' if mother isn't in dataset)
- Sex code ('1' = male, '2' = female, '0' = unknown)
- Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control)
Use --mendel:
mkdir plink_mendelian_errors
plink --bfile plink_bed/input --mendel --geno 0 --allow-extra-chr \
--$species --out plink_mendelian_errors/input
cat plink_mendelian_errors/input.imendel
grep -c -v '^#' input.ann.vcf
Or:
zgrep -c -v '^#' input.ann.vcf.gz
Use bcftools:
bgzip A.vcf
tabix -fp vcf A.vcf.gz
bgzip B.vcf
tabix -fp vcf B.vcf.gz
bcftools gtcheck -g A.vcf.gz B.vcf.gz
Or use SnpSift:
java -jar SnpSift.jar concordance A.vcf B.vcf
Or use Picard. In this example the comparison is limited to the sample named HOCANF12689774:
java -jar ~/bin/picard.jar GenotypeConcordance \
CALL_VCF=A.vcf \
CALL_SAMPLE=HOCANF12689774 \
O=gc_concordance.vcf \
TRUTH_VCF=B.vcf \
TRUTH_SAMPLE=HOCANF12689774
INPUT=SNPs.vcf
paste \
<(bcftools query -f '[%SAMPLE\t]\n' "$INPUT" | head -1 | tr '\t' '\n') \
<(bcftools query -f '[%GT\t]\n' "$INPUT" | awk -v OFS="\t" '{for (i=1;i<=NF;i++) if ($i == "./.") sum[i]+=1 } END {for (i in sum) print i, sum[i] / NR }' | sort -k1,1n | cut -f 2)
The above produces output like:
sample1 0.956705
sample2 0.124076
sample3 0.0281456
bgzip input.vcf
tabix -fp vcf input.vcf.gz
bcftools view -Oz --min-alleles 2 --max-alleles 2 --types snps --output-file biallelic-snp.vcf.gz input.vcf.gz
bgzip input.vcf
tabix -fp vcf input.vcf.gz
bcftools view -Oz --types indels --output-file indels.vcf.gz input.vcf.gz
In this example there are 5 VCF files in the directory vcfs.
Prepare index files:
cd vcfs
find . -name "*.vcf" -exec bgzip {} \;
find . -name "*.vcf.gz" -exec tabix -p vcf {} \;
Determine the intersection (change 5 to match the number of input files):
mkdir overlaps
bcftools isec -p overlaps \
-n=5 -c all -w1 \
*.vcf.gz
mv overlaps/0000.vcf overlaps/intersection.vcf
Count the number of variants in overlaps/intersection.vcf:
grep -c -v '^#' overlaps/intersection.vcf
Use the bcftools isec command.
Extract records private to A or B comparing by position only:
bcftools isec -p dir -n-1 -c all A.vcf.gz B.vcf.gz
Use the bcftools isec command.
Extract and write records from A shared by both A and B using exact allele match:
bcftools isec -p dir -n=2 -w1 A.vcf.gz B.vcf.gz
Use the bcftools isec command.
bcftools isec --complement -c some \
file1.vcf.gz \
file2.vcf.gz \
-p sites-in-file-1-not-in-file-2
The sites in file1.vcf.gz that are not in file2.vcf.gz can be found in sites-in-file-1-not-in-file-2/0000.vcf.
The -c some causes sites to be considered equivalent when some subset of ALT alleles match. To require identical REF and ALT alleles use -c none. To match based only on position, use -c all.
bgzip input.vcf
tabix -fp vcf input.vcf.gz
bcftools view -Oz --types snps --output-file snp.vcf.gz input.vcf.gz
Note that if the VCF file is gzip compressed (i.e. has a .gz extension), use --gzvcf instead of --vcf.
vcftools --vcf Chr5.vcf --out Chr5_filtered --chr 5 --from-bp 1 --to-bp 100000 --recode --recode-INFO-all
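For example, the same extraction from a compressed file (assuming it is named Chr5.vcf.gz):
vcftools --gzvcf Chr5.vcf.gz --out Chr5_filtered --chr 5 --from-bp 1 --to-bp 100000 --recode --recode-INFO-all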
bgzip Chr5.vcf
tabix -p vcf Chr5.vcf.gz
bcftools view -r 5:1-10000,5:200000-210000 -o output.vcf Chr5.vcf.gz
In this example the regions of interest are stored in a text file called regions.txt. Each line describes the chromosome, start, and end of a region:
3 62148416 62200719
4 54643953 54720351
4 63732381 63795159
5 10163746 10218801
5 10784272 10841310
bgzip input.vcf
tabix -p vcf input.vcf.gz
tabix --print-header -R regions.txt input.vcf.gz > regions_of_interest.vcf
Use SnpSift to filter VCF files.
The following keeps variants that have a FILTER value of PASS:
cat input.vcf | SnpSift filter "( FILTER = 'PASS' )" > input.PASS.vcf
Or, use bcftools:
bcftools view -f PASS input.vcf > input.PASS.vcf
awk -F $'\t' '$1 ~ /^#/ {print; next} $3~/^\./' input.vcf > input.noID.vcf
awk -F $'\t' '$1 ~ /^#/ {print; next} $3~/^\./ {next} {print}' input.vcf > input.ID.vcf
Use bcftools view to filter sites based on an expression. Sites for which the expression is true can be kept using the -i option or excluded using the -e option.
In the following example, sites are excluded (-e) if any sample is homozygous for an alternate allele and has a genotype quality greater than 30 (note that the & operator is used to indicate that both conditions must be met within a single sample):
bcftools view -e 'GT[*]="AA" & GQ[*]>30' SNPs.vcf
In the following example, sites are kept (-i) if the first sample is heterozygous with one reference allele and one alternate allele and the second sample is homozygous for an alternate allele (note that the && operator is used to indicate that the conditions can be met in different samples):
bcftools view -i 'GT[0]="RA" && GT[1]="AA"' SNPs.vcf
SnpSift can also be used to filter VCF files based on sample genotypes.
The following approach can be used to exclude sites where any sample is homozygous, has a genotype quality greater than 30, and has a genotype other than 0/0. Worded another way, a site is kept if no sample exhibits a good-quality homozygous alternate genotype.
In this example there are 6 samples in the VCF file.
First generate the filter string:
FILTER=$(perl -e '@array = (); foreach(0..5) {push @array, "( isHom( GEN[$_] ) & GEN[$_].GQ > 30 & isVariant( GEN[$_] ) )";} print " !( " . join(" | ", @array) . " )\n";')
Examine the filter string:
echo $FILTER
This produces:
!( ( isHom( GEN[0] ) & GEN[0].GQ > 30 & isVariant( GEN[0] ) ) | ( isHom( GEN[1] ) & GEN[1].GQ > 30 & isVariant( GEN[1] ) ) | ( isHom( GEN[2] ) & GEN[2].GQ > 30 & isVariant( GEN[2] ) ) | ( isHom( GEN[3] ) & GEN[3].GQ > 30 & isVariant( GEN[3] ) ) | ( isHom( GEN[4] ) & GEN[4].GQ > 30 & isVariant( GEN[4] ) ) | ( isHom( GEN[5] ) & GEN[5].GQ > 30 & isVariant( GEN[5] ) ) )
Use the filter string and SnpSift.jar to complete the filtering step (the set +H is used so that the ! in the filter string doesn't activate Bash history expansion):
set +H
cat snps.vcf | java -jar SnpSift.jar filter "$FILTER" > snps_with_no_homozygous_alt_genotypes.vcf
set -H
Use SnpSift to filter VCF files that have been annotated using SnpEff.
The following keeps variants that are predicted to have HIGH or MODERATE impacts:
cat input.ann.vcf | SnpSift filter "((ANN[*].IMPACT = 'HIGH') | (ANN[*].IMPACT = 'MODERATE'))" > input.ann.high_or_moderate.vcf
Use --homozyg in plink.
First convert the VCF file to a plink binary fileset:
species=dog
pseudoX_start=340475
pseudoX_end=6642728
mkdir plink_bed
plink --vcf input.vcf --make-bed --allow-extra-chr \
--$species --split-x $pseudoX_start $pseudoX_end \
--out plink_bed/input
Next edit the .fam file in plink_bed to describe family relationships, sex, and case / control status.
The following is from the plink documentation:
A text file with no header line, and one line per sample with the following six fields:
- Family ID ('FID')
- Within-family ID ('IID'; cannot be '0')
- Within-family ID of father ('0' if father isn't in dataset)
- Within-family ID of mother ('0' if mother isn't in dataset)
- Sex code ('1' = male, '2' = female, '0' = unknown)
- Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control)
Next, create a text file containing the Family ID and Within-family ID of each sample to be included in the analysis.
In this example the file roh_samples.txt consists of:
family1 M-2
family4 M-9
family3 M-10
family2 M-15
family2 M-16
family1 M-3
family1 M-4
family1 M-5
Use --homozyg:
mkdir plink_roh
cd plink_roh
plink --bfile ../plink_bed/input \
--homozyg group extend \
--$species \
--keep roh_samples.txt \
--allow-extra-chr \
--homozyg-snp 50 \
--homozyg-kb 100 \
--homozyg-density 2 \
--homozyg-gap 100 \
--homozyg-window-snp 50 \
--homozyg-window-het 2 \
--homozyg-window-missing 10 \
--homozyg-window-threshold 0.05
The plink.hom.overlap file in plink_roh can be filtered to, for example, obtain regions of homozygosity that are found in 5 cases and 0 controls:
awk '{ if ($2 == "CON" && $4 == "5:0") print $5" "$8" "$9 }' plink.hom.overlap > plink.hom.overlap.filtered
Variants in the filtered regions of homozygosity can be extracted from the VCF file:
bgzip input.vcf
tabix -p vcf input.vcf.gz
tabix --print-header -R plink_roh/plink.hom.overlap.filtered \
input.vcf.gz > input.hom.vcf
Suppose a site has one ALT allele and the following genotype counts: 4627 homozygous REF, 429 heterozygous, and 60 homozygous ALT.
The resulting tag values are as follows:
- AC = 549 = 429 + (60 * 2); this is the number of ALT alleles in the genotypes.
- AN = 10232 = (4627 + 429 + 60) * 2; this is the number of alleles in the genotypes.
- AF = AC / AN = 549 / 10232 = 0.054; this is the frequency of the ALT allele.
- AC_Het = 429 = the number of ALT alleles in heterozygous genotypes = the number of heterozygous genotypes.
- AC_Hom = 120 = the number of ALT alleles in homozygous genotypes = 2 * the number of homozygous ALT genotypes.
- MAF = 0.054 = the lesser of AF or 1 - AF.
- NS = 5116 = AN / 2.
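The arithmetic can be checked with a few lines of Bash and awk (a sketch; the genotype counts are the ones given above):
hom_ref=4627; het=429; hom_alt=60
AC=$(( het + 2 * hom_alt ))               # 549
AN=$(( 2 * (hom_ref + het + hom_alt) ))   # 10232
AC_Het=$het                               # 429
AC_Hom=$(( 2 * hom_alt ))                 # 120
NS=$(( AN / 2 ))                          # 5116
AF=$(awk -v ac="$AC" -v an="$AN" 'BEGIN {printf "%.3f", ac/an}')   # 0.054
echo "AC=$AC AN=$AN AF=$AF AC_Het=$AC_Het AC_Hom=$AC_Hom NS=$NS"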
In this example the samples to keep are in a text file called sample_list.txt, one sample per line. The --min-ac option is used to remove sites that are monomorphic in the subset of samples:
bcftools view input.vcf.gz -Oz --samples-file sample_list.txt --min-ac=1 --output-file output.vcf.gz
The following is used to create a new VCF file containing sites from chromosomes 1 to 18, X, Y, and MT. The VCF header is then updated using a sequence dictionary from a reference genome.
The starting VCF is input.vcf.gz. The reference genome is reference.fa. The final output VCF is filtered.reheadered.vcf.gz.
This uses GATK, bcftools, and tabix.
# create index files in case they don't exist
samtools faidx reference.fa
tabix -p vcf input.vcf.gz
# create sequence dictionary
docker pull broadinstitute/gatk
docker run -it --rm \
-u "$(id -u)":"$(id -g)" \
-v "$(pwd)":/directory \
-w /directory \
broadinstitute/gatk \
gatk CreateSequenceDictionary \
-R reference.fa \
-O reference.dict
# extract sites from chromosomes 1 to 18, X, Y, and MT
bcftools view -r 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,X,Y,MT \
input.vcf.gz -o filtered.vcf.gz -O z
tabix -p vcf filtered.vcf.gz
# reheader VCF using sequence dictionary
docker run -it --rm \
-u "$(id -u)":"$(id -g)" \
-v "$(pwd)":/directory \
-w /directory \
broadinstitute/gatk \
gatk UpdateVCFSequenceDictionary \
-V filtered.vcf.gz \
-R reference.fa \
--output filtered.reheadered.vcf.gz \
--replace true
tabix -p vcf filtered.reheadered.vcf.gz
Use bcftools view to filter sites based on an expression. Sites for which the expression is true can be kept using the -i option or excluded using the -e option.
In the following example, sites are kept (-i) if all 10 samples are homozygous for the reference allele:
bcftools view -i 'COUNT(GT="RR")=10' variants.vcf.gz
In the following example, sites are removed (-e) if any sample contains an alternate allele:
bcftools view -e 'GT[*]="alt"' variants.vcf.gz
From the bcftools merge documentation:
Merge multiple VCF/BCF files from non-overlapping sample sets to create one multi-sample file. For example, when merging file A.vcf.gz containing samples S1, S2 and S3 and file B.vcf.gz containing samples S3 and S4, the output file will contain five samples named S1, S2, S3, 2:S3 and S4.
bcftools merge *.vcf.gz -Oz -o merged.vcf.gz
In this example VCF files are merged in batches by submitting jobs to Slurm using sbatch. The advantage of this approach is that the bcftools merge commands are run in parallel, so that results can be obtained more quickly.
The input files for this example are as follows:
21297.haplotypecaller.vcf.gz 23019.haplotypecaller.vcf.gz 23991.haplotypecaller.vcf.gz
21297.haplotypecaller.vcf.gz.tbi 23019.haplotypecaller.vcf.gz.tbi 23991.haplotypecaller.vcf.gz.tbi
21987.haplotypecaller.vcf.gz 23955.haplotypecaller.vcf.gz 24315.haplotypecaller.vcf.gz
21987.haplotypecaller.vcf.gz.tbi 23955.haplotypecaller.vcf.gz.tbi 24315.haplotypecaller.vcf.gz.tbi
First create an sbatch script called merge-vcfs.sbatch (replace someuser with your username):
#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --time=0:30:0
#SBATCH --ntasks=1
#SBATCH --mem=1000M
module load bcftools
module load tabix
output="$1"
shift
bcftools merge "$@" -Oz -o $output.merged.vcf.gz
tabix -p vcf $output.merged.vcf.gz
Note that the resource requirements may need to be adjusted depending on the number of input files to be processed in each group and their sizes.
merge-vcfs.sbatch uses bcftools to merge the input files and tabix to index the resulting merged VCF file. The first argument to merge-vcfs.sbatch is used to construct the output file name, and the remaining arguments are used as the input files.
parallel can then be used to submit jobs that apply merge-vcfs.sbatch to batches of input files. The --dryrun option causes parallel to print out the commands instead of running them. The --delay 1 option inserts a one-second delay between printing or running jobs:
# You will likely want to increase batch_size to something like 50
batch_size=4
parallel --dryrun --delay 1 -j 1 \
-N $batch_size "sbatch -o {#}.merged.out -e {#}.merged.err merge-vcfs.sbatch {#} {} " ::: *.haplotypecaller.vcf.gz
The ::: *.haplotypecaller.vcf.gz and batch_size=4 lead to one sbatch command being constructed per group of four .haplotypecaller.vcf.gz files in the current directory. The final group may contain fewer than four files depending on the total number of input files, which is fine.
Each instance of {#} gets replaced with the job sequence number (1, 2, 3, etc.), and is used to construct unique filenames for each group of input files. The {} is replaced with the names of the files in the current group.
To submit the jobs, run the parallel command again, but without the --dryrun option:
parallel --delay 1 -j 1 \
-N $batch_size "sbatch -o {#}.merged.out -e {#}.merged.err merge-vcfs.sbatch {#} {} " ::: *.haplotypecaller.vcf.gz
Once the jobs are submitted their statuses can be checked using squeue (replace username with your username):
squeue -u username
Each job creates four files, for example:
1.merged.err
1.merged.out
1.merged.vcf.gz
1.merged.vcf.gz.tbi
The .out files and .err files are expected to be empty for these jobs. To quickly check the .err files, which contain any errors or warnings generated by the jobs, use:
cat *.err | more
Once all the jobs are complete, the output VCF files can be merged again to produce the final VCF file. For this merging step it may be necessary to construct a modified version of merge-vcfs.sbatch that requests additional resources. In this example we will use the same merge-vcfs.sbatch script as before:
sbatch -o final.merged.out -e final.merged.err merge-vcfs.sbatch final *.merged.vcf.gz
The final merged file will be named final.merged.vcf.gz.
In this example VCF files are merged in batches by submitting one job array to Slurm using sbatch. Using a job array instead of a large number of separate serial jobs can make it easier to monitor jobs and can make the scheduler run more efficiently.
The input files for this example are as follows:
21297.haplotypecaller.vcf.gz 23019.haplotypecaller.vcf.gz 23991.haplotypecaller.vcf.gz
21297.haplotypecaller.vcf.gz.tbi 23019.haplotypecaller.vcf.gz.tbi 23991.haplotypecaller.vcf.gz.tbi
21987.haplotypecaller.vcf.gz 23955.haplotypecaller.vcf.gz 24315.haplotypecaller.vcf.gz
21987.haplotypecaller.vcf.gz.tbi 23955.haplotypecaller.vcf.gz.tbi 24315.haplotypecaller.vcf.gz.tbi
First create an sbatch script called merge-vcfs-job-array.sbatch (replace someuser with your username):
#!/bin/bash
# SLURM directives
#SBATCH --account=def-someuser
#SBATCH --time=0:30:0
#SBATCH --ntasks=1
#SBATCH --mem=1000M
# Load required modules
module load bcftools
module load tabix
# Script arguments
FILE_LIST="$1"
BATCH_SIZE="$2"
OUTPUT_PREFIX="$3"
# Compute index based on SLURM_ARRAY_TASK_ID
INDEX=$((SLURM_ARRAY_TASK_ID * BATCH_SIZE))
OUTPUT="${SLURM_ARRAY_TASK_ID}.${OUTPUT_PREFIX}"
# Initialize empty array for input files
INPUT_FILES=()
# Populate INPUT_FILES array based on FILE_LIST and computed index
for ((i=0; i<BATCH_SIZE; i++)); do
FILE_TO_ADD="$(sed -n "$((INDEX + i + 1))p" $FILE_LIST)"
if [[ -n $FILE_TO_ADD ]]; then
INPUT_FILES+=("$FILE_TO_ADD")
fi
done
# Merge VCF files using bcftools and index the result with tabix
bcftools merge "${INPUT_FILES[@]}" -Oz -o $OUTPUT.vcf.gz
tabix -p vcf $OUTPUT.vcf.gz
Note that the resource requirements may need to be adjusted depending on the number of input files to be processed in each group and their sizes.
Now perform some calculations and submit the job array:
# You will likely want to increase batch_size to something like 50
batch_size=4
total_files=$(ls *.haplotypecaller.vcf.gz | wc -l)
# This computes the ceiling of total_files/batch_size
num_tasks=$(( (total_files + batch_size - 1) / batch_size ))
echo "Will submit $num_tasks tasks for $total_files files"
# Create file list
ls *.haplotypecaller.vcf.gz > file_list.txt
# Submit the job array
sbatch --array=0-$((num_tasks - 1)) \
-o %a.merged.out \
-e %a.merged.err \
merge-vcfs-job-array.sbatch file_list.txt $batch_size merged
Once the job array is submitted its status can be checked using squeue (replace username with your username):
squeue -u username
Each task in the job creates four files, for example:
0.merged.err
0.merged.out
0.merged.vcf.gz
0.merged.vcf.gz.tbi
The .out files and .err files are expected to be empty for these jobs. To quickly check the .err files, which contain any errors or warnings generated by the jobs, use:
cat *.err | more
Once all the tasks are complete, the output VCF files can be merged again to produce the final VCF file. For this merging step it may be necessary to construct a modified version of merge-vcfs-job-array.sbatch that requests additional resources. In this example we will use the same merge-vcfs-job-array.sbatch script as before:
# Calculate number of files to merge
final_merge_batch_size=$(ls *.merged.vcf.gz | wc -l)
echo "Will submit 1 task for $final_merge_batch_size files"
# Create file list for final merge
ls *.merged.vcf.gz > final_merge_file_list.txt
# Submit a job array with a single task
sbatch --array=0 \
-o %a.final.merged.out \
-e %a.final.merged.err \
merge-vcfs-job-array.sbatch final_merge_file_list.txt $final_merge_batch_size final.merged
The final merged file will be named 0.final.merged.vcf.gz.
SnpSift can generate p-values for different models.
In this example the +++++ specifies that the first five samples are cases, the ------- specifies that the next seven samples are controls, and the 000 specifies that the last three samples should be ignored.
cat input.vcf | java -jar SnpSift.jar caseControl "+++++-------000" > input.case-control.vcf
The results can be filtered based on p-value. The following keeps variants where the p-value under the recessive model is less than 0.001:
cat input.case-control.vcf | java -jar SnpSift.jar filter "CC_REC[0] < 0.001" > input.case-control.sig.vcf
bcftools query -l input.vcf.gz
Or:
bcftools query -l input.vcf
In this example sample GM-2 is removed:
vcftools --remove-indv GM-2 --vcf input.vcf --recode --out output.vcf
bgzip input.vcf
tabix -p vcf input.vcf.gz
bcftools view -G input.vcf.gz -Oz -o output.vcf.gz
Or:
bcftools view -G input.vcf > output.vcf
Use bcftools annotate.
The following removes all INFO fields and all FORMAT fields except for GT and GQ:
bcftools annotate -x INFO,^FORMAT/GT,FORMAT/GQ input.vcf > input_GT_GQ.vcf
Use bcftools view to filter sites based on an expression. Sites for which the expression is true can be kept using the -i option or excluded using the -e option.
In the following example, sites are removed (-e) if all 10 samples are homozygous for the reference allele:
bcftools view -e 'COUNT(GT="RR")=10' variants.vcf.gz
In the following example, sites are kept (-i) if any sample contains an alternate allele:
bcftools view -i 'GT[*]="alt"' variants.vcf.gz
bcftools view -e 'GT[*]="mis"' input.vcf > output.vcf
In this example the new sample names are in the file new_sample_names.txt, one name per line, in the same order as they appear in the VCF file:
bcftools reheader input.vcf.gz --samples new_sample_names.txt --output output.vcf.gz
In this example some annotations are added to the file source.vcf (calculated from genotypes) and then the annotations are transferred to the file input.vcf. The results are written to out.vcf.
The transfer of annotations is done using vcfanno.
This procedure is useful for annotating variants in one file with information obtained, for example, from a much larger population of samples. The resulting tags can be used in downstream filtering operations.
First add some additional annotation fields to source.vcf:
bgzip source.vcf
tabix -fp vcf source.vcf.gz
bcftools +fill-tags source.vcf.gz -Oz -o source.additional-fields.vcf.gz -- -t AC_Hom,AC_Het
Create a config.toml file for vcfanno:
[[annotation]]
file="source.additional-fields.vcf.gz"
fields=["AF", "AC_Hom", "AC_Het"]
names=["source.AF", "source.AC_Hom", "source.AC_Het"]
ops=["self", "self", "self"]
Prepare the input.vcf file:
bgzip input.vcf
tabix -fp vcf input.vcf.gz
Create an index for source.additional-fields.vcf.gz:
tabix -fp vcf source.additional-fields.vcf.gz
Perform the annotation transfer:
vcfanno config.toml input.vcf.gz > out.vcf
The file out.vcf should now include INFO tags source.AF, source.AC_Hom, and source.AC_Het with values calculated from the samples in source.vcf.
To check the value of a setting, in this example the paste setting, use:
:set paste?
Or:
:echo &paste
vimdiff file1 file2
"+y
First, turn on paste mode so that auto-indenting is turned off:
:set paste
Now paste the text.
Then turn off paste mode:
:set nopaste
To allow quick toggling of paste mode using F2, add the following to .vimrc:
set pastetoggle=<F2>
Now you can avoid auto-indenting of pasted text as follows: press F2, paste the text, press F2 again.
:%s/\s\+$//e
In this example search and replace operations are performed on all the .html files in a directory. First, open the files in multiple buffers in vim:
vim *.html
Then use argdo to perform a search and replace across all the files. In this example blank lines are removed:
:argdo %s/^$//ge
In this example the text between <p class="last-updated"> and </p> is replaced with the current date. Note the use of \zs and \ze so that the text between those tags is replaced and not the tags themselves:
:argdo %s/<p class="last-updated">\zs[^<]*\ze<\/p>/\=strftime("%c")/ge
In replacement syntax use \r instead of \n to represent newlines. For example, to replace commas with newlines:
:%s/,/\r/g
In insert mode type Ctrl-v then Tab.
:e ++ff=unix