Writing clean bash scripts

While you don't need to know everything about sh or bash to write functional code, you'll find that, as with R, certain ways of writing code make life easier for you. This tutorial shares some of those methods so you can have fewer headaches along the way. Remember to make your shell scripts executable with chmod +x <script.name>

1. Wrapping things up

The rest of this tutorial centers on using "wrapper" scripts instead of executing things directly on the command line. A wrapper (or shell) script is just an executable text file that contains the commands you want to run, in the order you want to run them. Here's an example of two direct commands that you might enter one after another:

cd /home/user/Music/
mkdir ./Madonna_Greatest_Hits

The wrapper script (let's call it madonna.sh) would be a text file that looks like:

# file: madonna.sh
# this script will create an empty folder for Madonna's Greatest Hits in your /Music folder

cd /home/user/Music/              # change directories
mkdir ./Madonna_Greatest_Hits     # make the empty folder

which you would execute with

./madonna.sh
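To see the whole workflow end to end, here is a minimal sketch that writes the wrapper, makes it executable, and runs it. It assumes a throwaway directory from mktemp stands in for /home/user/Music/, so nothing in your real home folder is touched. The MUSIC_DIR name is made up for this illustration, and the added "|| exit 1" simply stops the script if the cd fails.

```shell
# Sketch only: a temporary directory stands in for /home/user/Music/
MUSIC_DIR="$(mktemp -d)"

# Write the two commands into a wrapper script.
# $MUSIC_DIR is filled in now, while the file is being written.
cat > "$MUSIC_DIR/madonna.sh" <<EOF
# file: madonna.sh
cd "$MUSIC_DIR" || exit 1            # change directories, stop if that fails
mkdir ./Madonna_Greatest_Hits        # make the empty folder
EOF

chmod +x "$MUSIC_DIR/madonna.sh"     # make the script executable
"$MUSIC_DIR/madonna.sh"              # run it

ls "$MUSIC_DIR"                      # shows madonna.sh and the new folder
```

Running the script a second time would make mkdir complain that the folder already exists, which is a handy reminder that a wrapper repeats exactly what you wrote in it.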

So here are the immediate benefits to the wrapper script:

executing madonna.sh will automate all the commands you put within it
you can annotate inside the file, which means you don't need to remember every single little detail weeks, months, or years down the line.

OK, but making empty folders to fill with Madonna's Greatest Hits isn't relevant to what you do.

So let's build on this concept with things that are important and relevant.

2. Adding a shebang

In coding terminology, a "shebang" is a line at the top of a script, starting with #!, that tells the system which language interpreter to use to run your script. In other words, it tells the computer what language your code is in. Every interpreter on a computer is installed somewhere, so a shebang at the top of your script points to the interpreter's location and cuts out the guesswork. If the system can't find the interpreter, the script cannot be run. Computers don't "speak" every coding language out of the box (though common ones are usually installed by default), so you may occasionally need to install one. The tricky thing is that different Unix-like operating systems (macOS, the various Linux distributions) store interpreters in different places. Conveniently, for bash scripts the "universal" shebang is #!/usr/bin/env bash, which looks bash up on your $PATH and should work on most systems (for plain sh scripts, swap in sh). The pattern is similar for ruby (e.g. #!/usr/bin/env ruby) or python (e.g. #!/usr/bin/env python). So let's apply that to madonna.sh:

#!/usr/bin/env bash

# file: madonna.sh
# this script will create an empty folder for Madonna's Greatest Hits in your /Music folder

cd /home/user/Music/              # change directories
mkdir ./Madonna_Greatest_Hits     # make the empty folder

Now that the shebang is there, we don't need to call the script madonna.sh; if we wanted to, we could rename it to just make_madonna_folder without an extension, since the script itself now indicates which interpreter the computer needs to use. Shebangs are how scripts on your $PATH that lack filename extensions, like dDocent or canu, can be called by name.
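As a rough illustration of a shebang at work, the sketch below creates an extension-less script, puts its folder on $PATH, and calls it by name. The script name hello_shebang and the BIN_DIR variable are made up for this example, and a temporary directory stands in for wherever you normally keep your own scripts.

```shell
# Sketch only: a temporary directory stands in for a folder on your $PATH
BIN_DIR="$(mktemp -d)"

# Write an extension-less script whose first line is the shebang.
# The quoted 'EOF' keeps $BASH_VERSION from being filled in at write time.
cat > "$BIN_DIR/hello_shebang" <<'EOF'
#!/usr/bin/env bash
echo "running under bash $BASH_VERSION"
EOF

chmod +x "$BIN_DIR/hello_shebang"      # make it executable

head -n 1 "$BIN_DIR/hello_shebang"     # shows the shebang line
PATH="$BIN_DIR:$PATH"                  # add the folder to $PATH for this session
hello_shebang                          # called by name: no extension, no "bash" in front
```

The PATH change here only lasts for the current terminal session; making it permanent is a separate step that depends on your shell's startup files.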

3. Let's talk variables

Fact is, you probably won't need to automate making a bunch of folders for great 80s music. The kinds of programs we run, like structure, pilon, and vcftools, require a bunch of arguments to tweak run parameters, along with input and output filenames. Often, you will need to run and rerun these programs to get them just right, each time tweaking some things here and there. The most time-saving way to do that is by using variables. For bash scripts, it's as simple as adding the variables between the shebang and the actual code with VARIABLE=VALUE (no spaces around the =), then referencing each one later in the code by adding a dollar sign in front of its name. For example, if we were to set SPICEGIRLS=1997, then you would reference it with $SPICEGIRLS.
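Here is a small sketch of setting and referencing variables, using the SPICEGIRLS example from above. The BAND variable is an extra made up for this illustration, to show a value that contains a space.

```shell
#!/usr/bin/env bash

SPICEGIRLS=1997                 # no spaces around the = sign
BAND="Spice Girls"              # quote values that contain spaces

# Writing "SPICEGIRLS = 1997" (with spaces) would fail, because bash
# would try to run a command called SPICEGIRLS.

echo "$SPICEGIRLS"              # prints 1997
echo "$BAND, $SPICEGIRLS"       # prints Spice Girls, 1997
```

Wrapping references in double quotes, as in "$BAND", keeps a value with spaces together as one piece when it is handed to a command.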

Let's apply this idea to the actual code of an actual analysis. First, let's look at what running a particular program directly on the command line looks like:

~/DBG2OLC/DBG2OLC k 17 AdaptiveTh 0.0001 KmerCovTh 2 MinOverlap 20 Contigs Contigs.txt f ~/assemblytest/Pacbio.fasta RemoveChimera 1

There are a lot of components to that, aren't there? There are several flags and arguments, directories for data, etc. It's kind of sloppy, hard to read, and annoying to change. So, let's convert that into a shell script called assemblytest.sh:

#!/usr/bin/env bash

KMER=17                               # k-mer size
THRESH_VALUE=0.0001                   # adaptive k-mer matching threshold
KCOV=2                                # k-mer coverage threshold
OVERLAP=20                            # minimum overlap between reads
CONTIGS=Contigs.txt                   # input contig file
PACBIO=~/assemblytest/Pacbio.fasta    # input PacBio fasta file

~/DBG2OLC/DBG2OLC  k $KMER \
AdaptiveTh $THRESH_VALUE \
KmerCovTh $KCOV \
MinOverlap $OVERLAP \
Contigs $CONTIGS \
f $PACBIO \
RemoveChimera 1

Let's take a moment to dissect this script, because there are some goodies in here.

You see the shebang at the top, so both we and the system know it's a bash script.
We have created a bunch of variables up at the top, each on its own line and annotated for ourselves.
Each flag (e.g. AdaptiveTh or KmerCovTh) is followed by referencing a variable that was set at the top. As an example, MinOverlap $OVERLAP is the same as MinOverlap 20 because we set OVERLAP=20 above.
Each of the arguments for the DBG2OLC command is separated by \ and begins on a new line, which increases code legibility substantially. The \ at the end of each line lets us start a new line without a break in the command, as long as nothing (not even a space) comes after it on that line.
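The line-continuation behavior can be checked with a harmless command before trusting it in a real analysis. In this sketch, printf stands in for DBG2OLC, and the ONE_LINE and SPLIT variable names are made up for the illustration.

```shell
#!/usr/bin/env bash

OVERLAP=20    # same value as in assemblytest.sh

# The arguments written on a single line
ONE_LINE="$(printf '%s ' MinOverlap $OVERLAP RemoveChimera 1)"

# The same arguments split across two lines with a trailing backslash
SPLIT="$(printf '%s ' MinOverlap $OVERLAP \
RemoveChimera 1)"

echo "$ONE_LINE"
echo "$SPLIT"
[ "$ONE_LINE" = "$SPLIT" ] && echo "identical"   # prints identical
```

Both versions hand printf exactly the same arguments, so splitting a long command this way changes how it reads, not what it does.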

So instead of running that clunky single-line command, we would just run our shell script as:

./assemblytest.sh
