Writing clean bash scripts
While you don't need to know everything about sh or bash to write functional code, you'll find that, as with R, there are certain ways to write code that end up making life easier for you. This tutorial is meant to share some of those methods so you can have fewer headaches along the way. Remember to make your shell scripts executable with chmod +x <script.name>
The rest of this tutorial centers on using "wrapper" scripts instead of executing commands directly on the command line. A wrapper (or shell) script is just an executable text file that contains the commands you want to run, in the order you want to run them. Here's an example of two direct commands that you might enter one after another:
cd /home/user/Music/
mkdir ./Madonna_Greatest_Hits
The wrapper script (let's call it madonna.sh) would be a text file that could look like:
# file: madonna.sh
# this script will create an empty folder for Madonna's Greatest Hits in your /Music folder
cd /home/user/Music/ # change directories
mkdir ./Madonna_Greatest_Hits # make the empty folder
which you would execute with
./madonna.sh
So here are the immediate benefits to the wrapper script:
- executing madonna.sh will automate all the commands you put within it
- you can annotate inside the file, which means you don't need to remember every single little detail weeks, months, or years down the line.
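To see the whole workflow in one place, here is a minimal sketch that writes the wrapper script, makes it executable, and runs it. It assumes a throwaway directory created with mktemp (the MUSIC_DIR variable is hypothetical and stands in for /home/user/Music/). It also swaps in mkdir -p and a guarded cd, which are not in the original script, so that it can be rerun safely.

```shell
# Sketch only: MUSIC_DIR is a hypothetical stand-in for /home/user/Music/
MUSIC_DIR=$(mktemp -d)

# Write the wrapper script into the throwaway directory
cat > "$MUSIC_DIR/madonna.sh" <<EOF
# file: madonna.sh
cd "$MUSIC_DIR" || exit 1          # change directories, stop if that fails
mkdir -p ./Madonna_Greatest_Hits   # make the empty folder, no error if it already exists
EOF

chmod +x "$MUSIC_DIR/madonna.sh"   # make the script executable
"$MUSIC_DIR/madonna.sh"            # run it
ls "$MUSIC_DIR"                    # list what is in the directory now
```

Running the script a second time is harmless here, because mkdir -p does not complain when the folder already exists.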
So let's build on this concept with a few things that are important and relevant.
In coding terminology, a "shebang" is a line at the very top of a script, starting with #!, that tells the system which interpreter to use to run your script. In other words, it tells the computer what language your code is in. Every language interpreter on a computer is installed somewhere, so a shebang points to the interpreter's location and cuts out the guesswork. If the system can't find the interpreter, the script cannot be run. Common interpreters are usually installed by default, but you may occasionally need to install one yourself.
The tricky thing is that different Unix-like operating systems (macOS, the various Linux distributions) store interpreters in different places. Conveniently, for bash scripts the "universal" shebang is #!/usr/bin/env bash, which looks bash up in your $PATH and should work on most systems (for a strictly POSIX sh script, #!/bin/sh is the usual choice). It is similar for ruby (e.g. #!/usr/bin/env ruby) or python (e.g. #!/usr/bin/env python, or python3 on systems that no longer provide a plain python command). So let's apply that to madonna.sh:
#!/usr/bin/env bash
# file: madonna.sh
# this script will create an empty folder for Madonna's Greatest Hits in your /Music folder
cd /home/user/Music/ # change directories
mkdir ./Madonna_Greatest_Hits # make the empty folder
Now that the shebang is there, the script no longer needs to be called madonna.sh. If we wanted to, we could rename it to just make_madonna_folder without an extension, since the script itself now indicates which interpreter the computer needs to use. Shebangs are how extensionless scripts on your $PATH, like dDocent or canu, can be called by name. (Compiled programs such as apt also have no extension, but they don't need a shebang.)
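As a rough illustration of calling an extensionless script by name, the sketch below puts a small script with a shebang into a directory, adds that directory to $PATH, and runs the script without a path or extension. BIN_DIR is a hypothetical temporary directory, and the script body is a placeholder rather than the real madonna.sh.

```shell
#!/usr/bin/env bash
# Sketch only: BIN_DIR is a hypothetical stand-in for a directory on your $PATH
BIN_DIR=$(mktemp -d)

# Quoted 'EOF' keeps $BASH_VERSION from being expanded while writing the file
cat > "$BIN_DIR/make_madonna_folder" <<'EOF'
#!/usr/bin/env bash
echo "running under bash $BASH_VERSION"
EOF
chmod +x "$BIN_DIR/make_madonna_folder"

PATH="$BIN_DIR:$PATH"            # add the directory to the search path for this session
command -v bash                  # shows which bash the env lookup will find
OUTPUT=$(make_madonna_folder)    # called by name: no extension, no "bash" prefix
echo "$OUTPUT"
```

The system reads the first line of make_madonna_folder, finds bash through env, and hands the file to it, which is why the filename itself no longer matters.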
The fact is, you probably won't need to automate making a bunch of folders for great 80s music. The kinds of programs we run, like structure, pilon, and vcftools, require a bunch of arguments to tweak run parameters, along with input and output filenames. Often you will need to run and rerun them to get things just right, tweaking a few settings each time. The most time-saving way to do that is by using variables. For bash scripts, it's as simple as defining the variables between the shebang and the actual code with VARIABLE=VALUE (no spaces around the =), then referencing them later in the code by putting a dollar sign in front of the variable name. For example, if we were to set SPICEGIRLS=1997, then we would reference it with $SPICEGIRLS.
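Here is a small sketch of setting and referencing variables. SPICEGIRLS comes from the example above, while ALBUM and MESSAGE are hypothetical names added only for illustration.

```shell
#!/usr/bin/env bash
SPICEGIRLS=1997                  # no spaces around the = sign
ALBUM="Spice World"              # hypothetical value; quote values that contain spaces

MESSAGE="$ALBUM came out in $SPICEGIRLS"
echo "$MESSAGE"                  # prints: Spice World came out in 1997
echo "${SPICEGIRLS}_remaster"    # braces mark where the variable name ends; prints: 1997_remaster
```

Wrapping a reference in double quotes, as in "$ALBUM", keeps a value with spaces together as a single word when it is passed to a command.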
Let's apply this idea to the actual code of an actual analysis. First, let's look at what a direct command line execution of a particular script looks like:
~/DBG2OLC/DBG2OLC k 17 AdaptiveTh 0.0001 KmerCovTh 2 MinOverlap 20 Contigs Contigs.txt f ~/assemblytest/Pacbio.fasta RemoveChimera 1
There are a lot of components to that, aren't there? There are several flags and arguments, paths to data, and so on. It's hard to read, and changing individual pieces is annoying. So, let's convert that into a shell script called assemblytest.sh:
#!/usr/bin/env bash
KMER=17 # k-mer size
THRESH_VALUE=0.0001 # adaptive k-mer matching threshold
KCOV=2 # fixed k-mer matching threshold
OVERLAP=20 # minimum overlap score between a pair of long reads
CONTIGS=Contigs.txt # input contig file
PACBIO=~/assemblytest/Pacbio.fasta # input PacBio fasta file
~/DBG2OLC/DBG2OLC k $KMER \
AdaptiveTh $THRESH_VALUE \
KmerCovTh $KCOV \
MinOverlap $OVERLAP \
Contigs $CONTIGS \
f $PACBIO \
RemoveChimera 1
- You see the shebang at the top, so both we and the system know it's a bash script.
- We have created a bunch of variables at the top, each on its own line and annotated for ourselves.
- Each flag (e.g. AdaptiveTh or KmerCovTh) is followed by a reference to a variable that was set at the top. As an example, MinOverlap $OVERLAP is the same as MinOverlap 20 because we set OVERLAP=20 above.
- Each of the arguments for the DBG2OLC command is separated by \ and begins on a new line, which increases code legibility substantially. The \ at the end of each line (it must be the very last character on the line) lets us start a new line without a break in the command.
As before, you would then execute the script with:
./assemblytest.sh
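To check that a command split across lines with \ really is the same command, one option is a dry run in which echo stands in for the real program. The sketch below is an assumption-laden illustration, not part of the original analysis: it uses only a subset of the flags, and ONE_LINE and MULTI_LINE are hypothetical variable names.

```shell
#!/usr/bin/env bash
OVERLAP=20            # same value as in assemblytest.sh
CONTIGS=Contigs.txt   # same value as in assemblytest.sh

# echo stands in for DBG2OLC so the sketch runs without the program installed
ONE_LINE=$(echo DBG2OLC MinOverlap "$OVERLAP" Contigs "$CONTIGS" RemoveChimera 1)

# The same command, one argument pair per line, joined by trailing backslashes
MULTI_LINE=$(echo DBG2OLC \
    MinOverlap "$OVERLAP" \
    Contigs "$CONTIGS" \
    RemoveChimera 1)

echo "$MULTI_LINE"    # prints: DBG2OLC MinOverlap 20 Contigs Contigs.txt RemoveChimera 1
[ "$ONE_LINE" = "$MULTI_LINE" ] && echo "same command"
```

Putting echo in front of a long command is also a quick way to see exactly what the variables expand to before committing to a long run.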