Linux Tutorial 2 - Intro to Linux, bash, Perl, R
================================================
by Harry Mangalam <harry.mangalam@uci.edu>
v1.5 - March 12, 2014
:icons:
// fileroot="/home/hjm/nacs/Linux_Tutorial_2"; asciidoc -a icons -a toc2 -b html5 -a numbered ${fileroot}.txt; scp ${fileroot}.html ${fileroot}.txt moo:~/public_html/biolinux; ssh -t moo 'scp ~/public_html/biolinux/Linux_Tutorial_2.[ht]* root@hpc.oit.uci.edu:/data/hpc/www/biolinux/'
// ssh -t moo 'scp ~/public_html/biolinux/Linux_Tutorial_2.[ht]* hmangala@hpc.oit.uci.edu:/data/hpc/www/biolinux/'
// scp ${fileroot}.html ${fileroot}.txt hmangala@hpc.oit.uci.edu:/data/hpc/www/biolinux;
The latest version of this document http://moo.nac.uci.edu/~hjm/biolinux/Linux_Tutorial_2.html[will always be found here].
== Introduction
This tutorial is a slightly more advanced follow-on to the
http://moo.nac.uci.edu/~hjm/biolinux/Linux_Tutorial_1.html[previous self-paced tutorial session] and extends the basic shell commands described therein to some basic data processing with bash, writing 'qsub' scripts, Perl, and the Open Source R package.
As such, it assumes that you're using a Linux system from the
bash shell, you're familiar navigating directories on a Linux system
using *cd*, using the basic shell utilities such as *head, tail, less, and
grep*, and the minimum R system has been installed on your system. If not,
you should peruse a basic introduction to the bash shell. The bash prompt
in this document is shown as *bash $*. The R shell prompt is *>* and R commands will be
prefixed by the *>* to denote it. Inline comments are prefixed by *#* and
can be copied into your shell along with the R or bash commands they
comment - the *#* shields them from being executed, but do NOT copy in the
R or bash shell prompt so mouse carefully.
For a quick introduction to basic data manipulation with Linux, please refer to the imaginatively named
http://moo.nac.uci.edu/~hjm/ManipulatingDataOnLinux.html[Manipulating Data on Linux].
It describes in more detail why you might want to know about various Linux text manipulation utilities (to repair data incompatibilities, to scan data sets for forbidden characters, missing values, or unexpected negative or positive values) - the same things you could do with a spreadsheet and a mouse, but MUCH faster.
== The basics
These are the most frequently used general purpose commands that you'll be using (other than domain-specific applications)
=== Logging In, Out, Setting a Display, Moving Around
==== Local vs Remote
Your laptop is local; HPC is remote. It's easy to get confused on Linux and a Mac sometimes, especially when executing commands which have a local and remote end, like 'scp', 'rsync', and remote displays such as 'X11, vnc, and x2go'.
Also, when you're submitting 'qsub' jobs, remember that 'local' refers to the compute node on which your code is running, not the login node or the storage nodes.
[[x11x2go]]
==== X11 graphics, vnc, nx, x2go
Linux uses the http://en.wikipedia.org/wiki/X11[X11 windowing system] to display graphics. As such it needs a program on your local laptop to interpret the X11 data and render them into pixels. All Linux systems come with this already, since it uses X11 as the native display engine. MacOSX also has an X11 display package, now called http://xquartz.macosforge.org/landing/[XQuartz], which has to be started before you can display X11 graphics. Such systems also exist for Windows, but a better alternative for all of them (if you can get it to work) is the http://wiki.x2go.org[x2go] system which is more efficient than native X11 (which is a very chatty protocol and its performance decays rapidly with network hops).
- Linux: the http://wiki.x2go.org/doku.php/doc:installation:x2goclient[x2go client] seems to work fine, altho the installation will change depending on your distro.
- MacOSX: the http://code.x2go.org/releases/binary-macosx/x2goclient/releases/3.99.2.1/x2goclient-3.99.2.1.dmg[previous version of x2go] seems to work fine *as long as you start XQuartz 1st*.
- Windows: the http://code.x2go.org/releases/binary-win32/x2goclient/releases/4.0.0.3/x2goclient-4.0.0.3-setup.exe[latest version] seems to work 'mostly' OK. If you need a pure X11 client, try the [free Xming package];
http://sourceforge.net/projects/xming/files/Xming-fonts/7.5.0.47/Xming-fonts-7-5-0-47-setup.exe[this package provides a lot more fonts for the system].
==== Logging In/Out
--------------------------------------------------------------------------------
ssh -Y [UCINETID]@hpc.oit.uci.edu # getting to HPC
# the '-Y' tunnels the X11 stream thru your ssh connection
exit # how you exit the login shell
logout # ditto
^D # ditto
--------------------------------------------------------------------------------
==== History & Commandline editing
When you type 'Enter' to execute a command, bash will keep a history of that command. You can re-use that history to recall and edit previous commands
--------------------------------------------------------------------------------
history
history | grep 'obnoxiously long command'
alias hgr="history | grep"
--------------------------------------------------------------------------------
Rather than re-format another page, here is a http://www.math.utah.edu/docs/info/features_7.html[decent guide to commandline editing in bash]. In the very worst case, the arrow keys should work as expected.
==== Where am I? and what's here?
--------------------------------------------------------------------------------
pwd # where am I?
# ASIDE: help yourself: do the following:
# (it makes your prompt useful and tells you where you are)
echo "PS1='\n\t \u@\h:\w\n\! \$ '" >> ~/.bashrc
. ~/.bashrc
ls # what files are here? (tab completion)
ls -l
ls -lt
ls -lthS
alias nu="ls -lt |head -20" # this is a good alias to have
cd # cd to $HOME
cd - # cd to dir you were in last (flip/flop cd)
cd .. # cd up 1 level
cd ../.. # cd up 2 levels
cd dir # also with tab completion
tree # view the dir structure pseudo graphically - try it
tree | less # from your $HOME dir
tree /usr/local |less
mc # Midnight Commander - pseudo graphical file browser, w/ mouse control
du # disk usage
du -shc *
df -h # disk usage (how soon will I run out of disk)
--------------------------------------------------------------------------------
=== DirB and bookmarks.
DirB is a way to bookmark directories around the filesystem so you can 'cd' to them without all the typing.
It's [described here] in more detail and requires minimal setup, but after it's done you can do this:
--------------------------------------------------------------------------------
hmangala@hpc:~ # makes this horrible dir tree
512 $ mkdir -p obnoxiously/long/path/deep/in/the/guts/of/the/file/system
hmangala@hpc:~
513 $ cd !$ # cd's to the last string in the previous command
cd obnoxiously/long/path/deep/in/the/guts/of/the/file/system
hmangala@hpc:~/obnoxiously/long/path/deep/in/the/guts/of/the/file/system
514 $ s jj # sets the bookmark to this dir as 'jj'
hmangala@hpc:~/obnoxiously/long/path/deep/in/the/guts/of/the/file/system
515 $ cd # takes me home
hmangala@hpc:~
516 $ g jj # go to the bookmark
hmangala@hpc:~/obnoxiously/long/path/deep/in/the/guts/of/the/file/system
517 $ # ta daaaaa
--------------------------------------------------------------------------------
=== Permissions: chmod & chown
Linux has a Unix heritage so everything has an owner and a set of permissions. When you ask for an 'ls -l' listing, the 1st column of data lists the following:
--------------------------------------------------------------------------------
$ ls -l |head
total 14112
-rw-r--r-- 1 hjm hjm 59381 Jun 9 2010 a64-001-5-167.06-08.all.subset
-rw-r--r-- 1 hjm hjm 73054 Jun 9 2010 a64-001-5-167.06-08.np.out
-rw-r--r-- 1 hjm hjm 647 Apr 3 2009 add_bduc_user.sh
-rw-r--r-- 1 hjm hjm 1342 Oct 18 2011 add_new_claw_node
drwxr-xr-x 2 hjm hjm 4096 Jun 11 2010 afterfix/
|-+--+--+-
| | | |
| | | +-- other permissions
| | +----- group permissions
| +-------- user permissions
+---------- directory bit
drwxr-xr-x 2 hjm hjm 4096 Jun 11 2010 afterfix/
| | | +-- other can r,x
| | +----- group can r,x
| +-------- user can r,w,x the dir
+---------- it's a directory
# now change the 'mode' of that dir using 'chmod':
chmod -R o-rwx afterfix
||-+-
|| |
|| +-- change all attributes
|+---- (minus) remove the attribute characteristic
| can also add (+) attributes, or set them (=)
+----- other (everyone other than user and explicit group)
$ ls -ld afterfix
drwxr-x--- 2 hjm hjm 4096 Jun 11 2010 afterfix/
# Play around with the chmod command on a test dir until you understand how it works
--------------------------------------------------------------------------------
'chown' (change ownership) is more direct; you specifically set the ownership to what
you want, altho on HPC, you'll have limited ability to do this since 'you can only change
your group to another group of which you're a member'. You can't change ownership of
a file to someone else, unless you're root.
--------------------------------------------------------------------------------
$ ls -l gromacs_4.5.5.tar.gz
-rw-r--r-- 1 hmangala staff 58449920 Mar 19 15:09 gromacs_4.5.5.tar.gz
^^^^^
$ chown hmangala.stata gromacs_4.5.5.tar.gz
$ ls -l gromacs_4.5.5.tar.gz
-rw-r--r-- 1 hmangala stata 58449920 Mar 19 15:09 gromacs_4.5.5.tar.gz
^^^^^
--------------------------------------------------------------------------------
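Besides the symbolic form shown above, 'chmod' also accepts octal modes, where each digit encodes user/group/other as the sum of r=4, w=2, x=1. A quick sketch (the dir and file names here are made up for the demo):

```shell
mkdir -p testdir                 # example dir & file, just for illustration
touch testdir/datafile
chmod 750 testdir                # user=rwx(4+2+1) group=r-x(4+1) other=---(0)
chmod 644 testdir/datafile       # user=rw-(4+2)   group=r--(4)   other=r--(4)
stat -c '%a %n' testdir testdir/datafile   # prints the octal modes back
# 750 testdir
# 644 testdir/datafile
```

'chmod 750' and 'chmod -R o-rwx' style commands can be mixed freely; use whichever form you find easier to read.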
=== Moving, Editing, Deleting files
These are utilities that create and destroy files and dirs. *Deletion on Linux is not warm and fuzzy*. It is quick, destructive, and irreversible. It can also be recursive.
.Warning: Don't joke with a Spartan
[WARNING]
==================================================================================
Remember the movie '300' about Spartan warriors? Think of Linux utilities like Spartans. Don't joke around. They don't have a great sense of humor and they're trained to obey without question. A Linux system will commit suicide if you ask it to.
==================================================================================
--------------------------------------------------------------------------------
rm my/thesis # instantly deletes my/thesis
alias rm="rm -i" # Please God, don't let me delete my thesis.
# alias logout="echo 'fooled ya'" # you can alias the name of an existing utility to anything.
# unalias is the anti-alias.
mkdir dirname # for creating dirname
rmdir dirname # for destroying dirname if empty
cp from/here to/there # COPIES from/here to/there
mv from/here to/there # MOVES from/here to/there (from/here is deleted!)
file this/file # what kind of file is this/file?
nano/joe/vi/vim/emacs # terminal text editors
gedit/nedit/jedit/xemacs # GUI editors
--------------------------------------------------------------------------------
=== STDIN STDOUT STDERR: Controlling data flow
These are the input/output channels that Linux provides for moving data between you and your programs:
- *STDIN*, usually attached to the keyboard. You type, it goes thru STDIN and shows up on STDOUT
- *STDOUT*, usually attached to the terminal screen. Shows both your STDIN stream and the program's STDOUT stream as well as ...
- *STDERR*, also usually connected to the terminal screen, which, as you might guess, sometimes causes problems when STDOUT and STDERR are both writing to the screen.
--------------------------------------------------------------------------------
wc < a_file # wc reads a_file on STDIN and counts lines, words, characters
ls -1 > a_file # ls -1 creates & writes to 'a_file' on STDOUT
ls -1 >> a_file # ls -1 appends to 'a_file' on STDOUT
someprogram 2> the_stderr_file # the STDERR creates and writes into 'the_stderr_file'
someprogram &> stdout_stderr_file # both STDERR and STDOUT are captured in 'stdout_stderr_file'
# A real example
module load tacg # module sets up all the PATHS, etc for tacg
tacg -n6 -slLc -S -F2 < tacg-4.6.0-src/Seqs/hlef.seq | less # what does this do?
# vs
tacg -n6 -slLc -S -F2 < tacg-4.6.0-src/Seqs/hlef.seq > less # what does this do?
tacg -n6 -slLc -S -F2 < tacg-4.6.0-src/Seqs/hlef.seq > out # what does this do?
--------------------------------------------------------------------------------
==== Pipes
You can redirect all these channels via various currently incomprehensible incantations, but this is a really important concept, http://www.tldp.org/LDP/abs/html/io-redirection.html[explained in more detail here]. A program's STDOUT output can be connected to another program's STDIN input via a *pipe* which is represented in Linux as *|* (vertical bar).
For example:
--------------------------------------------------------------------------------
ls -1 # prints all the files in the current dir 1 line at a time
wc # counts the lines, words, characters passed to it via STDIN
ls -1 | wc # the number of lines, etc that 'ls -1' provides, or the # of files in the dir
--------------------------------------------------------------------------------
This is a powerful concept in Linux that doesn't have a good parallel in Windows.
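As a slightly longer sketch, several small tools can be chained so that each stage reads the previous stage's STDOUT on its STDIN - here answering "which file extensions are most common?" (the filenames are made up for the demo):

```shell
# list some names, slice out the extension, sort, count duplicates, rank
printf 'a.txt\nb.txt\nc.csv\n' \
  | awk -F. '{print $NF}' \
  | sort \
  | uniq -c \
  | sort -rn
#   2 txt
#   1 csv
```

In practice you'd feed the chain from 'ls -1' or 'find' rather than 'printf'; the point is that no single tool answers the question, but the pipe makes the combination trivial.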
=== Viewing & Slicing Data
==== Pagers, head & tail
'less' & 'more' are pagers, used to view text files. In my opinion, 'less' is better than 'more', but both will do the trick.
--------------------------------------------------------------------------------
less somefile # try it
alias less='less -NS' # is a good setup (number lines, scroll for wide lines)
head -### file2view # head views the top ### lines of a file, tail views the bottom
tail -### file2view # tail views the ### bottom lines of a file
tail -f file2view # keeps dumping the end of the file if it's being written to.
---------------------------------------------------------------------------------
==== Concatenating files
Sometimes you need to concatenate / aggregate files; for this, 'cat' is the cat's meow.
--------------------------------------------------------------------------------
cat file2view # dumps it to STDOUT
cat file1 file2 file3 > file123 # or concatenates multiple files to STDOUT, captured by '>' into file123
--------------------------------------------------------------------------------
==== Slicing out columns, rectangular selections
'cut' and http://moo.nac.uci.edu/~hjm/scut_cols_HOWTO.html[scut] allow you to slice out columns of data by acting on the 'tokens' by which they're separated. A 'token' is just the delimiter between the columns, typically a space or <tab>, but it could be anything, even a regex. 'cut' only allows single characters as tokens, 'scut' allows any regex as a token.
--------------------------------------------------------------------------------
cut -f# -d[delim char] # cuts out the #th field (counts from 1)
scut -f='2 8 5 2 6 2' -d='pcre delim' # cuts out whatever fields you want;
# allows renumbering, repeating, with a 'perl compatible regular expression' delimiter
--------------------------------------------------------------------------------
Use http://moo.nac.uci.edu/~hjm/scut_cols_HOWTO.html#_the_cols_utility[cols] to view data aligned to columns.
--------------------------------------------------------------------------------
cols < MS21_Native.txt | less # aligns the top lines of a file to view in columns
# compare with
less MS21_Native.txt
--------------------------------------------------------------------------------
Many editors allow columnar (rectangular) selections, and for small selections this may be the best approach.
.Linux editors that support rectangular selection
[options="header"]
|========================================================================================
|Editor |Rectangular Select Activation
|nedit |Ctrl+Lmouse = column select
|jedit |Ctrl+Lmouse = column select
|kate |Shift+Ctrl+B = block mode, have to repeat to leave block mode.
|emacs |dunno - emacs is more a lifestyle than an editor but it can be done.
|vim |Ctrl+v puts you into visual selection mode.
|========================================================================================
==== Finding file differences and verifying identity
Quite often you're interested in the differences between 2 related files, or in verifying that the file you sent is the same one that arrived. 'diff' and especially the GUI wrappers (diffuse, kompare) can tell you instantly.
--------------------------------------------------------------------------------
diff file1 file1a # shows differences between file1 and file1a
diff hlef.seq hlefa.seq # on hpc
md5sum files # lists MD5 hashes for the files
# md5sum is generally used to verify that files are identical after a transfer.
# md5 on MacOSX, <http://goo.gl/yCIzR> for Windows.
--------------------------------------------------------------------------------
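A typical verification round-trip with 'md5sum' looks like this (the file names are illustrative):

```shell
printf 'important data\n' > payload    # a stand-in for your real file
md5sum payload > payload.md5           # record the hash before the transfer
# ...transfer 'payload' and 'payload.md5', then on the receiving end:
md5sum -c payload.md5                  # prints 'payload: OK' if the file is intact
```

If even one bit changed in transit, 'md5sum -c' reports 'FAILED' instead.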
=== The grep family
Sounds like something blobby and unpleasant and sort of is, but it's VERY powerful.
http://en.wikipedia.org/wiki/Regex[Regular Expressions] are formalized patterns. As such they are not exactly easy to read at first, but it gets easier with time.
The simplest form is called http://en.wikipedia.org/wiki/Glob_(programming)[globbing] and is used within bash to select files that match a particular pattern
--------------------------------------------------------------------------------
ls -l *.pl # all files that end in '.pl'
ls -l b*.pl # all files that start with 'b' & end in '.pl'
ls -l b*p*.*l # all files that start with 'b' & have a 'p' & end in 'l'
--------------------------------------------------------------------------------
Looking at nucleic acids, can we encode this degenerate site into a regex? (IUPAC codes: y=[ct], r=[ag], n=any base, w=[at]):
gyrttnnnnnnngctww = g[ct][ag]tt[acgt]{7}gct[at][at]
--------------------------------------------------------------------------------
grep regex files # look for a regular expression in these files.
grep -rin regex * # recursively look for this case-INsensitive regex in all files and
# dirs from here down to the end and number the lines.
grep -v regex files # invert search (everything EXCEPT this regex)
egrep "thisregex|thatregex" files # search for 'thisregex' OR 'thatregex' in these files
egrep "AGGCATCG|GGTTTGTA" hlef.seq
# gnome-terminal allows searching in output, but not as well as 'konsole'
--------------------------------------------------------------------------------
http://www.regular-expressions.info/quickstart.html[This is a pretty good quickstart resource for learning more about regexes].
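The nucleic acid pattern above can be tested directly with 'egrep'; the sequence below was made up to contain exactly one match:

```shell
# g[ct][ag]tt, then any 7 bases ([acgt]{7}), then gct + two a/t bases
echo "aaagcattacgtacggctattaa" | egrep 'g[ct][ag]tt[acgt]{7}gct[at][at]'
# the matching line is echoed back (highlighted on most terminals)
echo "aaagcattacgtacggctattaa" | egrep -c 'g[ct][ag]tt[acgt]{7}gct[at][at]'
# 1   (count of matching lines)
```

Breaking the match out: 'gcatt' fits g[ct][ag]tt, 'acgtacg' is the 7 arbitrary bases, and 'gctat' fits gct[at][at].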
=== Getting Data to and from HPC
--------------------------------------------------------------------------------
scp paths/to/sources/* user@host:/path/to/target # always works on Linux/MacOSX
scp ~/nacs/nco-4.2.5.tar.gz hmangala@hpc.oit.uci.edu:~
wget URL # drops file in current dir
wget http://hpc.oit.uci.edu/biolinux/data/tutorial_data/hlef.seq
curl URL # spits out file to STDOUT
curl http://hpc.oit.uci.edu/biolinux/data/tutorial_data/hlef.seq # ugh...
curl -O http://hpc.oit.uci.edu/biolinux/data/tutorial_data/hlef.seq # better
# What does this do then?
curl http://hpc.oit.uci.edu/biolinux/data/tutorial_data/numpy-1.6.1.tar.gz | tar -xzvf -
# rsync 'syncs' files between 2 places by only transferring the bits that have changed.
# (there is also a GUI version of rsync called 'grsync' on HPC.)
# the following will not work for you (I hope) since you don't have my passwords
# modify it to try an rsync to your Mac or to another directory on HPC.
rsync -avz nco-4.2.5 hjm@moo:~/backups/hpc/nco-4.2.5
echo "dfh lkhslkhf" >> nco-4.2.5/configure
rsync -avz nco-4.2.5 hjm@moo:~/backups/hpc/nco-4.2.5 # what's the difference?
--------------------------------------------------------------------------------
==== tar and zip archives
--------------------------------------------------------------------------------
tar -czvf tarfile.gz files2archive # create a compressed 'tarball'
tar -tzvf tarfile.gz # list the files in a compressed 'tarball'
tar -xzvf tarfile.gz # extract a compressed 'tarball'
tar -xzvf tarfile.gz included/file # extract a specific 'included/file' from the archive.
# Also, archivemount to manipulate files while still in an archive
# zip is similar, but has a slightly different syntax:
zip zipfilename files2archive # zip the 'files2archive' into the 'zipfilename'
unzip -l zipfilename.zip # list the files that are in zipfilename.zip without unzipping them
unzip zipfilename.zip # unzip the 'zipfilename.zip' archive into the current dir.
--------------------------------------------------------------------------------
There are also a number of good GUI file transfer apps such as http://winscp.net/[WinScp] and http://cyberduck.ch/[CyberDuck] that allow drag'n'drop file transfer between panes of the application.
=== Info about & Controlling your jobs
--------------------------------------------------------------------------------
top # lists which are the top CPU-consuming jobs on the node
ps # lists all the jobs which match the options
ps aux # all jobs
ps aux | grep hmangala # all jobs owned by hmangala
alias psg="ps aux | grep"
kill -9 JobPID# # kill off your job by PID
--------------------------------------------------------------------------------
=== Your terminal sessions
You may be spending a lot of time in the terminal session and sometimes the terminal just screws up. If so, you can try typing 'clear' or 'reset' which should reset it.
You will often find yourself wanting multiple terminals to hpc. You can usually open multiple tabs on your terminal but you can also use the 'byobu' app to multiplex your terminal 'inside of one terminal window'. https://help.ubuntu.com/community/Byobu[Good help page on byobu here.]
The added advantage of using 'byobu' is that the terminal sessions that you open will stay active after you 'detach' from them (usually by hitting 'F6'). This allows you to maintain sessions across logins, such as when you have to sleep your laptop to go home. When you start 'byobu' again at HPC, your sessions will be exactly as you left them.
.A 'byobu' shell is not quite the same as using a direct terminal connection
[NOTE]
==========================================================================================
Because 'byobu' invokes some deep magic to set up the multiple screens, X11 graphics invoked from a
'byobu'-mediated window will 'sometimes' not work, depending on how many levels of shell you've descended. Similarly, 'byobu' traps mouse actions so things that might work in a direct connection (mouse control of 'mc') will not work in a 'byobu' shell. Also some line characters will not format properly. Always tradeoffs...
==========================================================================================
=== Background and Foreground
Your jobs can run in the 'foreground' attached to your terminal, or detached in the 'background', or simply 'stopped'.
Deep breath.....
- a job runs in the 'foreground' unless sent to the 'background' with '&' when started.
- a 'foreground' job can be 'stopped' with 'Ctrl+z' (think zap or zombie)
- a 'stopped' job can be started again with 'fg'
- a 'stopped' job can be sent to the 'background' with 'bg'
- a 'background' job can be brought to the foreground with 'fg'
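A minimal sketch of the sequence above ('Ctrl+z', 'bg', and 'fg' only make sense at an interactive prompt, so they appear here as comments):

```shell
sleep 30 &        # '&' backgrounds the job; bash prints its job number and PID
jobs              # list this shell's jobs, e.g.  [1]+ Running  sleep 30 &
kill $!           # '$!' is the PID of the most recent background job
# interactively you would use:
#   Ctrl+z   stop (suspend) the current foreground job
#   bg       resume the stopped job in the background
#   fg       bring the most recent job back to the foreground
```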
If you were going to run a job that takes a long time to run, you could run it in the background with this command.
--------------------------------------------------------------------------------
tar -czf gluster-sw.tar.gz gluster-sw & # This would run the job in the background immediately
...
[1]+ Done tar -czvf gluster-sw.tar.gz gluster-sw
tar -czvf gluster-sw.tar.gz gluster-sw & # Why would this command be sub-optimal?
--------------------------------------------------------------------------------
HOWEVER, for most long-running jobs, you will be submitting the jobs to the scheduler to run in 'batch mode'. See link:#qsub[here for how to set up a qsub run].
=== Finding files with 'find' and 'locate'
Even the most organized among you will occasionally lose track of where your files are. You can generally find them on HPC by using the 'find' command:
--------------------------------------------------------------------------------
# choose the nearest dir you remember the file might be and then direct find to use that starting point
find [startingpoint] -name filename_pattern
# ie: (you can use globs but they have to be 'escaped' with a '\')
find gluster-sw/src -name config\*
gluster-sw/src/glusterfs-3.3.0/argp-standalone/config.h
gluster-sw/src/glusterfs-3.3.0/argp-standalone/config.h.in
gluster-sw/src/glusterfs-3.3.0/argp-standalone/config.log
gluster-sw/src/glusterfs-3.3.0/argp-standalone/config.status
gluster-sw/src/glusterfs-3.3.0/argp-standalone/configure
gluster-sw/src/glusterfs-3.3.0/argp-standalone/configure.ac
gluster-sw/src/glusterfs-3.3.0/xlators/features/marker/utils/syncdaemon/configinterface.py
# 'locate' will work on system files, but not on user files. Useful for looking for libraries,
# but probably not in the module files
locate libxml2 |head # try this
# Also useful for searching for libs is 'ldconfig -v', which searches thru the LD_LIBRARY_PATH
ldconfig -v |grep libxml2
--------------------------------------------------------------------------------
=== Modules
'Modules' are how we maintain lots of different applications with multiple versions without (much) confusion. In order to use a particular application, you have to load its module (with the specific version if you don't want the latest one).
--------------------------------------------------------------------------------
module load app/version # load the module
module whatis app # what does it do?
module rm app # remove this module (doesn't delete the module, just removes the paths to it)
module purge # removes ALL modules loaded (provides you with a pristine environment)
--------------------------------------------------------------------------------
== VERY simple bash programming
These code examples do not use the 'bash $' prefix since they're essentially stand-alone. The bash commands that are interspersed with the R commands later are so prefixed.
=== bash variables
Remember variables from math? A variable is a symbol that can hold the value of something else. In most computer languages (including 'bash') a variable can contain:
- numeric values (156, 83773.34, 3.5e12, -2533)
- strings ("booger", "nutcase", "24.334", "and even this phrase")
- lists or arrays ([12 25 64 98735 72], [238.45 672.6645 0.443 -51.002] or ["if" "you" "have" "to" "ask" "then" "maybe" "..."]).
Note that in lists or arrays, the values usually are of the same type (integers, floats, strings, etc). Most languages also allow the use of more highly complex data types (often referred to as data structures, objects, dataframes, etc). Even 'bash' allows you to do this, but it's so ugly that you'd be better off gouging out your own eyeballs. Use one of Perl, Python, R, Java, etc.
All computer languages allow comments. Often (bash, perl, python, R) the comment
indicator is a # which means that anything after the # is ignored.
[source,bash]
-------------------------------------------------------
thisvar="peanut" # note the spacing
thisvar = "peanut" # what happened? In bash, spacing/whitespace matter
thatvar="butter"
echo thisvar
# ?? what happened? Now..
echo $thisvar # what's the difference?
# note that in some cases, you'll have to protect the variable name with {}
echo $somevar_$thatvar # what's the difference between this
echo ${somevar}_${thatvar} # and this?
-------------------------------------------------------
You can capture the results of system commands in variables if the commands are wrapped in '$( )':
-------------------------------------------------------
seq 1 5 # what does this do?
filecount=$(seq 1 5)
echo $filecount # what's the difference in output?
dirlist=$(ls -1)
echo $dirlist
-------------------------------------------------------
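The lists/arrays mentioned earlier also work in bash: the syntax is parentheses around space-separated elements, indexed from 0.

```shell
sizes=(12 25 64 98735 72)              # an array of integers
words=("if" "you" "have" "to" "ask")   # an array of strings
echo ${sizes[0]}                       # first element: 12
echo ${words[4]}                       # fifth element: ask
echo ${#sizes[@]}                      # number of elements: 5
echo ${sizes[@]}                       # all elements at once
```

Beyond this, bash arrays get ugly quickly - as noted above, reach for Perl, Python, or R when you need real data structures.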
=== Looping with bash
Simple bash loops are very useful
[source,bash]
-------------------------------------------------------
for file in *.txt; do
grep noodle $file
done
-------------------------------------------------------
=== Iterating with loops
[source,bash]
-------------------------------------------------------
for outer in $(seq 1 5); do
for inner in $(seq 1 2); do
# try this with and without variable protection
echo "processing file bubba_${inner}_${outer}"
done
done
-------------------------------------------------------
You can also use this to create formatted numbers (with leading 0's):
[source,bash]
-------------------------------------------------------
for outer in $(seq -f "%03g" 1 5); do
for inner in $(seq -f "%02g" 1 2); do
# try this with and without variable protection
echo "processing file bubba_${inner}_${outer}"
done
done
-------------------------------------------------------
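If all you need are the padded numbers, bash's built-in printf does the same job without calling seq at all; a sketch:

[source,bash]
-------------------------------------------------------
for i in 1 2 3; do
  printf 'bubba_%03d\n' "$i"   # %03d pads to 3 digits with leading zeros
done
-------------------------------------------------------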
There are 1000's of pages of bash help available. http://www.tldp.org/LDP/abs/html/index.html[Here's a good one].
For very simple actions and iterations, especially for applying a set of commands to a set of files, bash is very helpful. For more complicated programmatic actions, I strongly advise using Perl or Python.
[[qsub]]
== Simple 'qsub' scripts
For all jobs that take more than a few seconds to run, you'll be submitting them to the Grid Engine Scheduler. In order to do that, you have to write a short bash script that describes what resources your job will need, how you want your job to produce output, how you want to be notified, what modules your job will need, and perhaps some data staging instructions. Really not hard.
Note that in qsub scripts, the GE directives are 'protected' by '#$'. The 1st '#' means that as far as 'bash' is concerned, they're comments and thus ignored. It's only GE that will pay attention to the directives behind '#$'.
Note that using 'bash' variables can make your scripts much more reliable, readable, and maintainable.
A qsub script consists of 4 parts:
- the 'shebang' line, the name of the shell that will execute the script (almost always '#!/bin/bash')
- comments, prefixed with '#', that are inserted to document what your script does (both to yourself and to us, if you ask for help)
- the GE directives, prefixed by '#$' which set up various conditions and requirements for your job with the scheduler
- the 'bash' commands that actually describe what needs to be done
.A simple self-contained example qsub script
[source,bash]
-------------------------------------------------------
#!/bin/bash
# which Q to use?
#$ -q _______
# the qstat job name
#$ -N SLEEPER_1
# use the real bash shell
#$ -S /bin/bash
# mail me ...
#$ -M hmangala@uci.edu
# ... when the job (b)egins, (e)nds
#$ -m be
echo "Job begins"
date
sleep 30
date
echo "Job ends"
-------------------------------------------------------
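Once saved (say, as 'sleeper.sh'; any filename works), you hand the script to the scheduler with 'qsub' and watch it with 'qstat'. A typical session looks roughly like this (the job number is of course made up):

-------------------------------------------------------
bash $ qsub sleeper.sh
Your job 123456 ("SLEEPER_1") has been submitted
bash $ qstat -u $USER   # 'qw' = waiting in the queue, 'r' = running
-------------------------------------------------------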
.A generic qsub script
[source,bash]
-------------------------------------------------------
#!/bin/bash
# specify the Q run in
#$ -q asom
# the qstat job name
#$ -N RedPstFl
# use the real bash shell
#$ -S /bin/bash
# execute the job out of the current dir and direct the error
# (.e) and output (.o) to the same directory ...
#$ -cwd
# ... Except, in this case, merge the output (.o) and error (.e) into one file
# If you're submitting LOTS of jobs, this halves the filecount and also allows
# you to see what error came at what point in the job. Recommended unless there's
# good reason NOT to use it.
#$ -j y
#!! NB: the .e and .o files noted above are the STDERR and STDOUT, not necessarily
#!! the programmatic output that you might be expecting. That output may well be sent to
#!! other files, depending on how the app was written.
# mail me (hmangala@uci.edu)
#$ -M hmangala@uci.edu
# when the job (b)egins, (e)nds, (a)borts, or (s)uspends
#$ -m beas
############## here
#!! Now note that the commands are bash commands - no more hiding them behind '#$'
# set an output dir in ONE place on the LOCAL /scratch filesystem (data doesn't cross the network)
# note - no spaces before or after the '='
OUTDIR=/scratch/hmangala
# set an input directory.
INDIR=/som/hmangala/guppy_analysis/2.3.44/broad/2013/12/46.599
# and final results dir in one place
FINALDIR=/som/hmangala/guppy/r03-20-2013
MD5SUMDIR=${HOME}/md5sums/guppy/r03-20-2013/md5sums
# make output dirs in the local /scratch and in data filesystem
# I assume that the INDIR already exists with the data in it.
mkdir -p $OUTDIR $FINALDIR
# load the required module
module load guppy/2.3.44
# and execute this command
guppy --input=${INDIR}/input_file --outdir=${OUTDIR} -topo=flat --tree --color=off --density=sparse
# get a datestamp, translating spaces to underscores so it's safe in a filename
DATESTAMP=$(date | tr ' ' '_')
# generate md5 checksums of all output data to be able to
# check for corruption if anything happens to the filesystem, god forbid.
md5deep -r $OUTDIR > ${MD5SUMDIR}/md5sums
# copy the md5sums to the output dir to include it in the archive
cp ${MD5SUMDIR}/md5sums ${FINALDIR}
# mail the md5sums to yourself for safekeeping
cat ${MD5SUMDIR}/md5sums | mail -s 'md5sums from HPC' hmangala@uci.edu
# after it finishes, tar up all the data
tar -czf ${FINALDIR}/${DATESTAMP}.tar.gz ${OUTDIR}
# and THIS IS IMPORTANT!! Clean up behind you.
rm -rf $OUTDIR
-------------------------------------------------------
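As an aside, 'date' can emit a filesystem-safe stamp directly via a format string, which avoids the tr step entirely; a minimal sketch:

[source,bash]
-------------------------------------------------------
# e.g. 2013-03-20_14.27.05 : no spaces to translate or quote around
DATESTAMP=$(date +%Y-%m-%d_%H.%M.%S)
echo "$DATESTAMP"
-------------------------------------------------------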
== A few seconds with Perl
We're going to be stepping thru a simple Perl script below, but in order to make it even simpler, here is the 'core' of the program, and it tends to be the core of a huge number of Perl scripts.
But 1st, a word about words .. and 'delimiters' or 'tokens'.
-------------------------------------------------------
The rain in Spain falls mainly on the plain.\n
0 1 2 3 4 5 6 7 8
-------------------------------------------------------
Similarly:
-------------------------------------------------------
-rw-r--r-- 1 root root 69708 Mar 13 22:02 ,1
-rw-r--r-- 1 root root 69708 Mar 13 22:02 ,2
drwxr-xr-x 2 root root 4096 Feb 5 17:07 3ware
-rw------- 1 root root 171756 Dec 13 18:15 anaconda-ks.cfg
-rw-r--r-- 1 root root 417 Feb 22 15:38 bigburp.pl
drwxr-xr-x 2 root root 4096 Mar 26 14:27 bin
drwxr-xr-x 6 root root 4096 Mar 22 15:08 build
-rw-r--r-- 1 root root 216 Feb 22 15:38 burp.pl
drwxr-xr-x 6 root root 4096 Mar 26 15:07 cf
0 1 2 3 4 5 6 7 8
-------------------------------------------------------
The above outputs all have 9 text fields (numbered 0 thru 8) and CAN be separated on spaces (' ').
But what happens in the case below?
-------------------------------------------------------
-rw-r--r-- 1 root root 216 Feb 22 15:38 burp.pl\n
_ _ _ _______ _ _ _ _
-------------------------------------------------------
Breaking on single 'spaces' will yield more than 9 fields, since each extra space contributes an empty field.
Breaking on 'whitespace' will result in the expected 9 fields.
And in this one?
-------------------------------------------------------
The rain in Spain falls mainly on the plain.\n
-------------------------------------------------------
Note the leading space? That counts as well; splitting this line will yield an empty leading field.
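You can watch the difference directly with Perl one-liners (assuming perl is on your PATH); each prints the number of fields it gets back:

[source,bash]
-------------------------------------------------------
line='The  rain in Spain'   # note the double space after 'The'
# splitting on a single literal space yields an empty field for the extra space
perl -e '$n = () = split(/ /, $ARGV[0]); print "$n\n"' "$line"    # prints 5
# splitting on runs of whitespace (\s+) counts only the real words
perl -e '$n = () = split(/\s+/, $ARGV[0]); print "$n\n"' "$line"  # prints 4
-------------------------------------------------------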
.The Core Perl 'while' loop - the *lineeater*
[source,perl]
-------------------------------------------------------
#!/usr/bin/perl -w
while (<>) { # while there's still STDIN, read it line by line
# $N gets the # of fields; @A is the array that gets populated.
# 'splitter' is a placeholder for your delimiter regex (e.g. /\s+/)
$N = @A = split(/splitter/, $_); # $_ holds the current input line
# do something with the @A array and generate $something_useful
print "$something_useful\n";
}
-------------------------------------------------------
.The setup
[source,bash]
-------------------------------------------------------
# from your account on HPC
wget http://moo.nac.uci.edu/~hjm/biolinux/justright.pl
chmod +x justright.pl
# and execute it as below:
ls -l /usr/bin | ./justright.pl 12121 131313
# what does it do? Let's break it down.. The first part is..
ls -l /usr/bin | cols -d='\s+' | less # what does this do?
# Change the input parameters and try it again to see if you can tell.
-------------------------------------------------------
Well, we have the source code below, so we don't have to guess. Feel free to open it in an editor, modify it, and try it again.
.justright.pl
[source,perl]
-------------------------------------------------------
#!/usr/bin/perl -w
$min = shift; # assigns 1st param (12121) to $min
$max = shift; # assigns 2nd param (131313) to $max
while (<>){ # consumes each line in turn and processes it below
chomp; # removes the trailing newline (\n)
@A = split(/\s+/, $_); # splits $_ on whitespace and loads each el into an array named A
if ($A[4] < $min) { # if $A[4] is too small...
print "ERR: too small ($A[4]):\t[$A[8]]\n"; # print an error
} elsif ($A[4] > $max) { # if it's too big...
print "ERR: too big ($A[4]):\t[$A[8]]\n"; # print an error
} else {
print "INFO: just right:\t[$A[8]]\n"; # it's just right
print "$_\n\n"; # so print the whole line
}
}
-------------------------------------------------------
There are 1000's of pages of Perl help available. http://learn.perl.org/[Here's a good one].
== The R Language
R is a programming language designed specifically for statistical computation. As such, it has many of the characteristics of a general-purpose language including iterators, control loops, network and database operations, many of which are useful, but in general not as easy to use as the more general http://en.wikipedia.org/wiki/Python_(programming_language)[Python] or http://en.wikipedia.org/wiki/Perl[Perl].
R can operate in 2 modes:
- as an interactive interpreter (the R shell, much like the bash shell), in which you start the shell and type interactive commands. This is what we'll be using today.
- as a scripting language, much like Perl or Python is used, where you assemble serial commands and have the R interpreter run them in a script. Many of the bioinformatics 'applications' written in R have this form.
Typically the R shell is used to try things out and the serial commands saved in a file are used to automate operations once the sequence of commands is well-defined and debugged. In approach, this is very similar to how we use the bash shell.
== R is Object Oriented
While R is not a great language for procedural programming, it does excel at mathematical and statistical manipulations. However, it does so in an odd way, especially for those who have done procedural programming before. R is quite 'object-oriented' (OO) in that you tend to deal with 'data objects' rather than with individual integers, floats, arrays, etc. The best way to think of R data if you have programmed in 'C' is to think of R data (typically termed 'tables' or 'frames') as C 'structs', arbitrary and often quite complicated structures that can be dealt with by name. If you haven't programmed in a procedural language, it may actually be easier for you because R manipulates chunks of data similar to how you might think of them. For example, 'multiply column 17 by 3.54' or 'pivot that spreadsheet'.
The naming of R's data objects looks similar to other OO naming systems, but note that in R the "." is just another character in a name; by convention it often separates words (as in 'data.matrix'). Sub-objects (the elements of a list or the columns of a data frame) are accessed with the '$' operator. So a complex data structure named 'hsgenome' might have sub-objects named chr1, chr2, etc, and each of these would have a length associated with it. For example 'hsgenome$chr1$length' might equal 250,837,348, while 'hsgenome$chr1$geneid341$exon2$length' might equal 4347. Obvious, no?
For a still-brief but more complete overview of R's history, see http://en.wikipedia.org/wiki/R_(programming_language)[Wikipedia's entry for R].
== Available Interfaces for R
R was developed as a commandline language. However, it has gained progressively more graphics capabilities and graphical user interfaces (GUIs). Some notable examples:
- the once 'de facto' standard R GUI http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/[R Commander],
- the fully graphical statistics package http://gretl.sourceforge.net/[gretl] which was developed external to R for time-series econometrics but which now supports R as an external module
- http://www.walware.de/goto/statet[Eclipse/StatET], an Eclipse-based Integrated Development Environment (IDE)
- http://www.rstudio.com/[Rstudio], probably the most advanced R GUI, another full IDE, already available on HPC
- the extremely powerful interactive graphics utility http://www.ggobi.org[ggobi], which is not an IDE, but a GUI for multivariate analysis and data exploration.
As opposed to fully integrated commercial applications, which have a cohesive and coherent interface, these packages differ slightly in how they approach getting things done, but in general they follow accepted user-interface conventions.
In addition, many routines in R are packaged with their own GUIs, so that when called from the commandline interface, a GUI will pop up and allow you to interact with the mouse rather than the keyboard.
== Getting help on R
As noted below, when you start R, the 1st screen will tell you how to get local help.
.Don't copy in the prompt string
[NOTE]
=================================================================
Don't forget that the listed prompts *bash $* for bash and *>* for R are
NOT meant to be copied into the shell.
i.e. if the line reads:
-----------------------------------------------------------------
bash $ R # more comments on R's startup state below
-----------------------------------------------------------------
you copy in only *R*.
You can copy in the comment as well (the stuff after the *#*).
It will be ignored by bash, Perl, and R.
=================================================================
-----------------------------------------------------------------
bash $ R # more comments on R's startup state below
# most startup text deleted ...
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> help.start()
-----------------------------------------------------------------
So right away, there are some useful things to try. By all means try out 'demo()'; it shows you what R can do, even if you don't yet know exactly how to do it.
And the 'help.start()' will attempt to launch a browser page (from the machine on which R is running, HPC in the case of this tutorial) with a lot of links to R documentation, both introductory and advanced.
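A few other standard help entry points are worth knowing from the start:

-----------------------------------------------------------------
> ?mean                       # shorthand for help(mean), when you know the name
> help.search("correlation")  # search the docs when you don't
> example(mean)               # run the worked examples from a help page
-----------------------------------------------------------------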
See the link:#resources[Resources] section at the bottom for many more links to R-related information.
[[datatypes]]
== Some basic R data types
Before we get too far, note that R has a variety of data structures that may be required for different functions.
Some of the main ones are:
* 'numeric' - (aka double) the default double-precision (64bit) float representation for a number.
* 'integer' - a single-precision (32bit) integer.
* 'single' - a single precision (32bit) float
* 'string' - any character string ("harry", "r824_09_hust", "888764"). Note the last is a number defined as a string so you couldn't add "888764" (a string) + 765 (an integer).
* 'vector' - a series of data elements of the same type. (3, 5, 6, 2, 15) is a vector of integers; ("have", "third", "start", "grep") is a vector of strings.
* 'matrix' - an array of *identical* data types - all numerics, all strings, all booleans, etc.
* 'data.frame' - a table (aka array) of mixed column types: e.g. a column of strings, a column of integers, a column of booleans, 3 columns of numerics.
* 'list' - a concatenated series of data objects.
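In the R shell you can always check what you're holding with 'class()'; a couple of standard examples:

-----------------------------------------------------------------
> x <- c(3.2, 5.1, 6.0)    # c() builds a vector
> class(x)
[1] "numeric"
> df <- data.frame(name=c("a","b"), count=c(10L, 20L))
> class(df)
[1] "data.frame"
-----------------------------------------------------------------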
== Starting R (finally)
We have several versions of R on the HPC cluster, but unless you want to start an older version (in which case you have to specify the version), just type:
-----------------------------------------------------------------
bash $ module load R
# if you then wanted to start Rstudio (and you had set up the necessary graphics options)
# you would then type:
bash $ rstudio &
# or if you want the commandline R
bash $ R # dumps version info, how to get help, and finally the R prompt
R version 2.15.2 (2012-10-26) -- "Trick or Treat"
Copyright (C) 2012 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
>
-----------------------------------------------------------------
== Some features of the R interpreter
R supports commandline editing in exactly the same way as the bash shell does. Arrow keys work, command history works, etc.
There is a very nice feature when you try to exit from R: it will offer to save the entire environment for you so it can be recreated the next time you log in. If you're dealing with fairly small data sets (up to a couple hundred MB), it's often handy to save the environment. If your variables total tens or hundreds of GB, it may not be worth it, since all that data has to be saved to disk somewhere (duplicating the data in your input files) and then read back in on restart, which may take about as long as reading the original files.
The key in the latter (huge data) case is to keep careful records of the commands that worked to transition from one stage of data to another.
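You can also save and restore an environment explicitly, rather than only at quit time, with R's standard functions:

-----------------------------------------------------------------
> save.image("mysession.RData")      # write every current variable to one file
> rm(list=ls())                      # wipe the environment ...
> load("mysession.RData")            # ... and get it all back
> savehistory("mysession.Rhistory")  # the commands themselves, for your records
-----------------------------------------------------------------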
If you get the line shown below:
-----------------------------------------------------------------
[Previously saved workspace restored]
-----------------------------------------------------------------
it indicates that you had saved the previous data environment and if you type
*ls()* you'll see all the previously created variables:
-----------------------------------------------------------------
> ls()
[1] "ge" "data.matrix" "fil_ss" "i" "ma5"
[6] "mn" "mn_tss" "norm.data" "sg5" "ss"
[11] "ss_mn" "sss" "tss" "t_ss" "tss_mn"
[16] "x"
-----------------------------------------------------------------
If the previous session was not saved, there will be no saved variables in the environment
-----------------------------------------------------------------
> ls()
character(0) # denotes an empty character vector (= nothing there)
-----------------------------------------------------------------
OK, now let's quit out of R and NOT save the environment
-----------------------------------------------------------------
> quit()
Save workspace image? [y/n/c]: n
-----------------------------------------------------------------
== Loading data into R
Let's get some data to load into R. Download http://moo.nac.uci.edu/~hjm/biolinux/GE_C+E.data[this file]
OK. Got it? Take a look thru it with 'less'
--------------------------------------------------------------------
bash $ less GE_C+E.data
# Notice how the columns don't line up? Try it again with cols
bash $ cols GE_C+E.data |less
# Notice that the columns now line up?
--------------------------------------------------------------------
Loading data into R is described in much more detail in the document http://cran.r-project.org/doc/manuals/R-data.pdf[R Data Import/Export], but the following
will give you a short example of one of the more popular ways of loading data into R.