Commit
Merge pull request #17 from NIGMS/google-instance-testing
Google instance testing
kyleoconnell-NIH authored Feb 27, 2025
2 parents 89d2ff0 + 472119b commit 03775ca
Showing 7 changed files with 299 additions and 446 deletions.
22 changes: 20 additions & 2 deletions GoogleCloud/submodule01_Intro_to_terminal.ipynb
@@ -411,7 +411,7 @@
 "source": [
 "### Absolute vs. Relative Paths\n",
 "\n",
-"The command `pwd` returns the **absolute path** to your current working directory, the list of all directories and subdirectories to the get from the current directory to the root or home directory. You can see that absolute paths can get long and unwieldy, especially if you have very detailed or long directory names. \n",
+"The command `pwd` returns the **absolute path** to your current working directory which is the list of all directories and subdirectories to get from the current directory to the root or home directory. You can see that absolute paths can get long and unwieldy, especially if you have very detailed or long directory names. \n",
 "\n",
 "One \"shortcut\" that makes navigating the command line a bit easier is using a **relative path**. A **relative path** uses the directory structure (which we can see in our absolute path returned by the command `pwd`) to move up or down through directories using shortcuts. One very common shortcut is `..` which translates to the directory one level \"above\" your current directory. \n",
 "\n",
@@ -586,7 +586,25 @@
 ]
 }
 ],
-"metadata": {},
+"metadata": {
+"kernelspec": {
+"display_name": "Python 3 (ipykernel) (Local)",
+"language": "python",
+"name": "conda-base-py"
+},
+"language_info": {
+"codemirror_mode": {
+"name": "ipython",
+"version": 3
+},
+"file_extension": ".py",
+"mimetype": "text/x-python",
+"name": "python",
+"nbconvert_exporter": "python",
+"pygments_lexer": "ipython3",
+"version": "3.10.16"
+}
+},
 "nbformat": 4,
 "nbformat_minor": 5
 }
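
The notebook text in this file describes how a relative path such as `..` is interpreted starting from the current working directory. A minimal Python sketch of that idea follows; the directory `/home/jupyter/gcp_research_workflow/data` and the helper name `resolve` are assumptions made up for illustration and do not appear in the notebook.

```python
import posixpath

# Hypothetical working directory, chosen only for illustration
# (this is what `pwd` might print in such a session).
cwd = "/home/jupyter/gcp_research_workflow/data"


def resolve(current_dir, relative_path):
    """Turn a relative path into an absolute one, starting from current_dir."""
    # join() appends the relative path; normpath() collapses "." and "..".
    return posixpath.normpath(posixpath.join(current_dir, relative_path))


# ".." names the directory one level above the current directory.
print(resolve(cwd, ".."))
# A relative path can climb up one level and then descend into a sibling.
print(resolve(cwd, "../results"))
```

The sketch only manipulates path strings, so it runs without those directories existing. In a terminal, `cd ..` performs the same step against the real file system.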
30 changes: 28 additions & 2 deletions GoogleCloud/submodule02_Intro_to_cloud_computing.ipynb
@@ -72,7 +72,7 @@
 "</td></tr> </table>\n",
 "\n",
 "\n",
-"For this simple example we save milliseconds by using parallelization but when you're mapping millions of reads onto a reference genome parallelization significantly speeds up the process. Most modern machines will have a processors with between 6-8 cores, the number of cores dictates the amount of parallelization a task can utilize. However, virtual machines provide access to between 16-64 cores. A process that takes days on your local machine can be completed in hours. As with memory there is a limit to the utility of using more cores, if the task you are performing is not parallelizable or is minimally parallelizable and you build a VM instance with more cores than you can use, you will pay for resources that cannot be leveraged. \n",
+"For this simple example we save milliseconds by using parallelization, but when you're mapping millions of reads onto a reference genome parallelization significantly speeds up the process. Most modern machines will have a processors with between 6-8 cores, the number of cores dictates the amount of parallelization a task can utilize. However, virtual machines provide access to between 16-64 cores. A process that takes days on your local machine can be completed in hours. As with memory there is a limit to the utility of using more cores, if the task you are performing is not parallelizable or is minimally parallelizable and you build a VM instance with more cores than you can use, you will pay for resources that cannot be leveraged. \n",
 "\n",
 "*There is an episode of The Magic School Bus where they have a simple explanation of storage, memory (RAM), and CPUs, season 4 episode 11 The Magic School Bus Gets Programmed*\n",
 "\n",
@@ -154,9 +154,35 @@
 "# The -r flag with cp to indicate we want to copy the directory and all of it's contents\n",
 "gsutil -m cp -r gs://nigms-sandbox/gcp_research_workflow/ ."
 ]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "82d97eaf-1916-410a-adae-a7d9435a775b",
+"metadata": {},
+"outputs": [],
+"source": []
 }
 ],
-"metadata": {},
+"metadata": {
+"kernelspec": {
+"display_name": "Python 3 (ipykernel) (Local)",
+"language": "python",
+"name": "conda-base-py"
+},
+"language_info": {
+"codemirror_mode": {
+"name": "ipython",
+"version": 3
+},
+"file_extension": ".py",
+"mimetype": "text/x-python",
+"name": "python",
+"nbconvert_exporter": "python",
+"pygments_lexer": "ipython3",
+"version": "3.10.16"
+}
+},
 "nbformat": 4,
 "nbformat_minor": 5
 }
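
The notebook text in this file says that adding cores stops paying off when a task is only minimally parallelizable. One common way to reason about that limit is Amdahl's law, which the notebook itself does not name. The sketch below applies it under stated assumptions: the parallel fractions (0.95 and 0.10) are invented for illustration and are not measurements of any read-mapping tool.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Estimated speedup from Amdahl's law.

    parallel_fraction: share of the runtime that can be spread across cores (0 to 1).
    cores: number of cores available to the task.
    """
    serial_fraction = 1.0 - parallel_fraction
    # The serial part always runs on one core; only the parallel part is divided.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Core counts taken from the ranges mentioned in the notebook text.
for cores in (8, 16, 64):
    highly_parallel = amdahl_speedup(0.95, cores)   # assumed fraction
    barely_parallel = amdahl_speedup(0.10, cores)   # assumed fraction
    print(f"{cores:>2} cores: {highly_parallel:5.2f}x vs {barely_parallel:4.2f}x")
```

Under these assumptions the barely parallel task gains almost nothing between 8 and 64 cores, which is the situation where a larger VM costs more without finishing sooner.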
32 changes: 25 additions & 7 deletions GoogleCloud/submodule03_genomics_file_format.ipynb
@@ -27,7 +27,7 @@
 "source": [
 "## Genomic Annotation Data, GFF/GTF file Format\n",
 "------\n",
-"The standard genomic annotation format is called a GFF/GTF file, this is a tab delimited (tsv) file that indicates the positions of genomic features of interest in a genome. \n",
+"The standard genomic annotation format is called a GFF/GTF file. This is a tab delimited (tsv) file that indicates the positions of genomic features of interest in a genome. \n",
 "\n",
 "Features of interest could be genes, transcripts, non-coding RNA, tRNA, rRNA, 3'-UTRs, and more. When building a GFF file you have the option to select the features annotated. Most files include at least genes, transcripts, tRNA, and rRNA. \n",
 "\n",
@@ -368,7 +368,7 @@
 "source": [
 "#### Decoding the Contents of a Fasta File\n",
 "\n",
-"Fasta files containing sequence data can be of many different types. Whole genome sequences, assembled scaffolds from reads, aligned protein sequences, or a collection of coding sequences from an organisms are several examples of the types of data you can find in a fasta file. \n",
+"Fasta files containing sequence data can be of many different types. Whole genome sequences, assembled scaffolds from reads, aligned protein sequences, or a collection of coding sequences from organisms are several examples of the types of data you can find in a fasta file. \n",
 "\n",
 "Hopefully not often, but from time to time you may come across a fasta file you would like to work with, but are unsure of the contents of the file. There are a couple of features of various data types that you can use to determine the contents of the file. \n",
 "\n",
@@ -379,7 +379,7 @@
 "One last feature is the alphabet of letters in the sequence data. As I mentioned earlier there are rare cases where a fasta file encodes phenotypic characters and in those cases you will see mostly binary characters 1 and 0 in the sequence lines, but this is rare. Most sequences will be either nucleotide sequences or protein sequences, and these are pretty easily distinguished by their very different alphabets. The sequences in the top half of the figure above are nucleotide data in the fasta file format, this is clear from the limited sequence alphabet (*ATCG*). Sequences in the bottom half of the figure are in the protein format, indicated by the expanded sequence alphabet (*ACDEFGHIKLMNPQRSTVWY*). \n",
 "\n",
 "\n",
-"I have added two unlabeled fasta files with sequence data to the directory `gcp_research_workflow`. let's look at some of the features we outlined above to determine the contents of each file, and then rename the files so that we have a better system for knowing what each file contains. \n",
+"I have added two unlabeled fasta files with sequence data to the directory `gcp_research_workflow`. Let's look at some of the features we outlined above to determine the contents of each file, and then rename the files so that we have a better system for knowing what each file contains. \n",
 "\n",
 "Start by determining how many lines are in the file with the word count `wc` command using the flag `-l`.\n",
 "\n",
@@ -486,11 +486,11 @@
 "source": [
 "#### head, tail, cat, & more\n",
 "\n",
-"Here we can see that the header lines indicate the species the sequences come from, and there are three different species listed, but there isn't any description of the sequence content in the header beyond that. let's look at some of the sequence data with the `head` and `tail` commands to determine what type of data is in this file. \n",
+"Here we can see that the header lines indicate the species the sequences come from, and there are three different species listed, but there isn't any description of the sequence content in the header beyond that. Let's look at some of the sequence data with the `head` and `tail` commands to determine what type of data is in this file. \n",
 "\n",
 "We could look at the contents of the files with the `cat` command, which prints all lines of the file to the screen. For small files this is okay, but for larger files like fasta files this amount of output can be overwhelming and not very useful. \n",
 "\n",
-"One more way to view the contents of a file is the `more` command which displays data one \"screenfull\" of lines at a time (the number of lines depends on the size of your screen and the font size). This command is interactive so it won't work in the Jupyter notebook, but is very useful for scrolling through data in the terminal window. The `space-bar` key will 'scroll' one line at a time and the `return` key will 'scroll' one screen full of lines at a time. If you would like to stop scrolling and return to your prompt use the `ctrl + C` keys. The `ctrl +C` is a universal signal to the terminal that you would like to abort the command that is currently running."
+"One more way to view the contents of a file is the `more` command which displays data one \"screenfull\" of lines at a time (the number of lines depends on the size of your screen and the font size). This command is interactive so it won't work in the Jupyter notebook but is very useful for scrolling through data in the terminal window. The `space-bar` key will 'scroll' one line at a time and the `return` key will 'scroll' one screen full of lines at a time. If you would like to stop scrolling and return to your prompt use the `ctrl + C` keys. The `ctrl +C` is a universal signal to the terminal that you would like to abort the command that is currently running."
 ]
 },
 {
@@ -746,7 +746,7 @@
 "\n",
 "Now let's determine the number of times the sequencer was not able to determine the base call for the first 10,000 reads. To do this we need to extract the sequence lines from the fastq entries, we will use the `sed` command to isolate the second line (sequence line) of each of the first 10,000 reads.\n",
 "\n",
-"1. `sed`'s' `'p'` argument tells the program we want the output to be printed, and the `-n` option to tell sed we want to suppress automatic printing (so we don't get the results printed 2x). We specify `'2~4p'` as we want sed to *print the 2cnd line, then skip forward 4* to the next line of sequence. \n",
+"1. `sed`'s' `'p'` argument tells the program we want the output to be printed, and the `-n` option to tell sed we want to suppress automatic printing (so we don't get the results printed 2x). We specify `'2~4p'` as we want sed to *print the 2nd line, then skip forward 4* to the next line of sequence. \n",
 "2. `grep` command with the `-o` flag to separate each letter (base pair) onto it's own line \n",
 "3. `sort` command sorts the letters alphabetically\n",
 "4. `grep` again isolates lines with a base call of **N** \n",
@@ -892,7 +892,25 @@
 "source": []
 }
 ],
-"metadata": {},
+"metadata": {
+"kernelspec": {
+"display_name": "Python 3 (ipykernel) (Local)",
+"language": "python",
+"name": "conda-base-py"
+},
+"language_info": {
+"codemirror_mode": {
+"name": "ipython",
+"version": 3
+},
+"file_extension": ".py",
+"mimetype": "text/x-python",
+"name": "python",
+"nbconvert_exporter": "python",
+"pygments_lexer": "ipython3",
+"version": "3.10.16"
+}
+},
 "nbformat": 4,
 "nbformat_minor": 5
 }
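
The notebook text in this file walks through a `sed`/`grep`/`sort` pipeline that counts undetermined base calls (**N**) in the sequence lines of the first 10,000 reads. The sketch below is a rough Python equivalent of that idea, not the notebook's own code. It assumes every FASTQ record is exactly four lines (no wrapped sequences), and the function name `count_n_calls` and the toy records are invented for illustration.

```python
from itertools import islice


def count_n_calls(fastq_lines, max_reads=10_000):
    """Count 'N' base calls in the sequence lines of the first max_reads records."""
    # Take line 2 of each 4-line record: indices 1, 5, 9, ...
    # This mirrors sed's '2~4p' address (start at line 2, step by 4).
    sequence_lines = islice(fastq_lines, 1, None, 4)
    # Only sequence lines are counted, because the quality line of a record
    # may legitimately contain the character 'N' as a quality score.
    return sum(line.count("N") for line in islice(sequence_lines, max_reads))


# Toy records made up for this example: header, sequence, separator, quality.
example = [
    "@read1", "ACGTN", "+", "IIII#",
    "@read2", "NNGTA", "+", "##III",
    "@read3", "ACGTA", "+", "IIIII",
]
print(count_n_calls(example))
```

An open file handle can be passed in place of the list, for example `with open(path) as handle: count_n_calls(handle)`, where `path` points at an uncompressed FASTQ file. Because `islice` reads lazily, only the first 10,000 records are consumed.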

0 comments on commit 03775ca
