Merge pull request #11 from NIGMS/AWS&GCP
Updated architectural diagrams
kyleoconnell-NIH authored Nov 1, 2024
2 parents 7b747b7 + 6d621aa commit 9eebc4b
Showing 7 changed files with 13 additions and 95 deletions.
4 changes: 2 additions & 2 deletions AWS/README.md
@@ -17,7 +17,7 @@ Included here are several submodules or tutorials in the form of Jupyter noteboo

### Creating a notebook instance

Follow the steps highlighted [here](https://github.com/NIGMS/NIGMS-Sandbox/blob/main/docs/HowToCreateAWSSagemakerNotebooks.md) to create a new notebook instance in AWS SageMaker. For this module you should select Linux 2 and Python 3 in the Environment. In the Notebook instance type tab, select ml.m5.xlarge from the dropdown box. It is **important to shut down** the kernel at the end of your work to avoid getting charged.
Follow the steps highlighted [here](https://github.com/NIGMS/NIGMS-Sandbox/blob/main/docs/HowToCreateAWSSagemakerNotebooks.md) to create a new notebook instance in AWS SageMaker. For this module you should select Linux 2 and Python 3 in the Environment. In the Notebook instance type tab, select ml.t3.2xlarge from the dropdown box and set volume size to 20 GB. It is **important to shut down** the kernel at the end of your work to avoid getting charged.

To use our module, open a new Terminal window from your new notebook instance and clone this repo using `git clone https://github.com/NIGMS/Fundamentals-of-Bioinformatics/AWS.git`. Navigate to the directory for this project. You will then see the notebooks in your environment.

@@ -39,7 +39,7 @@ multiqc

## **Architecture Design**

![workflow diagram](images/updated_Dartmouth_AD.png)
![workflow diagram](images/updated_Dartmouth_AD.svg)

As seen in the image above, we will download sequence files from the AWS S3 bucket to our SageMaker virtual machine. We will practice running BASH commands using the sequence files in the bucket, as well as practice downloading sequence data from the SRA. Using the Conda package manager, we will install and use FastQC, MultiQC, SRA tools, Spades, and Prokka to analyze data from the SRA. Lastly, we will create a new AWS S3 bucket and copy our analyzed data to the new bucket. We explain our submodules that execute these processes here:
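The bucket-to-bucket flow described above can be sketched with the AWS CLI (the bucket and path names below are hypothetical placeholders, not the module's actual buckets):

```shell
# Download the example sequence files from the course bucket to this machine
# (bucket names are illustrative placeholders)
aws s3 cp s3://example-course-bucket/raw_fastq/ ./raw_fastq/ --recursive

# ... run FastQC, MultiQC, SPAdes, and Prokka on the downloaded files ...

# Create a new bucket for results and copy the analyzed data into it
aws s3 mb s3://example-results-bucket
aws s3 cp ./analysis_results/ s3://example-results-bucket/results/ --recursive
```

The notebooks walk through each of these steps in detail; this is only the overall shape of the workflow.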

Binary file removed AWS/images/updated_Dartmouth_AD.png
Binary file not shown.
1 change: 1 addition & 0 deletions AWS/images/updated_Dartmouth_AD.svg
100 changes: 8 additions & 92 deletions AWS/submodule06_putting_it_all_together.ipynb
@@ -13,10 +13,10 @@
"Now that you have been through most of the lessons, believe it or not, you have the basics of the skills you need for most bioinformatic analyses. At this point you should be able to interact with your data through the terminal interface, use the help flag to learn how to use and customize BASH commands, build complex commands with the pipe and execute those commands on a set of files with loops and BASH scripts, recognize the data structure of several common genomic file formats, and install software with the Conda package manager. \n",
"\n",
"\n",
"## Creating a Conda Environment for Genome Assembly\n",
"## Genome Assembly\n",
"----------\n",
"\n",
"Best practices for creating conda environments are to create an environment for each analysis task that you need to perform. Here we will create two conda environments, one for genome assembly and a second for genome annotation using the `conda create` command. Your assembly environment will contain software for downloading sequences from the SRA (`sra-tools`), checking the quality of the raw data we download (`fastqc`), and assembling fastq reads into a complete genome (`spades`). Your annotation environment will contain software for annotating the assembly (`prokka`)."
"Best practice is to create a separate environment for each analysis task that you need to perform. However, here we will install packages into the base environment due to the complexity of creating conda environments on AWS SageMaker. For genome assembly, you will install software for downloading sequences from the SRA (`sra-tools`), checking the quality of the raw data you download (`fastqc`), and assembling fastq reads into a complete genome (`spades`). For annotation, you will install software for annotating the assembly (`prokka`)."
]
},
{
@@ -28,91 +28,8 @@
"source": [
"%%bash\n",
"\n",
"# create a conda environment called assembly with all required software\n",
"conda create -n assembly --channel bioconda python=3.9 ipykernel fastqc sra-tools spades -y"
]
},
{
"cell_type": "markdown",
"id": "cae160f6-55c2-41c4-a4e7-7adfcdec4e6d",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-info\">\n",
" <i class=\"fa fa-lightbulb-o\" aria-hidden=\"true\"></i>\n",
"    <b>Tip: </b> Remember, if the conda env isn't showing up after a couple of minutes, you can try running the following command to update the kernels available in your session. Don't forget to switch to the Python 3 kernel to run the code; otherwise you will get a \"No module named nb_conda_kernels\" error.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "de4d4da5",
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"conda install -c conda-forge nb_conda_kernels -y"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "46d5fd0a-ed71-4fbf-82fa-0d40cbb3533a",
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"python -m nb_conda_kernels list"
]
},
{
"cell_type": "markdown",
"id": "aaaca03d-a968-4346-9ca0-41695f911ab8",
"metadata": {},
"source": [
"We will need two separate environments because of a dependency conflict between our assembly environment and the tool `prokka`; this is an excellent demonstration of why it is helpful to have multiple well-organized conda environments."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e177b248",
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"# Create a second conda environment called annotation \n",
"conda create -n annotation --channel bioconda python ipykernel perl"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "41da74fb",
"metadata": {},
"outputs": [],
"source": [
"%%bash \n",
"\n",
"# Install required software \n",
"\n",
"conda install bioconda::perl-bioperl\n",
"\n",
"\n",
"conda install bioconda::prokka"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "96b0a573-59f7-433b-83db-991c0c62c79b",
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"# Create a second conda environment called annotation with all required software\n",
"# conda create -n annotation --channel bioconda python=3.9 ipykernel prokka perl-bioperl -y"
"# Install all required assembly and annotation software\n",
"mamba install -c bioconda fastqc sra-tools spades perl perl-bioperl prokka -y"
]
},
{
@@ -147,7 +64,7 @@
"----------\n",
"The Sequence Read Archive (SRA) is a database hosted by the National Center for Biotechnology Information (NCBI) as a long-term storage repository for raw sequencing data. These data are often fairly large (depending on the organism being sequenced) and can take up a lot of space, so it is handy to be able to download the files when you need them, delete them when you are done, and retrieve them again later if necessary.\n",
"\n",
"Here we are downloading the sequence of a SARS-CoV-2 sample collected in western New Hampshire in March 2022. Make sure you are on the **assembly** kernel."
"Here we are downloading the sequence of a SARS-CoV-2 sample collected in western New Hampshire in March 2022. "
]
},
{
@@ -160,9 +77,9 @@
"%%bash\n",
"\n",
"# Pull the raw fastq files you need from the SRA\n",
"prefetch -v SRR18241034\n",
"aws s3 sync s3://sra-pub-run-odp/sra/SRR18241034/ SRR18241034/ --no-sign-request\n",
"\n",
"fastq-dump --outdir assembly_test/ --split-files SRR18241034/SRR18241034.sra\n",
"fasterq-dump --outdir assembly_test/ --split-files SRR18241034/SRR18241034\n",
"\n",
"# Remove the prefetch directory now that you are finished with it\n",
"rm -r SRR18241034"
@@ -287,8 +204,7 @@
"\n",
"\n",
"After annotation, Prokka returns FASTA files containing both protein (.faa) and nucleotide (.fna) sequences, as well as an annotation file (.gff) of annotated coding features. We have not used this software in these lessons, but try to use the manual to determine what flags might be helpful to include in your annotation. \n",
"\n",
"Don't forget to change the conda environment you have loaded from assembly to **annotation**. In the notebook you can use the kernel selection window in the top right corner. In the terminal window use the command `conda deactivate` to leave one conda environment and `conda activate annotation` to enter another conda environment. \n"
" \n"
]
},
{
2 changes: 1 addition & 1 deletion GoogleCloud/README.md
@@ -42,7 +42,7 @@ dependencies:

## **Workflow Diagrams**

![workflow diagram](images/updated_Dartmouth_AD.png)
![workflow diagram](images/updated_Dartmouth_AD.svg)

As seen in the image above, we will download sequence files from the Google bucket to our Vertex AI virtual machine. We will practice running BASH commands using the sequence files in the bucket, as well as practice downloading sequence data from the SRA. Using the Conda package manager, we will install and use FastQC, MultiQC, SRA tools, Spades, and Prokka to analyze data from the SRA. Lastly, we will create a new Google bucket and copy our analyzed data to the new bucket. We explain our submodules that execute these processes here:
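The bucket-to-bucket flow described above can be sketched with `gsutil` (the bucket and path names below are hypothetical placeholders, not the module's actual buckets):

```shell
# Download the example sequence files from the course bucket to this machine
# (bucket names are illustrative placeholders)
gsutil -m cp -r gs://example-course-bucket/raw_fastq ./

# ... run FastQC, MultiQC, SPAdes, and Prokka on the downloaded files ...

# Create a new bucket for results and copy the analyzed data into it
gsutil mb gs://example-results-bucket
gsutil -m cp -r ./analysis_results gs://example-results-bucket/
```

The notebooks walk through each of these steps in detail; this is only the overall shape of the workflow.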

Binary file removed GoogleCloud/images/updated_Dartmouth_AD.png
Binary file not shown.
1 change: 1 addition & 0 deletions GoogleCloud/images/updated_Dartmouth_AD.svg
