From 90fe92d1f72b6bc46bb5c15ef6eabf06f2bc837a Mon Sep 17 00:00:00 2001 From: Kyle O'Connell Date: Fri, 24 May 2024 11:45:57 -0400 Subject: [PATCH] updated notebook formatting to align with standard format --- notebooks/GWAS/GWAS_coat_color.ipynb | 92 +- notebooks/GenAI/Pubmed_RAG_chatbot.ipynb | 170 +-- notebooks/SRADownload/SRA-Download.ipynb | 290 ++-- .../SpleenSeg_Pretrained-4_27.ipynb | 1202 ++--------------- notebooks/pangolin/pangolin_pipeline.ipynb | 124 +- .../RNAseq_pipeline.ipynb | 102 +- tutorials/README.md | 96 -- .../notebooks/GenAI/Pubmed_RAG_chatbot.ipynb | 868 ------------ 8 files changed, 566 insertions(+), 2378 deletions(-) delete mode 100644 tutorials/README.md delete mode 100644 tutorials/notebooks/GenAI/Pubmed_RAG_chatbot.ipynb diff --git a/notebooks/GWAS/GWAS_coat_color.ipynb b/notebooks/GWAS/GWAS_coat_color.ipynb index 711a4d8..c36fa85 100644 --- a/notebooks/GWAS/GWAS_coat_color.ipynb +++ b/notebooks/GWAS/GWAS_coat_color.ipynb @@ -6,8 +6,36 @@ "metadata": {}, "source": [ "# GWAS in the cloud\n", + "## Overview\n", "We adapted the NIH CFDE tutorial from [here](https://training.nih-cfde.org/en/latest/Bioinformatic-Analyses/GWAS-in-the-cloud/background/) and fit it to a notebook. We have greatly simplified the instructions, so if you need or want more details, look at the full tutorial to find out more.\n", - "Most of this notebook is bash, but expects that you are using a Python kernel, until step 3, plotting, you will need to switch your kernel to R." + "\n", + "Most of this notebook is written in Bash, but expects that you are using a Python kernel, until step 3, plotting where you will need to switch your kernel to R." 
+ ] + }, + { + "cell_type": "markdown", + "id": "3edafe63", + "metadata": {}, + "source": [ + "## Learning Objectives\n", + "The goal is to learn how to execute a GWAS analysis in a cloud environment" + ] + }, + { + "cell_type": "markdown", + "id": "5d7ef396", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "+ You only need access to a Sagemaker notebook environment to run this notebook" + ] + }, + { + "cell_type": "markdown", + "id": "39ee9668", + "metadata": {}, + "source": [ + "## Get Started" ] }, { @@ -15,8 +43,9 @@ "id": "8fbf6304", "metadata": {}, "source": [ - "## 1. Setup\n", - "### Download the data\n", + "### Install packages and set up environment\n", + "\n", + "#### Download the data\n", "use %%bash to denote a bash block. You can also use '!' to denote a single bash command within a Python notebook" ] }, @@ -68,7 +97,31 @@ "tags": [] }, "source": [ - "## 1. Install dependencies" + "### Install dependencies" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9f5032d7", + "metadata": {}, + "outputs": [], + "source": [ + "# install mamba\n", + "! curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh\n", + "! bash Mambaforge-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1a5bd340", + "metadata": {}, + "outputs": [], + "source": [ + "# add to your path\n", + "import os\n", + "os.environ[\"PATH\"] += os.pathsep + os.environ[\"HOME\"]+\"/mambaforge/bin\"" ] }, { @@ -78,6 +131,7 @@ "metadata": {}, "outputs": [], "source": [ + "# install everything else\n", "! mamba install -y -c bioconda plink vcftools" ] }, @@ -86,7 +140,7 @@ "id": "3de2fc4c", "metadata": {}, "source": [ - "## 2. Analyze" + "## Analyze" ] }, { @@ -266,7 +320,7 @@ "id": "1f52e97c", "metadata": {}, "source": [ - "## 3. 
Plotting\n", + "## Plotting\n", "In this tutorial, plotting is done in R, so at this point you can change your kernel to R in the top right. Wait for it to say 'idle' in the bottom left, then continue. You could also plot using Python native packages and maintain the Python notebook kernel." ] }, @@ -359,6 +413,32 @@ "\n", "The top associated mutation is a nonsense SNP in the gene MC1R known to control pigment production. The MC1R allele encoding yellow coat color contains a single base change (from C to T) at the 916th nucleotide." ] + }, + { + "cell_type": "markdown", + "id": "2f6e1ef6", + "metadata": {}, + "source": [ + "### Conclusion\n", + "Here we learned how to run a simple GWAS analysis in the cloud" + ] + }, + { + "cell_type": "markdown", + "id": "044a04d8", + "metadata": {}, + "source": [ + "## Clean up\n", + "Make sure you shut down this VM, or delete it if you don't plan to use if further.\n", + "\n", + "You can also [delete the buckets](https://docs.aws.amazon.com/AmazonS3/latest/userguide/delete-bucket.html) if you don't want to pay for the data: `aws s3 rb s3://bucket-name --force`" + ] + }, + { + "cell_type": "markdown", + "id": "c1e7be16", + "metadata": {}, + "source": [] } ], "metadata": { diff --git a/notebooks/GenAI/Pubmed_RAG_chatbot.ipynb b/notebooks/GenAI/Pubmed_RAG_chatbot.ipynb index 09aa19a..81daf02 100644 --- a/notebooks/GenAI/Pubmed_RAG_chatbot.ipynb +++ b/notebooks/GenAI/Pubmed_RAG_chatbot.ipynb @@ -8,19 +8,50 @@ "# Creating a PubMed Chatbot with Llama2" ] }, + { + "cell_type": "markdown", + "id": "6824ed09", + "metadata": {}, + "source": [ + "## Overview" + ] + }, { "cell_type": "markdown", "id": "3ecea2ad-7c65-4367-87e1-b021167c3a1d", "metadata": {}, "source": [ - "For this tutorial we are creating a PubMed chatbot that will answer questions by gathering information from documents we have provided via a index. 
The model we will be using today is a pretrained Llama2 model from Jumpstart.\n", - "\n", - "This tutorial will go over the following topics:\n", + "For this tutorial we are creating a PubMed chatbot that will answer questions by gathering information from documents we have provided via an index. The model we will be using today is a pretrained Llama2 model from Jumpstart." + ] + }, + { + "cell_type": "markdown", + "id": "df0a5b23", + "metadata": {}, + "source": [ + "## Learning Objectives\n", "- Introduce langchain\n", "- Explain the differences between zero-shot, one-shot, and few-shot prompting\n", "- Practice using different document retrievers" ] }, + { + "cell_type": "markdown", + "id": "41f7a40b", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "You need access to SageMaker, Model Jumpstart, S3, and Kendra" + ] + }, + { + "cell_type": "markdown", + "id": "0da27877", + "metadata": {}, + "source": [ + "## Get Started" + ] + }, { "cell_type": "markdown", "id": "4d01e74b-b5b4-4be9-b16e-ec55419318ef", @@ -34,12 +65,12 @@ "id": "9dbd13e7-afc9-416b-94dc-418a93e14587", "metadata": {}, "source": [ - "Identify which model we want to deploy from Jumpstart, in this case we are using Llama2." + "Identify which model we want to deploy from Jumpstart, in this case we are using Llama2 with 7 billion parameters." ] }, { "cell_type": "code", - "execution_count": 1, + "execution_count": null, "id": "6b51bf71-d2e5-4afc-8569-338767b43b9c", "metadata": { "tags": [] @@ -60,49 +91,17 @@ "id": "624c9bb8-3ce2-4240-b2b8-6b1bb93bb9f2", "metadata": {}, "source": [ - "Create an endpoint so that we can communicate with our model, send inputs, and retrieve outputs." + "Create an endpoint to deploy our model locally, so that we can communicate with our model, send inputs, and retrieve outputs." 
] }, { "cell_type": "code", - "execution_count": 2, + "execution_count": null, "id": "bf27747d-443f-47e7-9d2c-a8f5c5c6f3b8", "metadata": { "tags": [] }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/home/ec2-user/anaconda3/envs/python3/lib/python3.10/site-packages/pandas/core/computation/expressions.py:21: UserWarning: Pandas requires version '2.8.0' or newer of 'numexpr' (version '2.7.3' currently installed).\n", - " from pandas.core.computation.check import NUMEXPR_INSTALLED\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml\n", - "sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "For forward compatibility, pin to model_version='2.*' in your JumpStartModel or JumpStartEstimator definitions. Note that major version upgrades may have different EULA acceptance terms and input/output signatures.\n", - "For forward compatibility, pin to model_version='2.*' in your JumpStartModel or JumpStartEstimator definitions. Note that major version upgrades may have different EULA acceptance terms and input/output signatures.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "-----------------!" - ] - } - ], + "outputs": [], "source": [ "from sagemaker.jumpstart.model import JumpStartModel\n", "\n", @@ -120,7 +119,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": null, "id": "1ad71f0d-3be5-4b03-9c1c-eb4585721fc8", "metadata": { "tags": [] @@ -143,7 +142,7 @@ "id": "5a820eea-1538-4f40-86c4-eb14fe09e127", "metadata": {}, "source": [ - "Our chatbot will rely on documents to answer our questions to do so we are supplying it a **vector index**. 
A vector index or index is a data structure that enables fast and accurate search and retrieval of vector embeddings from a large dataset of objects. We will be working with two options for our index PubMed API vs Kendra Index." + "Our chatbot respond to prompts based on the documents we supplied. This occurs via a **vector index**. A vector index is a data structure composed of vectorized embeddings (generated from our inputs) that enables fast and accurate search and retrieval from a large dataset of objects. We will explore using two methods to generate our index: PubMed API vs Kendra Index." ] }, { @@ -153,9 +152,9 @@ "source": [ "**What is the difference?**\n", "\n", - "The **PubMed API** is provided free by langchain to connect your model to more than **35 million citations** for biomedical literature from MEDLINE, life science journals, and online books. **Kendra index** is a AWS product that allows the user more **security and control** on which documents you wish to supply to your model. \n", + "The **PubMed API** is provided free by langchain to connect your model to more than **35 million citations** for biomedical literature from MEDLINE, life science journals, and online books. **Kendra index** is an AWS product that allows the user more **security and control** on which documents you wish to supply to your model. \n", "\n", - "We will be exploring both methods to see which produces the best results!" + "We will explore both methods to see which produces the best results!" ] }, { @@ -179,7 +178,7 @@ "id": "1d1c9de7-4a06-4f85-b9ff-c8c9e51f8c70", "metadata": {}, "source": [ - "AWS marketplace has PubMed database named **PubMed Central® (PMC)** that contains free full-text archive of biomedical and life sciences journal article at the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM). We will be subsetting this database to add documents to our Kendra index. 
Ensure that you have the correct roles and policies to allow your environment to connect to S3 buckets, SageMaker, and Kendra." + "AWS marketplace provides a PubMed database named **PubMed Central® (PMC)** that contains free full-text archive of biomedical and life sciences journal article at the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM). We will subset this database to add documents to our Kendra index. Ensure that you have the correct roles and policies to allow your environment to connect to S3 buckets, SageMaker, and Kendra." ] }, { @@ -199,7 +198,7 @@ "source": [ "#make bucket\n", "bucket = 'pubmed-chat-docs'\n", - "!aws s3 mb s3://{bucket}" + "! aws s3 mb s3://{bucket}" ] }, { @@ -207,7 +206,7 @@ "id": "b6ad30ba-cee8-47f9-bc1e-ece8961ac66a", "metadata": {}, "source": [ - "We will then download the metadata file from the PMC bucket, this will list all of the articles within the PMC bucket and their bucket paths we will use this to subset the database into our own bucket." + "Next we will download the metadata file from the PMC bucket. This metadata file will list all of the articles within the PMC bucket and their paths. We will use these data to subset the database into our own bucket." ] }, { @@ -218,7 +217,7 @@ "outputs": [], "source": [ "#download the metadata file\n", - "!aws s3 cp s3://pmc-oa-opendata/oa_comm/txt/metadata/txt/oa_comm.filelist.txt . --sse" + "! aws s3 cp s3://pmc-oa-opendata/oa_comm/txt/metadata/txt/oa_comm.filelist.txt . --sse" ] }, { @@ -226,31 +225,24 @@ "id": "93a8595a-767f-4cad-9273-62d8e2cf60d1", "metadata": {}, "source": [ - "We only want the metadata of the first 100 files." + "We only want the metadata of the first 100 files to keep this tutorial short." 
] }, { "cell_type": "code", - "execution_count": 1, + "execution_count": null, "id": "c26b0f29-2b07-43a6-800d-4aa5e957fe52", "metadata": { "tags": [] }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/home/ec2-user/anaconda3/envs/python3/lib/python3.10/site-packages/pandas/core/computation/expressions.py:21: UserWarning: Pandas requires version '2.8.0' or newer of 'numexpr' (version '2.7.3' currently installed).\n", - " from pandas.core.computation.check import NUMEXPR_INSTALLED\n" - ] - } - ], + "outputs": [], "source": [ "#import the file as a dataframe\n", "import pandas as pd\n", "import os\n", + "\n", "df = pd.read_csv('oa_comm.filelist.csv')\n", + "\n", "#first 100 files\n", "first_100=df[0:101]\n", "#save new metadata\n", @@ -262,7 +254,7 @@ "id": "abd1ae93-450e-4c79-83cc-ea46a1b507c1", "metadata": {}, "source": [ - "Lets look at our metadata! We can see that the bucket path to the files are under the **Key** column this is what we will use to loop through the PMC bucket and copy the first 100 files to our bucket." + "Lets look at our metadata! We can see that the bucket path to the files are under the **Key** column. This column is what we will use to loop through the PMC bucket and copy the first 100 files to our bucket." ] }, { @@ -303,7 +295,7 @@ "metadata": {}, "outputs": [], "source": [ - "!aws s3 cp oa_comm.filelist_100.csv s3://{bucket}/docs/" + "! aws s3 cp oa_comm.filelist_100.csv s3://{bucket}/docs/" ] }, { @@ -329,13 +321,13 @@ "id": "3ba2291e-109e-4120-ad10-5dbfd341a07b", "metadata": {}, "source": [ - "Inorder for us to fluidly send input and receive outputs from our chatbot we need to create a **inference script** that will format inputs in a way that the chatbot can understand and format outputs in a way we can understand. 
We will also be supplying instructions to the chatbot through the script.\n", + "For us to fluidly send input and receive outputs from our chatbot we need to create an [**inference script**](https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html#deploy-model-options) that will format inputs in a way that the chatbot can understand and format outputs in a way we can understand. We will also be supplying instructions to the chatbot through the script.\n", "\n", - "Our script will utilize **langchain** tools and packages to enable our model to:\n", + "Our script will utilize **LangChain** tools and packages to enable our model to:\n", "- **Connect to sources of context** (e.g. providing our model with tasks and examples)\n", "- **Rely on reason** (e.g. instruct our model how to answer based on provided context)\n", "\n", - "**Warning**: The over all inference script must be run on the terminal via the command `python YOUR_SCRIPT.py`." + "**Warning**: The inference script must be run on the terminal via the command `python YOUR_SCRIPT.py`." ] }, { @@ -734,28 +726,17 @@ "id": "1abcbd48-bb84-4310-b8eb-ad87850a8649", "metadata": {}, "source": [ - "Running our script in the terminal will require us to export the following global variables then running our python script. Dont forget to run you python script on the terminal use the command `python NAME_OF_YOUR_SCRIPT.py`. For more guidence take a look at our **example inference scripts** for the [PubMed API](/example_scripts/langchain_chat_llama_2_zeroshot.py) and [Kendra](/example_scripts/kendra_chat_llama_2.py)." + "Running our script in the terminal will require us to export the following global variables before running the python script. Don't forget to run you python script on the terminal using the command `python NAME_OF_YOUR_SCRIPT.py`. 
For more guidence take a look at our **example inference scripts** for the [PubMed API](/example_scripts/langchain_chat_llama_2_zeroshot.py) and [Kendra](/example_scripts/kendra_chat_llama_2.py)." ] }, { "cell_type": "code", - "execution_count": 4, + "execution_count": null, "id": "ba97df23-6893-438d-8a67-cb7dbf83e407", "metadata": { "tags": [] }, - "outputs": [ - { - "data": { - "text/plain": [ - "'meta-textgeneration-llama-2-7b-f-2023-11-21-20-18-40-341'" - ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "#retreive our endpoint id\n", "endpoint_id" @@ -768,7 +749,7 @@ "metadata": {}, "outputs": [], "source": [ - "#enter the global variables in your terminal\n", + "#enter these global variables in your terminal\n", "export AWS_REGION=''\n", "export LLAMA_2_ENDPOINT=''\n", "export KENDRA_INDEX_ID=''" @@ -789,7 +770,16 @@ "id": "80c8fb4b-e74f-4e8d-892b-0f913eff747d", "metadata": {}, "source": [ - "![PubMed Chatbot Results](../../../docs/images/PubMed_chatbot_results.png)" + "![PubMed Chatbot Results](../../docs/images/PubMed_chatbot_results.png)" + ] + }, + { + "cell_type": "markdown", + "id": "ddeee006", + "metadata": {}, + "source": [ + "## Conclusions\n", + "Here you learned how to deploy a model, create a vector database (index) from PubMed documents, and then interact with a model to product predictions using an inference script. 
" ] }, { @@ -810,22 +800,10 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": null, "id": "c307bb17-757a-4579-a0d8-698eb1bb3f2e", "metadata": {}, - "outputs": [ - { - "ename": "NameError", - "evalue": "name 'endpoint' is not defined", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", - "Cell \u001b[0;32mIn[2], line 3\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;66;03m#Delete model and endpoint\u001b[39;00m\n\u001b[1;32m 2\u001b[0m \u001b[38;5;66;03m#model.delete()\u001b[39;00m\n\u001b[0;32m----> 3\u001b[0m \u001b[43mendpoint\u001b[49m\u001b[38;5;241m.\u001b[39mdelete()\n", - "\u001b[0;31mNameError\u001b[0m: name 'endpoint' is not defined" - ] - } - ], + "outputs": [], "source": [ "#Delete model and endpoint\n", "model.delete()\n", @@ -840,7 +818,7 @@ "outputs": [], "source": [ "#Delete bucket\n", - "!aws s3 rb s3://$bucket --force " + "! aws s3 rb s3://$bucket --force " ] } ], diff --git a/notebooks/SRADownload/SRA-Download.ipynb b/notebooks/SRADownload/SRA-Download.ipynb index 3f36742..0866d68 100644 --- a/notebooks/SRADownload/SRA-Download.ipynb +++ b/notebooks/SRADownload/SRA-Download.ipynb @@ -18,21 +18,49 @@ "DNA sequence data are typically deposited into the NCBI Sequence Read Archive, and can be accessed through the SRA website, or via a collection of command line tools called SRA Toolkit. Individual sequence entries are assigned an Accession ID, which can be used to find and download a particular file. For example, if you go to the [SRA database](https://www.ncbi.nlm.nih.gov/sra) in a browser window, and search for `SRX15695630`, you should see an entry for _C. elegans_. Alternatively, you can search the SRA metadata using Amazon Athena and generate a list of accession numbers. 
Here we are going to generate a list of accessions using Athena, use tools from the SRA Toolkit to download a few fastq files, then copy those fastq files to a cloud bucket. We really only scratch the surface of how to search Athena using SQL. If you want more examples, you can also try the notebooks from [this SRA GitHub repo](https://github.com/ncbi/ASHG-Workshop-2021). " ] }, + { + "cell_type": "markdown", + "id": "989d4270", + "metadata": {}, + "source": [ + "## Learning Objectives\n", + "+ Learn more about the Sequence Read Archive\n", + "+ Learn how to download SRA data locally\n", + "+ Learn how to interact with SRA metadata via Athena" + ] + }, + { + "cell_type": "markdown", + "id": "0b19079d", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "+ Make sure you have access to SageMaker, Athena and Glue" + ] + }, + { + "cell_type": "markdown", + "id": "877bcdc1", + "metadata": {}, + "source": [ + "## Get Started" + ] + }, { "cell_type": "markdown", "id": "39f62f42", "metadata": {}, "source": [ - "### 1) Set up your Athena Database\n", - "You need to set up your Athena database in the Athena console before you start this notebook. Follow our [guide](https://github.com/STRIDES/NIHCloudLabAWS/blob/main/docs/create_athena_database.md) to walk you through it." + "### Set up environment and install dependencies" ] }, { "cell_type": "markdown", - "id": "7aed7098", + "id": "aadfbc50", "metadata": {}, "source": [ - "### 2) Install Dependencies" + "### Set up your Athena Database\n", + "You need to set up your Athena database in the Athena console before you start this notebook. Follow our [guide](https://github.com/STRIDES/NIHCloudLabAWS/blob/main/docs/create_athena_database.md) to walk you through it." ] }, { @@ -43,6 +71,30 @@ "At the time of writing, the version of SRA tools available with the Anaconda distribution was v.2.11.0. 
If you want to install the latest version, download and install from [here](https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit). If you do the direct install, you will also need to configure interactively following [this guide](https://github.com/ncbi/sra-tools/wiki/05.-Toolkit-Configuration), you can do that by opening a terminal and running the commands there." ] }, + { + "cell_type": "code", + "execution_count": null, + "id": "5aa7fc7d", + "metadata": {}, + "outputs": [], + "source": [ + "# install mamba\n", + "! curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh\n", + "! bash Mambaforge-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0e57ca51", + "metadata": {}, + "outputs": [], + "source": [ + "# add to your path\n", + "import os\n", + "os.environ[\"PATH\"] += os.pathsep + os.environ[\"HOME\"]+\"/mambaforge/bin\"" + ] + }, { "cell_type": "code", "execution_count": null, @@ -50,6 +102,7 @@ "metadata": {}, "outputs": [], "source": [ + "# install everything else\n", "! mamba install -c bioconda -c conda-forge sra-tools==2.11.0 sql-magic pyathena -y" ] }, @@ -68,7 +121,7 @@ "metadata": {}, "outputs": [], "source": [ - "!fasterq-dump -h" + "! 
fasterq-dump -h" ] }, { @@ -76,7 +129,7 @@ "id": "ddc46609", "metadata": {}, "source": [ - "### 3) Setup Directory Structure and Create a Staging Bucket" + "### Setup Directory Structure and Create a Staging Bucket" ] }, { @@ -91,18 +144,10 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": null, "id": "827f2447", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "/home/ec2-user/SageMaker/NIHCloudLabAWS/tutorials/notebooks/SRADownload/data\n" - ] - } - ], + "outputs": [], "source": [ "cd data/" ] @@ -114,7 +159,7 @@ "metadata": {}, "outputs": [], "source": [ - "#make sure you change this name, it needs to be globally unique\n", + "# make sure you change this name, it needs to be globally unique\n", "%env BUCKET=sra-data-athena" ] }, @@ -135,7 +180,7 @@ "id": "086a50c1", "metadata": {}, "source": [ - "### 4) Create Accession List using Athena" + "### Create a list of SRA accessions using Athena" ] }, { @@ -161,7 +206,7 @@ "metadata": {}, "outputs": [], "source": [ - "#import packages\n", + "# import packages\n", "from pyathena import connect\n", "import pandas as pd" ] @@ -191,6 +236,7 @@ "metadata": {}, "source": [ "**When you run the query in the next cell you may get this error**:\n", + "\n", "`An error occurred (AccessDeniedException) when calling the StartQueryExecution operation: User: arn:aws:sts::055102001469:assumed-role/sagemaker-notebook-instance-role/SageMaker is not authorized to perform: athena:StartQueryExecution on resource: arn:aws:athena:us-east-1:055102001469:workgroup/primary because no identity-based policy allows the athena:StartQueryExecution action`\n", "\n", "If you get this error, read our [IAM guide](https://github.com/STRIDES/NIHCloudLabAWS/blob/main/docs/update_sagemaker_role.md) to set up the correct policy for your Sagemaker role. 
\n" @@ -201,7 +247,7 @@ "id": "5830d8e3", "metadata": {}, "source": [ - "Now that the permissions are all set up, let's download bacterial samples. You could change the SQL query as you like, feel free to take a look at the generated df, and then play with different parameters. For more inspiration of what is possible with SQL queries, look at this [SRA tutorial](https://github.com/ncbi/ASHG-Workshop-2021/blob/main/3_Biology_Example_AWS_Demo.ipynb)." + "Now that the permissions are all set up, let's download bacterial samples from the [NIGMS Sandbox RNAseq tutorial](https://github.com/NIGMS/RNA-Seq-Differential-Expression-Analysis). You could change the SQL query as you like, feel free to take a look at the generated df, and then play with different parameters. For more inspiration of what is possible with SQL queries, look at this [SRA tutorial](https://github.com/ncbi/ASHG-Workshop-2021/blob/main/3_Biology_Example_AWS_Demo.ipynb)." ] }, { @@ -228,7 +274,7 @@ "id": "d3511937", "metadata": {}, "source": [ - "As you can see, most of what you need to know is shown in this data frame. If you wanted to just show the accession, you could replace the * for acc in the SELECT command. One other thing to think about is how large are these files, and do you have space on your VM to download them? You can figure this out by looking at the 'jattr' column, and then converting the number of bites to GB, then add that for a few samples to get a ballpark figure. If you need more space, stop the VM, go Compute Engine and either [resize your disk](https://aws.amazon.com/blogs/machine-learning/customize-your-notebook-volume-size-up-to-16-tb-with-amazon-sagemaker/). Make sure you stop your notebook instance to Edit and resize it. You can see the amount of space on your disk from the command line using `!df -h .`" + "As you can see, most of what you need to know is shown in this data frame. If you wanted to just show the accession, you could replace the * for acc in the SELECT command. 
One other thing to think about is how large are these files, and do you have space on your VM to download them? You can figure this out by looking at the 'jattr' column, and then converting the number of bites to GB, then add that for a few samples to get a ballpark figure. If you need more space, stop the VM, go to SageMaker and [resize your disk](https://aws.amazon.com/blogs/machine-learning/customize-your-notebook-volume-size-up-to-16-tb-with-amazon-sagemaker/). Make sure you stop your notebook instance to Edit and resize it. You can see the amount of space on your disk from the command line using `! df -h .`" ] }, { @@ -256,7 +302,7 @@ "metadata": {}, "outputs": [], "source": [ - "!vdb-dump --info SRR13349124 " + "! vdb-dump --info SRR13349124 " ] }, { @@ -266,7 +312,7 @@ "metadata": {}, "outputs": [], "source": [ - "!srapath SRR13349124" + "! srapath SRR13349124" ] }, { @@ -296,7 +342,7 @@ "metadata": {}, "outputs": [], "source": [ - "cat list_of_accessionIDS.txt" + "! cat list_of_accessionIDS.txt" ] }, { @@ -304,7 +350,7 @@ "id": "01437b57", "metadata": {}, "source": [ - "### 5) Download FASTQ files with fasterq dump" + "### Download FASTQ files with fasterq dump" ] }, { @@ -317,18 +363,10 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": null, "id": "4764f355", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "/home/ec2-user/SageMaker/NIHCloudLabAWS/tutorials/notebooks/SRADownload/data/fasterqdump\n" - ] - } - ], + "outputs": [], "source": [ "cd fasterqdump/" ] @@ -343,33 +381,13 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": null, "id": "80c2e3b4", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "spots read : 2,054,166\n", - "reads read : 4,108,332\n", - "reads written : 4,108,332\n", - "spots read : 25,734,849\n", - "reads read : 51,469,698\n", - "reads written : 25,734,849\n", - "reads 0-length : 25,734,849\n", 
- "spots read : 18,624,005\n", - "reads read : 37,248,010\n", - "reads written : 18,624,005\n", - "reads 0-length : 18,624,005\n", - "CPU times: user 6.18 s, sys: 1.26 s, total: 7.44 s\n", - "Wall time: 6min 36s\n" - ] - } - ], + "outputs": [], "source": [ "%%time\n", - "!for x in `cat ../list_of_accessionIDS.txt`; do fasterq-dump -f -O raw_fastq -e 8 -m 4G $x ; done" + "! for x in `cat ../list_of_accessionIDS.txt`; do fasterq-dump -f -O raw_fastq -e 8 -m 4G $x ; done" ] }, { @@ -385,7 +403,7 @@ "id": "55bd52cd", "metadata": {}, "source": [ - "### 6) Download FASTQ files with prefetch + fasterq dump" + "### Download FASTQ files with prefetch + fasterq dump" ] }, { @@ -398,76 +416,31 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": null, "id": "ddefec2d", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "/home/ec2-user/SageMaker/NIHCloudLabAWS/tutorials/notebooks/SRADownload/data/prefetch_fasterqdump\n" - ] - } - ], + "outputs": [], "source": [ "cd ../prefetch_fasterqdump" ] }, { "cell_type": "code", - "execution_count": 10, + "execution_count": null, "id": "935f6ca2", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "2022-08-30T15:45:12 prefetch.2.11.0: 1) Downloading 'SRR3617061'...\n", - "2022-08-30T15:45:12 prefetch.2.11.0: Downloading via HTTPS...\n", - "2022-08-30T15:45:16 prefetch.2.11.0: HTTPS download succeed\n", - "2022-08-30T15:45:17 prefetch.2.11.0: 'SRR3617061' is valid\n", - "2022-08-30T15:45:17 prefetch.2.11.0: 1) 'SRR3617061' was downloaded successfully\n", - "\n", - "2022-08-30T15:45:17 prefetch.2.11.0: 2) Downloading 'SRR8435254'...\n", - "2022-08-30T15:45:17 prefetch.2.11.0: Downloading via HTTPS...\n", - "2022-08-30T15:45:23 prefetch.2.11.0: HTTPS download succeed\n", - "2022-08-30T15:45:24 prefetch.2.11.0: 'SRR8435254' is valid\n", - "2022-08-30T15:45:24 prefetch.2.11.0: 2) 'SRR8435254' was downloaded successfully\n", - 
"2022-08-30T15:45:24 prefetch.2.11.0: 'SRR8435254' has 0 dependencies\n", - "\n", - "2022-08-30T15:45:24 prefetch.2.11.0: 3) Downloading 'SRR8435252'...\n", - "2022-08-30T15:45:24 prefetch.2.11.0: Downloading via HTTPS...\n", - "2022-08-30T15:45:28 prefetch.2.11.0: HTTPS download succeed\n", - "2022-08-30T15:45:29 prefetch.2.11.0: 'SRR8435252' is valid\n", - "2022-08-30T15:45:29 prefetch.2.11.0: 3) 'SRR8435252' was downloaded successfully\n", - "2022-08-30T15:45:29 prefetch.2.11.0: 'SRR8435252' has 0 dependencies\n", - "CPU times: user 290 ms, sys: 37.5 ms, total: 327 ms\n", - "Wall time: 17 s\n" - ] - } - ], + "outputs": [], "source": [ "%%time\n", - "!prefetch --option-file ../list_of_accessionIDS.txt -O raw_fastq -f yes" + "! prefetch --option-file ../list_of_accessionIDS.txt -O raw_fastq -f yes" ] }, { "cell_type": "code", - "execution_count": 13, + "execution_count": null, "id": "7eece75e", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\u001b[0m\u001b[01;34mSRR3617061\u001b[0m/ \u001b[01;34mSRR8435252\u001b[0m/ \u001b[01;34mSRR8435254\u001b[0m/\n" - ] - } - ], + "outputs": [], "source": [ "ls raw_fastq/" ] @@ -482,33 +455,13 @@ }, { "cell_type": "code", - "execution_count": 24, + "execution_count": null, "id": "1852a71a", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "spots read : 2,054,166\n", - "reads read : 4,108,332\n", - "reads written : 4,108,332\n", - "spots read : 25,734,849\n", - "reads read : 51,469,698\n", - "reads written : 25,734,849\n", - "reads 0-length : 25,734,849\n", - "spots read : 18,624,005\n", - "reads read : 37,248,010\n", - "reads written : 18,624,005\n", - "reads 0-length : 18,624,005\n", - "CPU times: user 1.49 s, sys: 308 ms, total: 1.8 s\n", - "Wall time: 1min 38s\n" - ] - } - ], + "outputs": [], "source": [ "%%time\n", - "!for x in `cat ../list_of_accessionIDS.txt`; do fasterq-dump -f -O raw_fastq -e 8 -m 4G raw_fastq/$x; done" + "! 
for x in `cat ../list_of_accessionIDS.txt`; do fasterq-dump -f -O raw_fastq -e 8 -m 4G raw_fastq/$x; done" ] }, { @@ -524,7 +477,7 @@ "id": "ea152fd7", "metadata": {}, "source": [ - "### Step 7) Copy Files to a Bucket" + "### Copy Files to a Bucket" ] }, { @@ -537,50 +490,31 @@ }, { "cell_type": "code", - "execution_count": 22, + "execution_count": null, "id": "ad73308f", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "upload: raw_fastq/SRR3617061/SRR3617061.sra to s3://sra-data-athena/raw_fastq/SRR3617061/SRR3617061.sra\n", - "upload: raw_fastq/SRR8435252/SRR8435252.sra to s3://sra-data-athena/raw_fastq/SRR8435252/SRR8435252.sra\n", - "upload: raw_fastq/SRR3617061_2.fastq to s3://sra-data-athena/raw_fastq/SRR3617061_2.fastq\n", - "upload: raw_fastq/SRR3617061_1.fastq to s3://sra-data-athena/raw_fastq/SRR3617061_1.fastq\n", - "upload: raw_fastq/SRR8435254/SRR8435254.sra to s3://sra-data-athena/raw_fastq/SRR8435254/SRR8435254.sra\n", - "upload: raw_fastq/SRR8435252.fastq to s3://sra-data-athena/raw_fastq/SRR8435252.fastq\n", - "upload: raw_fastq/SRR8435254.fastq to s3://sra-data-athena/raw_fastq/SRR8435254.fastq\n" - ] - } - ], + "outputs": [], "source": [ - "!aws s3 cp raw_fastq/*.fastq s3://sra-data-athena/raw_fastq --recursive" + "! aws s3 cp raw_fastq/*.fastq s3://sra-data-athena/raw_fastq --recursive" ] }, { "cell_type": "code", - "execution_count": 25, + "execution_count": null, "id": "072ebc9a", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " PRE SRR3617061/\n", - " PRE SRR8435252/\n", - " PRE SRR8435254/\n", - "2022-08-30 15:53:41 722868342 SRR3617061_1.fastq\n", - "2022-08-30 15:53:41 722868342 SRR3617061_2.fastq\n", - "2022-08-30 15:53:42 3903844648 SRR8435252.fastq\n", - "2022-08-30 15:53:56 5411343576 SRR8435254.fastq\n" - ] - } - ], + "outputs": [], + "source": [ + "! 
aws s3 ls s3://sra-data-athena/raw_fastq/" + ] + }, + { + "cell_type": "markdown", + "id": "4d5fda05", + "metadata": {}, "source": [ - "!aws s3 ls s3://sra-data-athena/raw_fastq/" + "## Conclusions \n", + "Here you learned how to generate a list of accessions using Athena and then use the SRA toolkit to download FASTQ files. We tested fasterq-dump on its own, but found that using prefetch then fasterq-dump is much faster. Finally, you learned how to copy directories to S3 using the `--recursive` flag." ] }, { @@ -588,9 +522,17 @@ "id": "a4026566", "metadata": {}, "source": [ - "### Step 8) Clean up\n", - "Make sure you shut down this VM, or delete it if you don't plan to use if further. You can also [delete the buckets](https://docs.aws.amazon.com/AmazonS3/latest/userguide/delete-bucket.html) if you don't want to pay for the data: `aws s3 rb s3://bucket-name --force`" + "## Clean up\n", + "Make sure you shut down this VM, or delete it if you don't plan to use it further.\n", + "\n", + "You can also [delete the buckets](https://docs.aws.amazon.com/AmazonS3/latest/userguide/delete-bucket.html) if you don't want to pay for the data: `aws s3 rb s3://bucket-name --force`" ] + }, + { + "cell_type": "markdown", + "id": "9f3bbfe3", + "metadata": {}, + "source": [] } ], "metadata": { @@ -601,9 +543,9 @@ "uri": "gcr.io/deeplearning-platform-release/base-cpu:m93" }, "kernelspec": { - "display_name": "conda_python3", + "display_name": "Python 3", "language": "python", - "name": "conda_python3" + "name": "python3" }, "language_info": { "codemirror_mode": { @@ -615,7 +557,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.12" + "version": "3.9.6" } }, "nbformat": 4, diff --git a/notebooks/SpleenLiverSegmentation/SpleenSeg_Pretrained-4_27.ipynb b/notebooks/SpleenLiverSegmentation/SpleenSeg_Pretrained-4_27.ipynb index 48b8141..166ebd1 100644 --- a/notebooks/SpleenLiverSegmentation/SpleenSeg_Pretrained-4_27.ipynb +++
b/notebooks/SpleenLiverSegmentation/SpleenSeg_Pretrained-4_27.ipynb @@ -5,33 +5,61 @@ "id": "1452463e", "metadata": {}, "source": [ - "## Spleen Model With NVIDIA Pretrain\n", - "- Uses Unet architecture\n", - "- Pretrained model at: https://ngc.nvidia.com/catalog/models/nvidia:med:clara_pt_spleen_ct_segmentation" + "# Spleen Segmentation With NVIDIA Pretrained Model\n", + "\n", + "## Overview\n", + "NVIDIA offers pretrained models that can be downloaded and applied to various tasks. Here we will do spleen segmentation using a pretrained model with a U-Net architecture found [here](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/monaitoolkit/models/monai_spleen_ct_segmentation)" + ] + }, + { + "cell_type": "markdown", + "id": "170d7954", + "metadata": {}, + "source": [ + "## Learning Objectives\n", + "+ Learn how to interact with NVIDIA pretrained models\n", + "+ Learn how to conduct medical image segmentation" + ] + }, + { + "cell_type": "markdown", + "id": "bb7c2472", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "You only need access to a SageMaker environment, and make sure you are running on a GPU. If not, stop your instance and resize to a T4 GPU machine." ] }, { "cell_type": "markdown", - "id": "f59ba435", + "id": "93aa3269", "metadata": {}, "source": [ - "##### Uncomment below to install all dependencies" + "## Get Started" + ] + }, + { + "cell_type": "markdown", + "id": "a66cdaf2", + "metadata": {}, + "source": [ + "### Install packages" ] }, { "cell_type": "code", - "execution_count": 1, + "execution_count": null, "id": "82db674f", "metadata": {}, "outputs": [], "source": [ - "#!pip install 'monai[all]'\n", - "#!pip install matplotlib " + "! pip install 'monai[all]'\n", + "! 
pip install matplotlib " ] }, { "cell_type": "code", - "execution_count": 2, + "execution_count": null, "id": "bb1228b3", "metadata": {}, "outputs": [], @@ -41,7 +69,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": null, "id": "540e5d47", "metadata": {}, "outputs": [], @@ -65,42 +93,10 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": null, "id": "07510582", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "MONAI version: 0.8.1\n", - "Numpy version: 1.21.1\n", - "Pytorch version: 1.9.0\n", - "MONAI flags: HAS_EXT = False, USE_COMPILED = False\n", - "MONAI rev id: 71ff399a3ea07aef667b23653620a290364095b1\n", - "\n", - "Optional dependencies:\n", - "Pytorch Ignite version: 0.4.8\n", - "Nibabel version: 3.2.1\n", - "scikit-image version: 0.18.2\n", - "Pillow version: 8.3.1\n", - "Tensorboard version: 2.5.0\n", - "gdown version: 3.13.0\n", - "TorchVision version: 0.10.0+cu111\n", - "tqdm version: 4.61.2\n", - "lmdb version: 1.2.1\n", - "psutil version: 5.8.0\n", - "pandas version: 1.3.0\n", - "einops version: 0.3.0\n", - "transformers version: 4.18.0\n", - "mlflow version: 1.25.1\n", - "\n", - "For details about installing the optional dependencies, please visit:\n", - " https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies\n", - "\n" - ] - } - ], + "outputs": [], "source": [ "import os\n", "import tempfile\n", @@ -157,7 +153,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": null, "id": "0be7401d", "metadata": {}, "outputs": [], @@ -175,18 +171,10 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": null, "id": "311c3282", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "monai_data/\n" - ] - } - ], + "outputs": [], "source": [ "directory = \"monai_data/\"\n", "root_dir = tempfile.mkdtemp() if directory is None else directory\n", @@ -203,20 
+191,10 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": null, "id": "da7cfede", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "2022-04-27 14:49:41,401 - INFO - Verified 'Task09_Spleen.tar', md5: 410d4a301da4e5b2f6f86ec3ddba524e.\n", - "2022-04-27 14:49:41,402 - INFO - File exists: monai_data/Task09_Spleen.tar, skipped downloading.\n", - "2022-04-27 14:49:41,403 - INFO - Non-empty folder exists in monai_data/Task09_Spleen, skipped extracting.\n" - ] - } - ], + "outputs": [], "source": [ "resource = \"https://msd-for-monai.s3-us-west-2.amazonaws.com/Task09_Spleen.tar\"\n", "md5 = \"410d4a301da4e5b2f6f86ec3ddba524e\"\n", @@ -236,7 +214,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": null, "id": "2515b177", "metadata": {}, "outputs": [], @@ -262,7 +240,7 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": null, "id": "2357d35d", "metadata": {}, "outputs": [], @@ -328,38 +306,10 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": null, "id": "ada5757a", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[{'image': 'monai_data/Task09_Spleen/imagesTr/spleen_56.nii.gz',\n", - " 'label': 'monai_data/Task09_Spleen/labelsTr/spleen_56.nii.gz'},\n", - " {'image': 'monai_data/Task09_Spleen/imagesTr/spleen_59.nii.gz',\n", - " 'label': 'monai_data/Task09_Spleen/labelsTr/spleen_59.nii.gz'},\n", - " {'image': 'monai_data/Task09_Spleen/imagesTr/spleen_6.nii.gz',\n", - " 'label': 'monai_data/Task09_Spleen/labelsTr/spleen_6.nii.gz'},\n", - " {'image': 'monai_data/Task09_Spleen/imagesTr/spleen_60.nii.gz',\n", - " 'label': 'monai_data/Task09_Spleen/labelsTr/spleen_60.nii.gz'},\n", - " {'image': 'monai_data/Task09_Spleen/imagesTr/spleen_61.nii.gz',\n", - " 'label': 'monai_data/Task09_Spleen/labelsTr/spleen_61.nii.gz'},\n", - " {'image': 'monai_data/Task09_Spleen/imagesTr/spleen_62.nii.gz',\n", - " 'label': 
'monai_data/Task09_Spleen/labelsTr/spleen_62.nii.gz'},\n", - " {'image': 'monai_data/Task09_Spleen/imagesTr/spleen_63.nii.gz',\n", - " 'label': 'monai_data/Task09_Spleen/labelsTr/spleen_63.nii.gz'},\n", - " {'image': 'monai_data/Task09_Spleen/imagesTr/spleen_8.nii.gz',\n", - " 'label': 'monai_data/Task09_Spleen/labelsTr/spleen_8.nii.gz'},\n", - " {'image': 'monai_data/Task09_Spleen/imagesTr/spleen_9.nii.gz',\n", - " 'label': 'monai_data/Task09_Spleen/labelsTr/spleen_9.nii.gz'}]" - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "val_files" ] @@ -374,30 +324,10 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": null, "id": "689eea4e", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "image shape: torch.Size([239, 239, 113]), label shape: torch.Size([239, 239, 113])\n" - ] - }, - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "check_ds = Dataset(data=val_files, transform=val_transforms)\n", "check_loader = DataLoader(check_ds, batch_size=1)\n", @@ -427,74 +357,10 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": null, "id": "fe3285d0", "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "100%|██████████| 32/32 [00:00<00:00, 57113.93it/s]\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Accessing lmdb file: /home/jupyter/covid19det-kaggle/kaggle/MonaiTesting/monai_data/monai_cache.lmdb.\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "100%|██████████| 32/32 [00:00<00:00, 47679.48it/s]\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'map_addr': 0, 'map_size': 1099511627776, 'last_pgno': 941102, 'last_txnid': 100, 'max_readers': 126, 'num_readers': 0, 'size': 32, 'filename': '/home/jupyter/covid19det-kaggle/kaggle/MonaiTesting/monai_data/monai_cache.lmdb'}\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "100%|██████████| 9/9 [00:00<00:00, 10999.05it/s]\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Accessing lmdb file: /home/jupyter/covid19det-kaggle/kaggle/MonaiTesting/monai_data/monai_cache.lmdb.\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "100%|██████████| 9/9 [00:00<00:00, 17739.07it/s]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'map_addr': 0, 'map_size': 1099511627776, 'last_pgno': 941102, 'last_txnid': 100, 'max_readers': 126, 'num_readers': 0, 'size': 9, 'filename': '/home/jupyter/covid19det-kaggle/kaggle/MonaiTesting/monai_data/monai_cache.lmdb'}\n" - ] - }, - { - "name": "stderr", -
"output_type": "stream", - "text": [ - "\n" - ] - } - ], + "outputs": [], "source": [ "train_ds = LMDBDataset(data=train_files, transform=train_transforms, cache_dir=root_dir)\n", "# initialize cache and print meta information\n", @@ -513,18 +379,10 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": null, "id": "455cbcdc", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'map_addr': 0, 'map_size': 1099511627776, 'last_pgno': 941102, 'last_txnid': 100, 'max_readers': 126, 'num_readers': 0, 'size': 32, 'filename': '/home/jupyter/covid19det-kaggle/kaggle/MonaiTesting/monai_data/monai_cache.lmdb'}\n" - ] - } - ], + "outputs": [], "source": [ "print(train_ds.info())" ] @@ -539,7 +397,7 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": null, "id": "8539fb7d", "metadata": {}, "outputs": [], @@ -558,50 +416,20 @@ }, { "cell_type": "code", - "execution_count": 15, + "execution_count": null, "id": "de7fb262", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'clara_pt_spleen_ct_segmentation'" - ] - }, - "execution_count": 15, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "mmar['name']" ] }, { "cell_type": "code", - "execution_count": 16, + "execution_count": null, "id": "bf96f9f9", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "using a pretrained model.\n", - "2022-04-27 14:49:45,704 - INFO - Expected md5 is None, skip md5 check for file monai_data/clara_pt_spleen_ct_segmentation_2.zip.\n", - "2022-04-27 14:49:45,705 - INFO - File exists: monai_data/clara_pt_spleen_ct_segmentation_2.zip, skipped downloading.\n", - "2022-04-27 14:49:45,706 - INFO - Non-empty folder exists in monai_data/clara_pt_spleen_ct_segmentation, skipped extracting.\n", - "2022-04-27 14:49:45,707 - INFO - \n", - "*** \"clara_pt_spleen_ct_segmentation\" available at 
monai_data/clara_pt_spleen_ct_segmentation.\n", - "2022-04-27 14:49:49,353 - INFO - *** Model: \n", - "2022-04-27 14:49:49,400 - INFO - *** Model params: {'dimensions': 3, 'in_channels': 1, 'out_channels': 2, 'channels': [16, 32, 64, 128, 256], 'strides': [2, 2, 2, 2], 'num_res_units': 2, 'norm': 'batch'}\n", - "2022-04-27 14:49:49,411 - INFO - \n", - "---\n", - "2022-04-27 14:49:49,412 - INFO - For more information, please visit https://ngc.nvidia.com/catalog/models/nvidia:med:clara_pt_spleen_ct_segmentation\n", - "\n" - ] - } - ], + "outputs": [], "source": [ "device = torch.device(\"cuda:0\" if torch.cuda.is_available() else \"cpu\") #torch.device(\"cpu\")\n", "if PRETRAINED:\n", @@ -646,32 +474,10 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": null, "id": "4be7eb8f", "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "100%|██████████| 1/1 [00:00<00:00, 4639.72it/s]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Accessing lmdb file: /home/jupyter/covid19det-kaggle/kaggle/MonaiTesting/monai_data/monai_cache.lmdb.\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "\n" - ] - } - ], + "outputs": [], "source": [ "test_file = data_dicts[20:21]\n", "test_ds = LMDBDataset(data=test_file, transform=None, cache_dir=root_dir)" @@ -687,7 +493,7 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": null, "id": "16fd4e94", "metadata": {}, "outputs": [], @@ -712,33 +518,10 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": null, "id": "9782ec96", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 19, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "fig = plt.figure(frameon=False, figsize=(7,7))\n", "plt.title('Actual Spleen')\n", @@ -747,33 +530,10 @@ }, { "cell_type": "code", - "execution_count": 20, + "execution_count": null, "id": "76cd38e6", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 20, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "fig = plt.figure(frameon=False, figsize=(7,7))\n", "plt.title('Pretrained CalculatedSpleen')\n", @@ -782,33 +542,10 @@ }, { "cell_type": "code", - "execution_count": 21, + "execution_count": null, "id": "65c68242", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 21, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "fig = plt.figure(frameon=False, figsize=(7,7))\n", "plt.title('Differences Between Actual and Model')\n", @@ -837,7 +574,7 @@ }, { "cell_type": "code", - "execution_count": 22, + "execution_count": null, "id": "a8ad6aee", "metadata": {}, "outputs": [], @@ -848,526 +585,10 @@ }, { "cell_type": "code", - "execution_count": 23, + "execution_count": null, "id": "d91d340c", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "----------\n", - "epoch 1/25\n", - "1/16, train_loss: 0.8680\n", - "2/16, train_loss: 0.3699\n", - "3/16, train_loss: 0.3849\n", - "4/16, train_loss: 0.1306\n", - "5/16, train_loss: 0.2781\n", - "6/16, train_loss: 0.3628\n", - "7/16, train_loss: 0.3609\n", - "8/16, train_loss: 0.1828\n", - "9/16, train_loss: 0.1493\n", - "10/16, train_loss: 0.5063\n", - "11/16, train_loss: 0.2929\n", - "12/16, train_loss: 0.2826\n", - "13/16, train_loss: 0.2017\n", - "14/16, train_loss: 0.2591\n", - "15/16, train_loss: 0.2568\n", - "16/16, train_loss: 0.2385\n", - "epoch 1 average loss: 0.3203\n", - "----------\n", - "epoch 2/25\n", - "1/16, train_loss: 0.3457\n", - "2/16, train_loss: 0.2234\n", - "3/16, train_loss: 0.3443\n", - "4/16, train_loss: 0.0816\n", - "5/16, train_loss: 0.2259\n", - "6/16, train_loss: 0.1580\n", - "7/16, train_loss: 0.2593\n", - "8/16, train_loss: 0.1651\n", - "9/16, train_loss: 0.1124\n", - "10/16, train_loss: 0.4822\n", - "11/16, train_loss: 0.2900\n", - "12/16, train_loss: 0.2571\n", - "13/16, train_loss: 0.1799\n", - "14/16, train_loss: 0.1984\n", - "15/16, train_loss: 0.2286\n", - "16/16, train_loss: 0.2216\n", - "epoch 2 average loss: 0.2359\n", - "saved new best metric model\n", - "current epoch: 2 current mean dice: 0.8615\n", - "best mean dice: 0.8615 at epoch: 2\n", - "----------\n", - "epoch 3/25\n", - "1/16, train_loss: 0.3400\n", - "2/16, train_loss: 
0.2297\n", - "3/16, train_loss: 0.3453\n", - "4/16, train_loss: 0.0822\n", - "5/16, train_loss: 0.2285\n", - "6/16, train_loss: 0.1213\n", - "7/16, train_loss: 0.2370\n", - "8/16, train_loss: 0.1607\n", - "9/16, train_loss: 0.1065\n", - "10/16, train_loss: 0.4543\n", - "11/16, train_loss: 0.2848\n", - "12/16, train_loss: 0.2848\n", - "13/16, train_loss: 0.1763\n", - "14/16, train_loss: 0.1748\n", - "15/16, train_loss: 0.4361\n", - "16/16, train_loss: 0.2234\n", - "epoch 3 average loss: 0.2429\n", - "----------\n", - "epoch 4/25\n", - "1/16, train_loss: 0.3328\n", - "2/16, train_loss: 0.2447\n", - "3/16, train_loss: 0.3436\n", - "4/16, train_loss: 0.0723\n", - "5/16, train_loss: 0.2213\n", - "6/16, train_loss: 0.1676\n", - "7/16, train_loss: 0.2672\n", - "8/16, train_loss: 0.2121\n", - "9/16, train_loss: 0.1122\n", - "10/16, train_loss: 0.5265\n", - "11/16, train_loss: 0.2810\n", - "12/16, train_loss: 0.2688\n", - "13/16, train_loss: 0.1795\n", - "14/16, train_loss: 0.1853\n", - "15/16, train_loss: 0.2458\n", - "16/16, train_loss: 0.2314\n", - "epoch 4 average loss: 0.2433\n", - "saved new best metric model\n", - "current epoch: 4 current mean dice: 0.8744\n", - "best mean dice: 0.8744 at epoch: 4\n", - "----------\n", - "epoch 5/25\n", - "1/16, train_loss: 0.3378\n", - "2/16, train_loss: 0.2047\n", - "3/16, train_loss: 0.3350\n", - "4/16, train_loss: 0.0583\n", - "5/16, train_loss: 0.2161\n", - "6/16, train_loss: 0.1008\n", - "7/16, train_loss: 0.2325\n", - "8/16, train_loss: 0.1629\n", - "9/16, train_loss: 0.1037\n", - "10/16, train_loss: 0.4499\n", - "11/16, train_loss: 0.2763\n", - "12/16, train_loss: 0.2321\n", - "13/16, train_loss: 0.1702\n", - "14/16, train_loss: 0.1652\n", - "15/16, train_loss: 0.2206\n", - "16/16, train_loss: 0.2169\n", - "epoch 5 average loss: 0.2177\n", - "----------\n", - "epoch 6/25\n", - "1/16, train_loss: 0.3303\n", - "2/16, train_loss: 0.1888\n", - "3/16, train_loss: 0.3331\n", - "4/16, train_loss: 0.0535\n", - "5/16, train_loss: 
0.2149\n", - "6/16, train_loss: 0.0962\n", - "7/16, train_loss: 0.2267\n", - "8/16, train_loss: 0.1555\n", - "9/16, train_loss: 0.0995\n", - "10/16, train_loss: 0.4476\n", - "11/16, train_loss: 0.2751\n", - "12/16, train_loss: 0.2215\n", - "13/16, train_loss: 0.1644\n", - "14/16, train_loss: 0.1603\n", - "15/16, train_loss: 0.2159\n", - "16/16, train_loss: 0.2141\n", - "epoch 6 average loss: 0.2123\n", - "saved new best metric model\n", - "current epoch: 6 current mean dice: 0.8952\n", - "best mean dice: 0.8952 at epoch: 6\n", - "----------\n", - "epoch 7/25\n", - "1/16, train_loss: 0.3286\n", - "2/16, train_loss: 0.1815\n", - "3/16, train_loss: 0.3317\n", - "4/16, train_loss: 0.0487\n", - "5/16, train_loss: 0.2127\n", - "6/16, train_loss: 0.0926\n", - "7/16, train_loss: 0.2236\n", - "8/16, train_loss: 0.1536\n", - "9/16, train_loss: 0.0955\n", - "10/16, train_loss: 0.4468\n", - "11/16, train_loss: 0.2730\n", - "12/16, train_loss: 0.2171\n", - "13/16, train_loss: 0.1616\n", - "14/16, train_loss: 0.1565\n", - "15/16, train_loss: 0.2147\n", - "16/16, train_loss: 0.2123\n", - "epoch 7 average loss: 0.2094\n", - "----------\n", - "epoch 8/25\n", - "1/16, train_loss: 0.3276\n", - "2/16, train_loss: 0.1800\n", - "3/16, train_loss: 0.3311\n", - "4/16, train_loss: 0.0459\n", - "5/16, train_loss: 0.2114\n", - "6/16, train_loss: 0.0853\n", - "7/16, train_loss: 0.2206\n", - "8/16, train_loss: 0.1529\n", - "9/16, train_loss: 0.0939\n", - "10/16, train_loss: 0.4467\n", - "11/16, train_loss: 0.2725\n", - "12/16, train_loss: 0.2171\n", - "13/16, train_loss: 0.1600\n", - "14/16, train_loss: 0.1502\n", - "15/16, train_loss: 0.2140\n", - "16/16, train_loss: 0.2115\n", - "epoch 8 average loss: 0.2075\n", - "saved new best metric model\n", - "current epoch: 8 current mean dice: 0.8957\n", - "best mean dice: 0.8957 at epoch: 8\n", - "----------\n", - "epoch 9/25\n", - "1/16, train_loss: 0.3275\n", - "2/16, train_loss: 0.1822\n", - "3/16, train_loss: 0.3309\n", - "4/16, train_loss: 
0.0455\n", - "5/16, train_loss: 0.2110\n", - "6/16, train_loss: 0.0818\n", - "7/16, train_loss: 0.2194\n", - "8/16, train_loss: 0.1520\n", - "9/16, train_loss: 0.0917\n", - "10/16, train_loss: 0.4467\n", - "11/16, train_loss: 0.2723\n", - "12/16, train_loss: 0.2165\n", - "13/16, train_loss: 0.1593\n", - "14/16, train_loss: 0.1236\n", - "15/16, train_loss: 0.2136\n", - "16/16, train_loss: 0.2107\n", - "epoch 9 average loss: 0.2053\n", - "----------\n", - "epoch 10/25\n", - "1/16, train_loss: 0.3271\n", - "2/16, train_loss: 0.1726\n", - "3/16, train_loss: 0.3308\n", - "4/16, train_loss: 0.0439\n", - "5/16, train_loss: 0.2106\n", - "6/16, train_loss: 0.0886\n", - "7/16, train_loss: 0.2209\n", - "8/16, train_loss: 0.1518\n", - "9/16, train_loss: 0.0860\n", - "10/16, train_loss: 0.4452\n", - "11/16, train_loss: 0.2715\n", - "12/16, train_loss: 0.2150\n", - "13/16, train_loss: 0.1589\n", - "14/16, train_loss: 0.1150\n", - "15/16, train_loss: 0.2142\n", - "16/16, train_loss: 0.2095\n", - "epoch 10 average loss: 0.2038\n", - "saved new best metric model\n", - "current epoch: 10 current mean dice: 0.8958\n", - "best mean dice: 0.8958 at epoch: 10\n", - "----------\n", - "epoch 11/25\n", - "1/16, train_loss: 0.3271\n", - "2/16, train_loss: 0.1735\n", - "3/16, train_loss: 0.3314\n", - "4/16, train_loss: 0.0430\n", - "5/16, train_loss: 0.2099\n", - "6/16, train_loss: 0.0801\n", - "7/16, train_loss: 0.2201\n", - "8/16, train_loss: 0.1508\n", - "9/16, train_loss: 0.0721\n", - "10/16, train_loss: 0.4451\n", - "11/16, train_loss: 0.2714\n", - "12/16, train_loss: 0.2155\n", - "13/16, train_loss: 0.1592\n", - "14/16, train_loss: 0.1247\n", - "15/16, train_loss: 0.2139\n", - "16/16, train_loss: 0.2107\n", - "epoch 11 average loss: 0.2030\n", - "----------\n", - "epoch 12/25\n", - "1/16, train_loss: 0.3268\n", - "2/16, train_loss: 0.1712\n", - "3/16, train_loss: 0.3305\n", - "4/16, train_loss: 0.0453\n", - "5/16, train_loss: 0.2103\n", - "6/16, train_loss: 0.0783\n", - "7/16, 
train_loss: 0.2179\n", - "8/16, train_loss: 0.1529\n", - "9/16, train_loss: 0.0912\n", - "10/16, train_loss: 0.4469\n", - "11/16, train_loss: 0.2724\n", - "12/16, train_loss: 0.2162\n", - "13/16, train_loss: 0.1588\n", - "14/16, train_loss: 0.1072\n", - "15/16, train_loss: 0.2129\n", - "16/16, train_loss: 0.2091\n", - "epoch 12 average loss: 0.2030\n", - "saved new best metric model\n", - "current epoch: 12 current mean dice: 0.9008\n", - "best mean dice: 0.9008 at epoch: 12\n", - "----------\n", - "epoch 13/25\n", - "1/16, train_loss: 0.3266\n", - "2/16, train_loss: 0.1666\n", - "3/16, train_loss: 0.3304\n", - "4/16, train_loss: 0.0419\n", - "5/16, train_loss: 0.2105\n", - "6/16, train_loss: 0.0826\n", - "7/16, train_loss: 0.2195\n", - "8/16, train_loss: 0.1506\n", - "9/16, train_loss: 0.0553\n", - "10/16, train_loss: 0.4447\n", - "11/16, train_loss: 0.2715\n", - "12/16, train_loss: 0.2125\n", - "13/16, train_loss: 0.1575\n", - "14/16, train_loss: 0.1083\n", - "15/16, train_loss: 0.2135\n", - "16/16, train_loss: 0.2085\n", - "epoch 13 average loss: 0.2000\n", - "----------\n", - "epoch 14/25\n", - "1/16, train_loss: 0.3270\n", - "2/16, train_loss: 0.1647\n", - "3/16, train_loss: 0.3316\n", - "4/16, train_loss: 0.0405\n", - "5/16, train_loss: 0.2091\n", - "6/16, train_loss: 0.0686\n", - "7/16, train_loss: 0.2185\n", - "8/16, train_loss: 0.1499\n", - "9/16, train_loss: 0.0482\n", - "10/16, train_loss: 0.4443\n", - "11/16, train_loss: 0.2708\n", - "12/16, train_loss: 0.2106\n", - "13/16, train_loss: 0.1568\n", - "14/16, train_loss: 0.1043\n", - "15/16, train_loss: 0.2121\n", - "16/16, train_loss: 0.2079\n", - "epoch 14 average loss: 0.1978\n", - "saved new best metric model\n", - "current epoch: 14 current mean dice: 0.9015\n", - "best mean dice: 0.9015 at epoch: 14\n", - "----------\n", - "epoch 15/25\n", - "1/16, train_loss: 0.3259\n", - "2/16, train_loss: 0.1630\n", - "3/16, train_loss: 0.3303\n", - "4/16, train_loss: 0.0399\n", - "5/16, train_loss: 0.2085\n", - 
"6/16, train_loss: 0.0579\n", - "7/16, train_loss: 0.2165\n", - "8/16, train_loss: 0.1509\n", - "9/16, train_loss: 0.0487\n", - "10/16, train_loss: 0.4449\n", - "11/16, train_loss: 0.2704\n", - "12/16, train_loss: 0.2090\n", - "13/16, train_loss: 0.1557\n", - "14/16, train_loss: 0.1021\n", - "15/16, train_loss: 0.2118\n", - "16/16, train_loss: 0.2084\n", - "epoch 15 average loss: 0.1965\n", - "----------\n", - "epoch 16/25\n", - "1/16, train_loss: 0.3258\n", - "2/16, train_loss: 0.1620\n", - "3/16, train_loss: 0.3307\n", - "4/16, train_loss: 0.0394\n", - "5/16, train_loss: 0.2086\n", - "6/16, train_loss: 0.0699\n", - "7/16, train_loss: 0.2170\n", - "8/16, train_loss: 0.1516\n", - "9/16, train_loss: 0.0540\n", - "10/16, train_loss: 0.4444\n", - "11/16, train_loss: 0.2698\n", - "12/16, train_loss: 0.2102\n", - "13/16, train_loss: 0.1548\n", - "14/16, train_loss: 0.1016\n", - "15/16, train_loss: 0.2114\n", - "16/16, train_loss: 0.2078\n", - "epoch 16 average loss: 0.1974\n", - "current epoch: 16 current mean dice: 0.8994\n", - "best mean dice: 0.9015 at epoch: 14\n", - "----------\n", - "epoch 17/25\n", - "1/16, train_loss: 0.3255\n", - "2/16, train_loss: 0.1636\n", - "3/16, train_loss: 0.3300\n", - "4/16, train_loss: 0.0399\n", - "5/16, train_loss: 0.2085\n", - "6/16, train_loss: 0.0483\n", - "7/16, train_loss: 0.2150\n", - "8/16, train_loss: 0.1506\n", - "9/16, train_loss: 0.0446\n", - "10/16, train_loss: 0.4445\n", - "11/16, train_loss: 0.2692\n", - "12/16, train_loss: 0.2077\n", - "13/16, train_loss: 0.1515\n", - "14/16, train_loss: 0.0980\n", - "15/16, train_loss: 0.2110\n", - "16/16, train_loss: 0.2076\n", - "epoch 17 average loss: 0.1947\n", - "----------\n", - "epoch 18/25\n", - "1/16, train_loss: 0.3255\n", - "2/16, train_loss: 0.1614\n", - "3/16, train_loss: 0.3297\n", - "4/16, train_loss: 0.0381\n", - "5/16, train_loss: 0.2081\n", - "6/16, train_loss: 0.0422\n", - "7/16, train_loss: 0.2152\n", - "8/16, train_loss: 0.1485\n", - "9/16, train_loss: 0.0415\n", 
- "10/16, train_loss: 0.4442\n", - "11/16, train_loss: 0.2690\n", - "12/16, train_loss: 0.2070\n", - "13/16, train_loss: 0.1515\n", - "14/16, train_loss: 0.0980\n", - "15/16, train_loss: 0.2112\n", - "16/16, train_loss: 0.2068\n", - "epoch 18 average loss: 0.1936\n", - "current epoch: 18 current mean dice: 0.8991\n", - "best mean dice: 0.9015 at epoch: 14\n", - "----------\n", - "epoch 19/25\n", - "1/16, train_loss: 0.3254\n", - "2/16, train_loss: 0.1635\n", - "3/16, train_loss: 0.3297\n", - "4/16, train_loss: 0.0372\n", - "5/16, train_loss: 0.2078\n", - "6/16, train_loss: 0.0424\n", - "7/16, train_loss: 0.2145\n", - "8/16, train_loss: 0.1483\n", - "9/16, train_loss: 0.0402\n", - "10/16, train_loss: 0.4436\n", - "11/16, train_loss: 0.2695\n", - "12/16, train_loss: 0.2076\n", - "13/16, train_loss: 0.1514\n", - "14/16, train_loss: 0.1009\n", - "15/16, train_loss: 0.2116\n", - "16/16, train_loss: 0.2071\n", - "epoch 19 average loss: 0.1938\n", - "----------\n", - "epoch 20/25\n", - "1/16, train_loss: 0.3256\n", - "2/16, train_loss: 0.1616\n", - "3/16, train_loss: 0.3302\n", - "4/16, train_loss: 0.0376\n", - "5/16, train_loss: 0.2080\n", - "6/16, train_loss: 0.0756\n", - "7/16, train_loss: 0.2150\n", - "8/16, train_loss: 0.1476\n", - "9/16, train_loss: 0.0400\n", - "10/16, train_loss: 0.4440\n", - "11/16, train_loss: 0.2686\n", - "12/16, train_loss: 0.2071\n", - "13/16, train_loss: 0.1512\n", - "14/16, train_loss: 0.0990\n", - "15/16, train_loss: 0.2103\n", - "16/16, train_loss: 0.2066\n", - "epoch 20 average loss: 0.1955\n", - "current epoch: 20 current mean dice: 0.8984\n", - "best mean dice: 0.9015 at epoch: 14\n", - "----------\n", - "epoch 21/25\n", - "1/16, train_loss: 0.3253\n", - "2/16, train_loss: 0.1599\n", - "3/16, train_loss: 0.3295\n", - "4/16, train_loss: 0.0370\n", - "5/16, train_loss: 0.2074\n", - "6/16, train_loss: 0.0587\n", - "7/16, train_loss: 0.2138\n", - "8/16, train_loss: 0.1483\n", - "9/16, train_loss: 0.0479\n", - "10/16, train_loss: 0.4449\n", 
- "11/16, train_loss: 0.2684\n", - "12/16, train_loss: 0.2082\n", - "13/16, train_loss: 0.1520\n", - "14/16, train_loss: 0.1122\n", - "15/16, train_loss: 0.2110\n", - "16/16, train_loss: 0.2088\n", - "epoch 21 average loss: 0.1958\n", - "----------\n", - "epoch 22/25\n", - "1/16, train_loss: 0.3258\n", - "2/16, train_loss: 0.1628\n", - "3/16, train_loss: 0.3298\n", - "4/16, train_loss: 0.0395\n", - "5/16, train_loss: 0.2082\n", - "6/16, train_loss: 0.0614\n", - "7/16, train_loss: 0.2181\n", - "8/16, train_loss: 0.1566\n", - "9/16, train_loss: 0.0650\n", - "10/16, train_loss: 0.4442\n", - "11/16, train_loss: 0.2693\n", - "12/16, train_loss: 0.2118\n", - "13/16, train_loss: 0.1532\n", - "14/16, train_loss: 0.0998\n", - "15/16, train_loss: 0.2121\n", - "16/16, train_loss: 0.2076\n", - "epoch 22 average loss: 0.1978\n", - "saved new best metric model\n", - "current epoch: 22 current mean dice: 0.9054\n", - "best mean dice: 0.9054 at epoch: 22\n", - "----------\n", - "epoch 23/25\n", - "1/16, train_loss: 0.3266\n", - "2/16, train_loss: 0.1723\n", - "3/16, train_loss: 0.3315\n", - "4/16, train_loss: 0.0413\n", - "5/16, train_loss: 0.2091\n", - "6/16, train_loss: 0.0807\n", - "7/16, train_loss: 0.2143\n", - "8/16, train_loss: 0.1514\n", - "9/16, train_loss: 0.0432\n", - "10/16, train_loss: 0.4441\n", - "11/16, train_loss: 0.2704\n", - "12/16, train_loss: 0.2081\n", - "13/16, train_loss: 0.1532\n", - "14/16, train_loss: 0.0983\n", - "15/16, train_loss: 0.2106\n", - "16/16, train_loss: 0.2072\n", - "epoch 23 average loss: 0.1976\n", - "----------\n", - "epoch 24/25\n", - "1/16, train_loss: 0.3257\n", - "2/16, train_loss: 0.1711\n", - "3/16, train_loss: 0.3307\n", - "4/16, train_loss: 0.0376\n", - "5/16, train_loss: 0.2077\n", - "6/16, train_loss: 0.0705\n", - "7/16, train_loss: 0.2141\n", - "8/16, train_loss: 0.1482\n", - "9/16, train_loss: 0.0392\n", - "10/16, train_loss: 0.4439\n", - "11/16, train_loss: 0.2688\n", - "12/16, train_loss: 0.2070\n", - "13/16, train_loss: 
0.1512\n", - "14/16, train_loss: 0.0969\n", - "15/16, train_loss: 0.2098\n", - "16/16, train_loss: 0.2062\n", - "epoch 24 average loss: 0.1955\n", - "saved new best metric model\n", - "current epoch: 24 current mean dice: 0.9060\n", - "best mean dice: 0.9060 at epoch: 24\n", - "----------\n", - "epoch 25/25\n", - "1/16, train_loss: 0.3251\n", - "2/16, train_loss: 0.1621\n", - "3/16, train_loss: 0.3298\n", - "4/16, train_loss: 0.0367\n", - "5/16, train_loss: 0.2075\n", - "6/16, train_loss: 0.0430\n", - "7/16, train_loss: 0.2132\n", - "8/16, train_loss: 0.1490\n", - "9/16, train_loss: 0.0390\n", - "10/16, train_loss: 0.4432\n", - "11/16, train_loss: 0.2699\n", - "12/16, train_loss: 0.2080\n", - "13/16, train_loss: 0.1520\n", - "14/16, train_loss: 0.0959\n", - "15/16, train_loss: 0.2101\n", - "16/16, train_loss: 0.2057\n", - "epoch 25 average loss: 0.1931\n", - "train completed, best_metric: 0.9060 at epoch: 24\n" - ] - } - ], + "outputs": [], "source": [ "max_epochs = 25\n", "val_interval = 2\n", @@ -1443,23 +664,10 @@ }, { "cell_type": "code", - "execution_count": 24, + "execution_count": null, "id": "5cf1fd04", "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "plt.figure(\"train\", (12, 6))\n", "plt.subplot(1, 2, 1)\n", @@ -1498,21 +706,10 @@ }, { "cell_type": "code", - "execution_count": 25, + "execution_count": null, "id": "29441405", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 25, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "model.load_state_dict(torch.load('monai_data/best_metric_model_pretrained.pth'))" ] @@ -1527,7 +724,7 @@ }, { "cell_type": "code", - "execution_count": 26, + "execution_count": null, "id": "94615f38", "metadata": {}, "outputs": [], @@ -1552,33 +749,10 @@ }, { "cell_type": "code", - "execution_count": 27, + "execution_count": null, "id": "a3f78dd4", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 27, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "fig = plt.figure(frameon=False, figsize=(7,7))\n", "plt.title('Trained Calculated Spleen')\n", @@ -1587,33 +761,10 @@ }, { "cell_type": "code", - "execution_count": 28, + "execution_count": null, "id": "a67f89f2", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 28, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "iVBORw0KGgoAAAANSUhEUgAAAVcAAAGrCAYAAAB0YdR6AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjQuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8rg+JYAAAACXBIWXMAAAsTAAALEwEAmpwYAAAgSElEQVR4nO3de5hV9X3v8fdXLiM4qIA3ROINokBriXK8xlZPc4ol9ZC0TYtJKraemDSJp3liWo05UdKGE3tJ4h89SWpONGhMjEk0kjRpNMbGJ03VgAKKHAQUYYRwFeWiIMzv/LHW4AZmmOuPtTfzfj3PPLPWb92+a82az6z1W3vviZQSkqS+dVjVBUjSochwlaQMDFdJysBwlaQMDFdJysBwlaQMDNceioivRMSna8b/MiLWRsTWiBgZERdFxNJy/F0VlqoGFhFfj4jPHqRtpYgYezC21d1tRsQlEdFyMGrqK4ZrOyJiRUS8FhFbImJzRPwyIj4UEXuOV0rpQymlvyvnHwR8Afi9lFJzSmkj8LfAP5fj369kRzKJiJkR8Ub5h2NrRCyOiD/qxvIrIuIdOWvsC1F4PiKe7cYyMyPiGznrqkJE/HsZhL+1T/v3y/ZLqqmsfhmuHbs8pTQMOBm4Bbge+FoH8x4PHA4sqmk7eZ/xLouIgT1Z7iD7dvmHoxn4GPCNiDi+4pr62m8DxwGnRcR/qbqYOvAccGXbSESMBM4H1ldWUR0zXDuRUnolpTQH+FNgRkT8Brx5uxYRbwWWlLNvjoifRcRy4DTgB+WVXVNEHBURX4uINRHxUrnsgHJdV0XEf0TEFyNiEzCzXOafImJl2d3wlYgYUs5/SUS0RMR1EbGuXOeft9UcEUMi4vMR8WJEvBIRv6hZ9vzySnxzRCyoveIo63i+vGJ/ISLe18Vj9BNgC3B6zbr+ICLm11z5n1W23wW8pebY/E1EzI6I68rpo8sroQ+X42MjYlNExIHWW047MSK+FxHry/r/Z820mRFxb0TcWe7fooiY3MmuzQAeAH5UDu8RERMj4qGytrURcWNEXAbcCPxpuW8Lynn3ulLf9+o2Ir4TEb8uf1aPRsTErhz3iDi9PN82RsSGiLg7Io6umb4iIj4REQvLdX87Ig6vmf7X5bmzOiL+ogubvLvctwHl+BXA/cDOmnU2RcSt5TpXl8NNXdnmgc75hpRS8mufL2AF8I522lcCf1kOfx34bDl8CpCAgR2tA/g+8C/AERRXQ08AHyynXQXsAq4FBgJDgFuBOcAIYBjwA+Bz5fyXlPP/LTAImApsB4aX0/8P8O/AaGAAcCHQVI5vLOc/DPhv5fixZV2vAmeU6xgFTOzg+MwEvlEOB/BOYDNwdNl2NrAOOK/c/ozyeDR1cGz+AvhBOfxeYDnFlXHbtAc6W2+5P/OAm4DBFH/cngem1NT8ernvA4DPAY8d4BwYWh6PqcAfARuAweW0
YcAa4DqKO5ZhwHn7HpsDnAt7zVPu47ByP24F5tdM+zrledZOjWPLn2FT+TN8FLh1n+0+AZxIcR4tBj5UTrsMWAv8Rvmz/ybFOTy2g239O/A/gAeB3y/bngAuAFqAS8q2vwUeozjHjwV+CfxdV7ZJ5+d8S9XZ0K0cqbqAevza95ehpv0x4FPl8J6Tnk7ClaLbYAcwpGb6FcAj5fBVwMqaaQFsA06vabsAeKEcvgR4bZ/traO4RTusnPZb7dR/PXDXPm0/oQipIygC8o9q6+zg+MykuFrZTBHqu4G/qZn+5bZfqJq2JcDvtHd8Ka54N5e1fwX4YNsvEjAb+Hhn66UI3JX7TPskcEdNzT+tmTYBeO0A+/h+itvdgRThtRl4d83P7qkDHJtuhes+8x5dnktH7XuedeG8fVdtXeV2318z/g/AV8rh24Fbaqa9la6F6/uBbwFnAM+V02rDdTkwtWa5KcCKzrZJ1875hgpXuwW6ZzSwqQfLnUxxhbmmvJ3dTHEVe1zNPKtqho+luHKaVzP/v5XtbTamlHbVjG8HmoFjKK6mlndQx3va1lmu9+3AqJTSNoqujw+Vdf5rRJx5gH26N6V0dEppKEU4XhkRH6zZznX7bGcMxRXUflJKy4GtwCTgYuCHwOqIOIMiOH/ehfWeDJy4z7QbKf6wtfn1Psfr8Oi4f3tGuY+7Uko7gPt4s2tgDO0f326LiAERcUtELI+IVykCEYqfY2fLHhcR90TRzfQq8I12ltt3n5vL4RPZ+5x7sYsl3wf8V4q7rLvamX7iPut6kTd/7gfaZlfO+YbSCA9O6kIUDzRGA7/oweKrKK5cj9knEGvVfjzZBoqrz4kppZe6ua0NFLe/pwML2qnjrpTSB9otoOg7/UnZz/VZ4KsUYXdAKaUVEfFj4HKKPxqrgFkppVkdLdJO28+BP6a49X4pIn5O8fBkODC/pv521xsRbVc54zqrtzMRcRJFgJwbb74KYihFGB9T1nFFB4u3t2/byuXbnFAz/F5gGvAOimA9CniZ4kquM58rt3dWSmljFC/5++cuLAdFt8aYmvG3dGWhlNL28mf9l9T0sddYzd4Pc99StnW2zd6c83XJK9dORMSREfEHwD0Ut3JPd3cdKaU1FH1Vny/Xd1j5MOJ3Opi/lSLYvhgRx5V1jI6IKV3YVivF7dcXygc8AyLigvKhwjeAyyNiStl+eBQPx06KiOMj4r9HxBEUfwi2Utzud6oMo8t48xfqq8CHIuK8KBwREe+MiGHl9LUUfaK1fg58lKLfEIrb0GuBX6SU2uo40HqfAF6NiOujeKA3ICJ+I3r2lP/PKJ6Mn0FxNT2J4ha2hSJUfwicEBEfKx/CDIuI82r27ZSoedkexR+H6RExKIqHaH9cM20YxfHeSBHA/7sbdQ6j+DltjojRwF93Y9l7gasiYkJEDAVu7sayN1J08axoZ9q3gP8VEceWf4huojjvDrjN3pzz9cpw7dgPImILxVXKpyhex/rnB17kgK6keNDyLMWVyXcpHhp15HpgGfBYecv3U4pf9q74BPA08CuKboy/Bw5LKa2iuEq6kaI/cRXFL+Rh5dd1FFcZmyhuxz98gG20PRHfWm7nP4DPAKSU5gIfoLiKerncj6tqlv0cxS/g5oj4RNn2c4qwaAvXX1CETdv4AddbBvDlFEH4AsWV0P+luBLsrhnAl1JKv679ougPnpFS2kLxIOlyitvupcCl5bLfKb9vjIgny+FPU1zlvVweo2/WbOtOitvjlyjOjce6UednKB7yvQL8K8Ute5eklH5M8QDpZxTH8WfdWHZ1SqmjO7jPAnOBhRTn4JNlW1e22Ztzvu5E2VksSepDXrlKUgaGqyRlkC1cI+KyiFgSEcsi4oZc25GkepSlzzWKt8c9R9Hp30LxwOOKlFKXPwBDkhpZrte5ngssSyk9DxAR91A8pW43XJubm9PIkSMzlSJJeWzcuJGtW7e2+5rkXOE6mr3fidFC8fbEPSLiGuAa
gBEjRnDDDfYcSGost9xyS4fTcvW5tpfke/U/pJRuSylNTilNbm5ubmd2SWpcucK1hb3f5nYSb74FTpIOebnC9VfAuIg4NSIGA9MpPkpMkvqFLH2uKaVdEfFRio+zGwDcnlLq0afyS1IjyvapWCmlH1F8grsk9Tu+Q0uSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSmDgVUXIPXGpZdeyvDhw7nvvvt6vI65c+fywx/+EIBrr72WkSNH9lV56scMV9W9IUOGcNFFF+0Z3759O7/85S8BWLt2Ldu2bevV+o899lguvvhioAja008/nbFjx/ZqnZLhqro3ePBgRo8evWe8NkyfffbZXq//5JNP5uSTTwZg5syZbNmyhaOOOopjjz221+tW/2W4qu698sorzJ49+6BsKyL42c9+xqJFi7j55psPyjZ1aDJcVbnBgwfzvve9j4jgpZde4ic/+UlltXz84x/nkUce4amnnqqsBh0aDFdV5vzzzyelxNy5c1m8eDERwaZNmyqtadiwYQwdOrTSGnRoMFxVmSOPPJLW1lZ2797NY489VnU5Up8yXHVQRAQppb3aHnzwwYqqkfLr1ZsIImJFRDwdEfMjYm7ZNiIiHoqIpeX34X1TqhpVRHDVVVfxtre9repSpIOmL96hdWlKaVJKaXI5fgPwcEppHPBwOa5+aPLkybzjHe8gpcSTTz7JqlWrqi6pSyZOnMif/MmfVF2GGlyOt79OA9peNzMbeFeGbaiOpZRYtmwZr732GgMHFj1PCxYsYMOGDRVX1jWjRo1i/PjxVZehBtfbPtcEPBgRCfiXlNJtwPEppTUAKaU1EXFcb4tU45k9ezbr16/nggsuqLoUqRK9DdeLUkqrywB9KCL+X1cXjIhrgGsARowY0csyVG9uvvlmDjvMzwVS/9Wrsz+ltLr8vg64HzgXWBsRowDK7+s6WPa2lNLklNLk5ubm3pShOnD55Zdz9tlnA8UDrIEDBxqu6td6fPZHxBERMaxtGPg94BlgDjCjnG0G8EBvi1T927lzJ7t27aq6DKlu9KZb4Hjg/ohoW883U0r/FhG/Au6NiKuBlcB7el+m6l2Vb1mV6lGPwzWl9DzwW+20bwR+tzdFSVKjs1NMkjIwXCUpA8NVXTZ48GCOOuqoqsvoE62trbzyyivs3r276lJ0iDJc1WWTJk3iPe85NJ5Pbt++nRtvvJE1a9ZUXYoOUX4qlrrsmWee4cUXX6y6jD4xdOhQPv3pT/vPCJWN4aou2759O9u3b6+6jF6ZN28eQ4YMYcKECZxwwglVl6NDmN0COqDW1lbWr1/PG2+8UXUpPfb666+zfv16UkosWrSIFStWVF2S+gHDVQe0c+dOPvOZz7By5cqqS+mRtk/omjVrFq2trVx55ZVMnTq16rLUD9gtoA5NnDiRyZMns3Pnzob9v1Lf/va32blzJzNnzvSzDnRQGa7q0Pr163nuuec46qij9vsXLY3irLPOAuDoo4+uthD1O4ar2rV+/Xo2bNjAunXtfqhZw5gwYULVJaif8j5Je0kp0drayn333cecOXOqLkdqWF65ai+7d+/mpptuYvr06bz1rW+tuhypYXnlqv1s27aNgQMHcvjhh1ddSrdMnTqVsWPHVl2GBBiu2kdEcMYZZzTkqwOGDh3K4MGDqy5DAuwW0D6ampr48Ic/XHUZPfLd73636hKkPbxy1R5nnnkmV111FYMGDaq6FKnhGa7ao6Wlhf/8z/9sqP+F
dd555zFlypSqy5D2Y7eA9ti6dSuLFi2quoxuaW1tpbW1teoypP0Yrv1cSokdO3YwaNAgBgwYUHU53dLU1MSTTz7pB16rLtkt0M+llLjppptYsGBB1aV023vf+17OO++8qsuQ2uWVaz/261//mrvvvpurr76aE088sepyui0iKP+1u1R3DNd+6oUXXqClpYUxY8Zwyimn0NTUVHVJ3dbS0sLGjRurLkNql+HaT82bN4/ly5dz/fXXV11Kjz344INVlyB1yD5XScrAcJWkDOwW6KfGjx/fkA+xpEZhuPYzKSVefvllTjvtNIYMGVJ1Od0yfPhwduzY0fD/gVb9g90C/dA//uM/8tRTT1VdRrdNmzaN888/v+oypC7xylUN44EHHmDHjh1VlyF1ieGqhvHyyy9XXYLUZXYL9EMnnnhiQ34YttRIvHLtZyKCa6+9tuoypEOeV66SlIHhqroyduxYLrvssqrLkHrNcFXdOOmkkzjhhBMYNmxY1aVIvWafq+rGxRdfzNatW/nOd75TdSlSr3nlKkkZeOWqHpkyZQrNzc17xh9++GE2b97cq3XOmzfPNwnokGG49jMpJZYuXcrIkSMZOXJkp/MPGDCA8ePH79fe1NREc3PzntfLjh8/ni1btvDGG2+wZMmSHtX23HPPATBw4EDOPPNMli9fzmuvvdajdUlVM1z7oTvuuIPLL7+cCy+8sMN5Bg4sTo0BAwZwwQUX7PfvVH784x/zlre8hYkTJwIwadIkALZv397jcG3T1NTEhRdeyCuvvMKqVat6tS6pKoar2jVt2jRaW1u5//77ueOOO/abvmvXLlpaWnj88ccBeP/7399nn7K1bds2br/9dnbt2tUn65OqYLj2Qx/4wAdYvHgxt956617t48aN453vfCcAjz/+OCklgA5DLqW0Z9qjjz7KWWedxXHHHccVV1zBnDlz2LZtW49rNFjV6AzXfiYiOO2003j55Zf3+yCU2gdULS0t3Vrv888/z7Bhw/YE8m/+5m/y4osvsmbNmt4XLTUgw7WfOuecczjnnHP6dJ0LFixgwYIFAFx55ZUMHDiQzZs3+1BK/ZKvc1UWd955J0ceeSR/+Id/WHUpUiW8clU2jz32GE1NTUQE06dPZ+HChSxatKjqsqSDwitXZbNp0ybWrFlDSon169ezbds2RowYwYUXXsiAAQOqLk/KynDVQfHTn/6UFStWMHLkSM444wyOOeYYBg0aVHVZUjaGqw6qpUuXctdddzFt2jTOPPPMqsuRsrHPVQfdrl27+P73v88FF1zAOeecw+7du/nmN7/J7t27qy5N6jOGqyqxbt06Vq5cyauvvgrApZdeuqd94cKFe8372muv8eijj/L2t7+dI4444qDXKvWE4arKPPXUUwAMHjyY6dOnExEMHjyYDRs27DXfxo0befzxx5k8ebLhqoZhuKpyO3fu5M477wTg3HPPZdq0aXtNf/311/0oQjUcw1V1Zf78+Xs+erBNa2trRdVIPWe4qq7s3LmTnTt3Vl2G1Gu+FEuSMug0XCPi9ohYFxHP1LSNiIiHImJp+X14zbRPRsSyiFgSEVNyFS5J9awrV65fB/b9R/I3AA+nlMYBD5fjRMQEYDowsVzmSxHh+xwl9TudhmtK6VFg0z7N04DZ5fBs4F017feklHaklF4AlgHn9k2pktQ4etrnenxKaQ1A+f24sn00UPtPj1rKtv1ExDURMTci5m7durWHZUhSferrB1rRTltqb8aU0m0ppckppcm1n4AvSYeCnobr2ogYBVB+X1e2twBjauY7CVjd8/IkqTH1NFznADPK4RnAAzXt0yOiKSJOBcYBT/SuRElqPJ2+iSAivgVcAhwTES3AzcAtwL0RcTWwEngPQEppUUTcCzwL7AI+klLyo44k9TudhmtK6YoOJv1uB/PPAmb1pihJanS+Q0uSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WS
MjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMug0XCPi9ohYFxHP1LTNjIiXImJ++TW1ZtonI2JZRCyJiCm5CpeketaVK9evA5e10/7FlNKk8utHABExAZgOTCyX+VJEDOirYiWpUXQarimlR4FNXVzfNOCelNKOlNILwDLg3F7UJ0kNqTd9rh+NiIVlt8Hwsm00sKpmnpaybT8RcU1EzI2IuVu3bu1FGZJUf3oarl8GTgcmAWuAz5ft0c68qb0VpJRuSylNTilNbm5u7mEZklSfehSuKaW1KaXdKaVW4Ku8eevfAoypmfUkYHXvSpSkxtOjcI2IUTWj7wbaXkkwB5geEU0RcSowDniidyVKUuMZ2NkMEfEt4BLgmIhoAW4GLomISRS3/CuADwKklBZFxL3As8Au4CMppd1ZKpekOtZpuKaUrmin+WsHmH8WMKs3RUlSo/MdWpKUgeEqSRkYrpKUgeEqSRkYrpKUgeEqSRkYrpKUgeEqSRkYrpKUgeEqSRkYrpKUgeEqSRkYrpKUgeEqSRkYrpKUgeEqSRkYrpKUgeEqSRkYrpKUgeEqSRkYrpKUgeEqSRkYrpKUgeEqSRkYrpKUgeEqSRkYrpKUgeEqSRkYrpKUgeEqSRkYrpKUgeEqSRkYrpKUgeEqSRkYrpKUgeEqSRkYrpKUgeEqSRkYrpKUgeEqSRkYrpKUgeEqSRkYrpKUgeEqSRkYrpKUgeEqSRkYrpKUgeEqSRkYrpKUgeEqSRkYrpKUgeEqSRkYrpKUgeEqSRkYrpKUgeEqSRl0Gq4RMSYiHomIxRGxKCL+qmwfEREPRcTS8vvwmmU+GRHLImJJREzJuQOSVI+6cuW6C7gupTQeOB/4SERMAG4AHk4pjQMeLscpp00HJgKXAV+KiAE5ipeketVpuKaU1qSUniyHtwCLgdHANGB2Odts4F3l8DTgnpTSjpTSC8Ay4Nw+rluS6lq3+lwj4hTgbcDjwPEppTVQBDBwXDnbaGBVzWItZdu+67omIuZGxNytW7f2oHRJql9dDteIaAa+B3wspfTqgWZtpy3t15DSbSmlySmlyc3NzV0tQ5IaQpfCNSIGUQTr3Sml+8rmtRExqpw+ClhXtrcAY2oWPwlY3TflSlJj6MqrBQL4GrA4pfSFmklzgBnl8AzggZr26RHRFBGnAuOAJ/quZEmqfwO7MM9FwJ8BT0fE/LLtRuAW4N6IuBpYCbwHIKW0KCLuBZ6leKXBR1JKu/u6cEmqZ52Ga0rpF7Tfjwrwux0sMwuY1Yu6JKmh+Q4tScrAcJWkDAxXScrAcJWkDAxXScrAcJWkDAxXScrAcJWkDAxXScrAcJWkDAxXScrAcJWkDAxXScrAcJWkDAxXScrAcJWkDAxXScrAcJWkDAxXScrAcJWkDAxXScrAcJWkDAxXScrAcJWkDAxXScrAcJWkDAxXScrAcJWkDAxXScrAcJWkDAxXScrAcJWkDAxXScrAcJWkDAxXScrAcJWkDAxXScrAcJWkDAxXScrAcJWkDAxXScrAcJWkDAxXScrAcJWkDAxXScrAcJWkDAxXScrAcJWkDAxXScrAcJWkDAxXScrAcJWkDAxXScrAcJWkDAxXScrAcJWkDAxXScqg03CNiDER8UhELI6IRRHxV2X7zIh4KSLml19Ta5b5ZEQsi4glETEl5w5IUj0a2IV5dgHXpZSejIhhwLyIeKic9sWU0j/VzhwRE4DpwETgROCnEfHWlNLuvixckupZp1euKaU1
KaUny+EtwGJg9AEWmQbck1LakVJ6AVgGnNsXxUpSo+hWn2tEnAK8DXi8bPpoRCyMiNsjYnjZNhpYVbNYC+2EcURcExFzI2Lu1q1bu1+5JNWxLodrRDQD3wM+llJ6FfgycDowCVgDfL5t1nYWT/s1pHRbSmlySmlyc3Nzd+uWpLrWpXCNiEEUwXp3Suk+gJTS2pTS7pRSK/BV3rz1bwHG1Cx+ErC670qWpPrXlVcLBPA1YHFK6Qs17aNqZns38Ew5PAeYHhFNEXEqMA54ou9KlqT615VXC1wE/BnwdETML9tuBK6IiEkUt/wrgA8CpJQWRcS9wLMUrzT4iK8UkNTfdBquKaVf0H4/6o8OsMwsYFYv6pKkhuY7tCQpA8NVkjIwXCUpA8NVkjIwXCUpA8NVkjIwXCUpA8NVkjIwXCUpA8NVkjIwXCUpA8NVkjIwXCUpA8NVkjIwXCUpA8NVkjIwXCUpA8NVkjIwXCUpA8NVkjIwXCUpA8NVkjIwXCUpA8NVkjIwXCUpA8NVkjIwXCUpA8NVkjIwXCUpA8NVkjIwXCUpA8NVkjIwXCUpA8NVkjIwXCUpA8NVkjIwXCUpA8NVkjIwXCUpA8NVkjIwXCUpA8NVkjIwXCUpA8NVkjIwXCUpA8NVkjKIlFLVNRAR64FtwIaqa6kDx+Bx8BgUPA71fwxOTikd296EughXgIiYm1KaXHUdVfM4eAzaeBwa+xjYLSBJGRiukpRBPYXrbVUXUCc8Dh6DNh6HBj4GddPnKkmHknq6cpWkQ4bhKkkZVB6uEXFZRCyJiGURcUPV9RxMEbEiIp6OiPkRMbdsGxERD0XE0vL78Krr7GsRcXtErIuIZ2raOtzviPhkeX4siYgp1VTdtzo4BjMj4qXyfJgfEVNrph1yxwAgIsZExCMRsTgiFkXEX5XtjX8+pJQq+wIGAMuB04DBwAJgQpU1HeT9XwEcs0/bPwA3lMM3AH9fdZ0Z9vu3gbOBZzrbb2BCeV40AaeW58uAqvch0zGYCXyinXkPyWNQ7tso4OxyeBjwXLm/DX8+VH3lei6wLKX0fEppJ3APMK3imqo2DZhdDs8G3lVdKXmklB4FNu3T3NF+TwPuSSntSCm9ACyjOG8aWgfHoCOH5DEASCmtSSk9WQ5vARYDozkEzoeqw3U0sKpmvKVs6y8S8GBEzIuIa8q241NKa6A48YDjKqvu4Opov/vbOfLRiFhYdhu03Qr3i2MQEacAbwMe5xA4H6oO12inrT+9NuyilNLZwO8DH4mI3666oDrUn86RLwOnA5OANcDny/ZD/hhERDPwPeBjKaVXDzRrO211eSyqDtcWYEzN+EnA6opqOehSSqvL7+uA+ylub9ZGxCiA8vu66io8qDra735zjqSU1qaUdqeUWoGv8ubt7iF9DCJiEEWw3p1Suq9sbvjzoepw/RUwLiJOjYjBwHRgTsU1HRQRcUREDGsbBn4PeIZi/2eUs80AHqimwoOuo/2eA0yPiKaIOBUYBzxRQX3ZtYVJ6d0U5wMcwscgIgL4GrA4pfSFmkmNfz5U/UQNmErxhHA58Kmq6zmI+30axVPPBcCitn0HRgIPA0vL7yOqrjXDvn+L4rb3DYorkasPtN/Ap8rzYwnw+1XXn/EY3AU8DSykCJFRh/IxKPfr7RS39QuB+eXX1EPhfPDtr5KUQdXdApJ0SDJcJSkDw1WSMjBcJSkDw1WSMjBcJSkDw1WSMvj/65Pouj2fyP8AAAAASUVORK5CYII=\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "fig = plt.figure(frameon=False, figsize=(7,7))\n", "plt.title('Differences Between Actual and Model')\n", @@ -1623,33 +774,10 @@ }, { "cell_type": "code", - "execution_count": 29, + "execution_count": null, "id": "382c7285", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 29, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "fig = plt.figure(frameon=False, figsize=(7,7))\n", "plt.title('Differences Between The Models')\n", @@ -1675,33 +803,10 @@ }, { "cell_type": "code", - "execution_count": 30, + "execution_count": null, "id": "91e83d40", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 30, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "maskedspleen = np.ma.masked_where(test_outputsSpl[0].cpu().numpy()[1][:,:,200] == 0, test_outputsSpl[0].cpu().numpy()[1][:,:,200])\n", "fig = plt.figure(frameon=False, figsize=(10,10))\n", @@ -1729,7 +834,7 @@ }, { "cell_type": "code", - "execution_count": 31, + "execution_count": null, "id": "657e44a0", "metadata": {}, "outputs": [], @@ -1748,28 +853,10 @@ }, { "cell_type": "code", - "execution_count": 32, + "execution_count": null, "id": "a6fb0da7", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "2022-04-27 15:06:54,404 - INFO - Expected md5 is None, skip md5 check for file monai_data/clara_pt_liver_and_tumor_ct_segmentation_1.zip.\n", - "2022-04-27 15:06:54,405 - INFO - File exists: monai_data/clara_pt_liver_and_tumor_ct_segmentation_1.zip, skipped downloading.\n", - "2022-04-27 15:06:54,425 - INFO - Non-empty folder exists in monai_data/clara_pt_liver_and_tumor_ct_segmentation, skipped extracting.\n", - "2022-04-27 15:06:54,426 - INFO - \n", - "*** \"clara_pt_liver_and_tumor_ct_segmentation\" available at monai_data/clara_pt_liver_and_tumor_ct_segmentation.\n", - "2022-04-27 15:06:54,889 - INFO - *** Model: \n", - "2022-04-27 15:06:54,938 - INFO - *** Model params: {'dimensions': 3, 'in_channels': 1, 'out_channels': 3, 'channels': [16, 32, 64, 128, 256], 'strides': [2, 2, 2, 2], 'num_res_units': 2, 'norm': 'batch'}\n", - "2022-04-27 15:06:54,950 - INFO - \n", - "---\n", - "2022-04-27 15:06:54,951 - INFO - For more information, please visit https://ngc.nvidia.com/catalog/models/nvidia:med:clara_pt_liver_and_tumor_ct_segmentation\n", - "\n" - ] - } - ], + "outputs": [], "source": [ " try: #MONAI=0.8\n", " unet_model = load_from_mmar(\n", @@ -1789,29 +876,10 @@ }, { "cell_type": "code", - "execution_count": 33, + "execution_count": null, "id": "55034354", "metadata": {}, - "outputs": [ 
- { - "name": "stdout", - "output_type": "stream", - "text": [ - "using a pretrained model.\n", - "2022-04-27 15:06:55,931 - INFO - Expected md5 is None, skip md5 check for file monai_data/clara_pt_liver_and_tumor_ct_segmentation_1.zip.\n", - "2022-04-27 15:06:55,931 - INFO - File exists: monai_data/clara_pt_liver_and_tumor_ct_segmentation_1.zip, skipped downloading.\n", - "2022-04-27 15:06:55,932 - INFO - Non-empty folder exists in monai_data/clara_pt_liver_and_tumor_ct_segmentation, skipped extracting.\n", - "2022-04-27 15:06:55,933 - INFO - \n", - "*** \"clara_pt_liver_and_tumor_ct_segmentation\" available at monai_data/clara_pt_liver_and_tumor_ct_segmentation.\n", - "2022-04-27 15:06:55,962 - INFO - *** Model: \n", - "2022-04-27 15:06:56,010 - INFO - *** Model params: {'dimensions': 3, 'in_channels': 1, 'out_channels': 3, 'channels': [16, 32, 64, 128, 256], 'strides': [2, 2, 2, 2], 'num_res_units': 2, 'norm': 'batch'}\n", - "2022-04-27 15:06:56,023 - INFO - \n", - "---\n", - "2022-04-27 15:06:56,024 - INFO - For more information, please visit https://ngc.nvidia.com/catalog/models/nvidia:med:clara_pt_liver_and_tumor_ct_segmentation\n", - "\n" - ] - } - ], + "outputs": [], "source": [ "device = torch.device(\"cuda:0\" if torch.cuda.is_available() else \"cpu\")\n", "\n", @@ -1834,7 +902,7 @@ }, { "cell_type": "code", - "execution_count": 34, + "execution_count": null, "id": "a79c1731", "metadata": {}, "outputs": [], @@ -1860,33 +928,10 @@ }, { "cell_type": "code", - "execution_count": 35, + "execution_count": null, "id": "c0956706", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 35, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "sliceval = 215\n", "maskedliv = np.ma.masked_where(test_outputsliv[0].cpu().numpy()[1][:,:,sliceval] == 0, test_outputsliv[0].cpu().numpy()[1][:,:,sliceval])\n", @@ -1900,33 +945,10 @@ }, { "cell_type": "code", - "execution_count": 36, + "execution_count": null, "id": "5bdfdbe9", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 36, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "sliceval = 110\n", "maskedliv = np.ma.masked_where(test_outputsliv[0].cpu().numpy()[1][:,sliceval,:] == 0, test_outputsliv[0].cpu().numpy()[1][:,sliceval,:])\n", @@ -1940,36 +962,28 @@ }, { "cell_type": "markdown", - "id": "af1169b6", + "id": "51e2dc2b", "metadata": {}, "source": [ - "#### Continue including more models found at the NGC Catalog: \n", - "#### https://catalog.ngc.nvidia.com/models\n", - "##### - Recommend filtering by 'CT' " + "## Conclusions\n", + "Here you learned how to segment medical images using an NVIDIA pretrained model. To further explore NVIDIA models, search the [NGC Catalog](https://catalog.ngc.nvidia.com/models) for 'CT'." ] }, { - "cell_type": "code", - "execution_count": null, - "id": "0dce4d55", - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e17e6228", + "cell_type": "markdown", + "id": "af3e7ddb", "metadata": {}, - "outputs": [], - "source": [] + "source": [ + "## Clean up\n", + "Make sure you shut down this VM, or delete it if you don't plan to use it further.\n", + "\n", + "You can also [delete the buckets](https://docs.aws.amazon.com/AmazonS3/latest/userguide/delete-bucket.html) if you don't want to pay for the data: `aws s3 rb s3://bucket-name --force`" + ] }, { - "cell_type": "code", - "execution_count": null, - "id": "7034135a", + "cell_type": "markdown", + "id": "e0b85a97", "metadata": {}, - "outputs": [], "source": [] } ], diff --git a/notebooks/pangolin/pangolin_pipeline.ipynb b/notebooks/pangolin/pangolin_pipeline.ipynb index 4ce0d5b..85ba27f 100644 --- a/notebooks/pangolin/pangolin_pipeline.ipynb +++ b/notebooks/pangolin/pangolin_pipeline.ipynb @@ -10,10 +10,44 @@ }, { "cell_type": "markdown", - "id": "56a29212", + "id": "22f95828", "metadata": {}, "source": [ - "We are going to run a standard covid bioinformatics pipeline
using the Pangolin workflow. https://cov-lineages.org/resources/pangolin/usage.html" + "## Overview" + ] + }, + { + "cell_type": "markdown", + "id": "25e25f08", + "metadata": {}, + "source": [ + "We are going to run a standard covid bioinformatics pipeline using the [Pangolin workflow](https://cov-lineages.org/resources/pangolin/usage.html). We will run the whole analysis within this notebook environment." + ] + }, + { + "cell_type": "markdown", + "id": "0f67dfae", + "metadata": {}, + "source": [ + "## Learning Objectives\n", + "Learn how to run a simple bioinformatic workflow within a Jupyter notebook environment on AWS." + ] + }, + { + "cell_type": "markdown", + "id": "e7a574ce", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "+ You only need access to a Sagemaker environment to run this notebook" + ] + }, + { + "cell_type": "markdown", + "id": "2881a142", + "metadata": {}, + "source": [ + "## Get Started" ] }, { @@ -21,7 +55,7 @@ "id": "03541941", "metadata": {}, "source": [ - "### Required software" + "### Install packages and set up environment" ] }, { @@ -33,8 +67,10 @@ }, "outputs": [], "source": [ - "#change this depending on how many threads are available in your notebook\n", - "CPU=4" + "import os\n", + "\n", + "CPU = os.cpu_count()\n", + "print(f\"Number of threads available: {CPU}\")\n" ] }, { @@ -46,10 +82,34 @@ }, "outputs": [], "source": [ - "#install biopython to import packages below\n", + "# install biopython to import packages below\n", "! pip install biopython" ] }, + { + "cell_type": "code", + "execution_count": null, + "id": "70810fd5", + "metadata": {}, + "outputs": [], + "source": [ + "# install mamba\n", + "! curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh\n", + "! 
bash Mambaforge-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3379e83c", + "metadata": {}, + "outputs": [], + "source": [ + "# add to your path\n", + "import os\n", + "os.environ[\"PATH\"] += os.pathsep + os.environ[\"HOME\"]+\"/mambaforge/bin\"" + ] + }, { "cell_type": "code", "execution_count": null, @@ -59,6 +119,7 @@ }, "outputs": [], "source": [ + "# install everything else\n", "! mamba install -y -c conda-forge -c bioconda sra-tools pangolin iqtree" ] }, @@ -96,6 +157,7 @@ "source": [ "if not os.path.exists('pangolin_analysis'):\n", " os.mkdir('pangolin_analysis')\n", + "\n", "os.chdir('pangolin_analysis')" ] }, @@ -110,8 +172,9 @@ "source": [ "if os.path.exists('sarscov2_sequences.fasta'):\n", " os.remove('sarscov2_sequences.fasta')\n", - "!rm sarscov2_*\n", - "!rm lineage_report.csv" + "\n", + "! rm sarscov2_*\n", + "! rm lineage_report.csv" ] }, { @@ -133,6 +196,7 @@ "source": [ "#give a list of accession number for sars sequences\n", "acc_nums=['NC_045512','LR757995','LR757996','OL698718','OL677199','OL672836','MZ914912','MZ916499','MZ908464','MW580573','MW580574','MW580576','MW991906','MW931310','MW932027','MW424864','MW453109','MW453110']\n", + "\n", "print('the number of sequences we will analyze = ',len(acc_nums))" ] }, @@ -153,10 +217,12 @@ }, "outputs": [], "source": [ - "#use the bio.entrez toolkit within biopython to download the accession numbers\n", - "#save those sequences to a single fasta file\n", + "# use the bio.entrez toolkit within biopython to download the accession numbers\n", + "# save those sequences to a single fasta file\n", "Entrez.email = \"email@example.com\" # ell NCBI who you are\n", + "\n", "filename = \"sarscov2_seqs.fasta\"\n", + "\n", "if not os.path.isfile(filename):\n", " # Downloading...\n", " for acc in acc_nums:\n", @@ -179,8 +245,10 @@ }, "outputs": [], "source": [ - "#make sure our fasta file has the same number of seqs as the acc_nums list\n", 
+ "# make sure our fasta file has the same number of seqs as the acc_nums list\n", + "\n", "print('the number of seqs in our fasta file: ')\n", + "\n", "! grep '>' sarscov2_seqs.fasta | wc -l" ] }, @@ -193,7 +261,7 @@ }, "outputs": [], "source": [ - "#let's peek at our new fasta file\n", + "# let's peek at our new fasta file\n", "! head sarscov2_seqs.fasta" ] }, @@ -205,7 +273,7 @@ }, "source": [ "### Run pangolin to identify lineages and output alignment\n", - "Here we call pangolin, give it our input sequences and the number of threads. We also tell it to output the alignment. The full list of pangolin parameters can be found in the [docs](https://cov-lineages.org/resources/pangolin/usage.html)." + "Here we call pangolin, give it our input sequences and the number of threads. We also tell it to output the alignment. The full list of pangolin parameters can be found in the pangolin [docs](https://cov-lineages.org/resources/pangolin/usage.html)." ] }, { @@ -246,7 +314,7 @@ }, "outputs": [], "source": [ - "#run iqtree with threads = $CPU variable.\n", + "# run iqtree with threads = $CPU variable.\n", "# if you exclude the -m it will do a phylogenetic model search before tree search\n", "! iqtree -s sequences.aln.fasta -nt $CPU -m HKY --prefix sarscov2_tree --redo-tree" ] @@ -256,7 +324,16 @@ "id": "c7197dd4", "metadata": {}, "source": [ - "### Download the tree and view in tree viewer like FigTree! " + "### Download the tree and view in tree viewer like [FigTree](http://tree.bio.ed.ac.uk/software/figtree/)! " + ] + }, + { + "cell_type": "markdown", + "id": "7a5b8a1b", + "metadata": {}, + "source": [ + "### Conclusions\n", + "That's it! Now you know how to run a simple workflow using a Sagemaker notebook environment" ] }, { @@ -264,15 +341,16 @@ "id": "88457512", "metadata": {}, "source": [ - "And that is all! 
You now know how to run workflows in notebooks in Cloud Lab" + "## Clean up\n", + "Make sure you shut down this VM, or delete it if you don't plan to use it further.\n", + "\n", + "You can also [delete the buckets](https://docs.aws.amazon.com/AmazonS3/latest/userguide/delete-bucket.html) if you don't want to pay for the data: `aws s3 rb s3://bucket-name --force`" ] }, { - "cell_type": "code", - "execution_count": null, - "id": "05b918cf-e78a-4691-aeaf-8a85c6e6d831", + "cell_type": "markdown", + "id": "4eb73656", "metadata": {}, - "outputs": [], "source": [] } ], @@ -284,9 +362,9 @@ "uri": "gcr.io/deeplearning-platform-release/r-cpu.4-1:m87" }, "kernelspec": { - "display_name": "conda_python3", + "display_name": "Python 3", "language": "python", - "name": "conda_python3" + "name": "python3" }, "language_info": { "codemirror_mode": { @@ -298,7 +376,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.12" + "version": "3.9.6" } }, "nbformat": 4, diff --git a/notebooks/rnaseq-myco-tutorial-main/RNAseq_pipeline.ipynb b/notebooks/rnaseq-myco-tutorial-main/RNAseq_pipeline.ipynb index e7ea5c0..6288040 100644 --- a/notebooks/rnaseq-myco-tutorial-main/RNAseq_pipeline.ipynb +++ b/notebooks/rnaseq-myco-tutorial-main/RNAseq_pipeline.ipynb @@ -20,6 +20,10 @@ "source": [ "This short tutorial demonstrates how to run an RNA-Seq workflow using a prokaryotic data set. Steps in the workflow include read trimming, read QC, read mapping, and counting mapped reads per gene to quantitate gene expression.\n", "\n", + "This tutorial uses example sequence data procured from the Sally Molloy laboratory at the University of Maine, which investigates the transcriptome changes in prophage-infected versus non-prophage-infected M. chelonae bacteria.
The respective article can be found [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8191103/).\n", + "\n", + "\n", + "\n", "![RNA-Seq workflow](images/rnaseq-workflow.png)" ] }, @@ -27,7 +31,30 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### STEP 1: Setup Environment" + "## Learning Objectives\n", + "Learn how to run a simple RNAseq analysis in a Sagemaker environment" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "+ You only need access to a SageMaker environment to run this notebook" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Get Started" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Set up environment and install packages" ] }, { @@ -62,6 +89,29 @@ "metadata": {}, "outputs": [], "source": [ + "# install mamba\n", + "! curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh\n", + "! bash Mambaforge-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# add to your path\n", + "import os\n", + "os.environ[\"PATH\"] += os.pathsep + os.environ[\"HOME\"]+\"/mambaforge/bin\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# install everything else\n", "! mamba install -c conda-forge -c bioconda -c defaults -y sra-tools pigz=2.6 pbzip2=1.1 trimmomatic=0.36 fastqc=0.11.9 multiqc=1.10.1 salmon=1.5.1 " ] }, @@ -78,21 +128,15 @@ "metadata": {}, "outputs": [], "source": [ - "%%bash\n", - "mkdir -p data\n", - "mkdir -p data/raw_fastq\n", - "mkdir -p data/trimmed\n", - "mkdir -p data/fastqc\n", - "mkdir -p data/aligned\n", - "mkdir -p data/reference" + "! 
mkdir -p data data/raw_fastq data/trimmed data/fastqc data/aligned data/reference" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### STEP 2: Copy FASTQ Files\n", - "In order for this tutorial to run quickly, we will only analyze 50,000 reads from a sample from both sample groups instead of analyzing all the reads from all six samples. These files have been posted on a Google Storage Bucket that we made publicly accessible." + "### Copy FASTQ Files\n", + "So that this tutorial runs quickly, we will only analyze 50,000 reads from one sample from two treatment groups instead of analyzing all the reads from all six samples. These files are hosted in a public Google storage bucket." ] }, { @@ -112,7 +156,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### STEP 3: Copy reference transcriptome files that will be used by Salmon\n", + "### Copy reference transcriptome files that will be used by Salmon\n", "Salmon is a tool that aligns RNA-Seq reads to a set of transcripts rather than the entire genome." ] }, @@ -130,7 +174,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### STEP 4: Copy data file for Trimmomatic" + "### Copy data file for Trimmomatic" ] }, { @@ -146,7 +190,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### STEP 5: Run Trimmomatic\n", + "### Run Trimmomatic\n", "Trimmomatic will trim off any adapter sequences or low quality sequence it detects in the FASTQ files." ] }, @@ -165,7 +209,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### STEP 6: Run FastQC\n", + "### Run FastQC\n", "FastQC is an invaluable tool that allows you to evaluate whether there are problems with a set of reads. For example, it will provide a report of whether there is any bias in the sequence composition of the reads." 
] }, @@ -191,8 +235,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### STEP 7: Run MultiQC\n", - "MultiQC reads in the FastQQ reports and generate a compiled report for all the analyzed FASTQ files." + "### Run MultiQC\n", + "MultiQC reads in the FastQC reports and generates a compiled report for all the analyzed FASTQ files." ] }, { @@ -217,7 +261,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### STEP 8: Index the Transcriptome so that Trimmed Reads Can Be Mapped Using Salmon" + "### Index the transcriptome so that trimmed reads can be mapped using Salmon" ] }, { @@ -233,7 +277,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### STEP 9: Run Salmon to Map Reads to Transcripts and Quantify Expression Levels\n", + "### Run Salmon to map reads to transcripts and quantify expression levels\n", "Salmon aligns the trimmed reads to the reference transcriptome and generates the read counts per transcript. In this analysis, each gene has a single transcript." ] }, @@ -252,7 +296,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### STEP 10: Report the top 10 most highly expressed genes in the samples" + "### Report the top 10 most highly expressed genes in the samples" ] }, { @@ -291,7 +335,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### STEP 11: Report the expression of a putative acyl-ACP desaturase (BB28_RS16545) that was downregulated in the double lysogen relative to wild-type\n", + "### Report the expression of a putative acyl-ACP desaturase (BB28_RS16545) that was downregulated in the double lysogen relative to wild-type\n", "A acyl-transferase was reported to be downregulated in the double lysogen as shown in the table of the top 20 upregulated and downregulated genes from the paper describing the study." ] }, @@ -333,8 +377,24 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### That's it! 
" + "## Conclusions\n", + "Here you worked through a simple RNAseq analysis within a Sagemaker environment. For more RNAseq examples, check out the [NIGMS Sandbox RNAseq module}(https://github.com/NIGMS/RNA-Seq-Differential-Expression-Analysis). " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Clean up\n", + "Make sure you shut down this VM, or delete it if you don't plan to use if further.\n", + "\n", + "You can also [delete the buckets](https://docs.aws.amazon.com/AmazonS3/latest/userguide/delete-bucket.html) if you don't want to pay for the data: `aws s3 rb s3://bucket-name --force`" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] } ], "metadata": { diff --git a/tutorials/README.md b/tutorials/README.md deleted file mode 100644 index 3be38d5..0000000 --- a/tutorials/README.md +++ /dev/null @@ -1,96 +0,0 @@ -# AWS Tutorial Resources - ---------------------------------- -## Overview of Page Contents - -+ [Biomedical Workflows on AWS](#bio) -+ [Artificial Intelligence](#ai) -+ [Clinical Informatics](#ci) -+ [Download SRA Data](#sra) -+ [GWAS](#gwas) -+ [Medical Imaging](#im) -+ [RNAseq](#rna) -+ [scRNAseq](#sc) -+ [BLAST](#bl) -+ [Protein Folding](#af) -+ [Long Read Sequencing Analysis](#long) -+ [Drug Discovery](#atom) -+ [CryoEM](#cryoem) -+ [Open Data](#open) - ---------------------------------- -## **Biomedical Workflows on AWS** - -There are a lot of ways to run workflows on AWS. Here we list a few possibilities each of which may work for different research aims. As you walk through the various tutorials below, think about how you could possibly run that workflow more efficiently using one of the other methods listed here. If you are unfamiliar with any of the terms or concepts here, please review the [AWS Jumpstart](https://cloud.nih.gov/resources/cloudlab/aws-jumpstart/) page. 
- -- The most simple is probably to spin up an EC2 instance, and run your command interactively, or using `screen` or, as a [startup script](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/user-data.html) attached as metadata. See the [GWAS tutorial](https://training.nih-cfde.org/en/latest/Bioinformatic-Analyses/GWAS-in-the-cloud) below for more info on how to run a pipeline using EC2. -- You could also run your pipeline via a SageMaker notebook, either by splitting out each command as a different block, or by running a workflow manager (Nextflow etc.). See [here](https://aws.amazon.com/blogs/machine-learning/scheduling-jupyter-notebooks-on-sagemaker-ephemeral-instances/) about scheduling a notebook to let it run longer. You can find some example notebooks in the [tutorials below](/notebooks/). -- If you are running bioinformatic workflows, you can leverage the serverless functionality of AWS using [Amazon HealthOmics](https://aws.amazon.com/healthomics/). Read [this blog](https://aws.amazon.com/blogs/industries/automated-end-to-end-genomics-data-storage-and-analysis-using-amazon-omics/) for more detailed information and also see if any new blogs have come out. If you want to get some hands on experience with HealthOmics using Cloud Lab, follow [this on-demand workshop](https://catalog.workshops.aws/amazon-omics-end-to-end/en-US/001-getting-started/010-self-directed-workshop) from Amazon! Since you already have an account set up, skip directly to the _Workshop_ section and then you can decide if you want to complete the tutorial via the console, the CLI, or via Notebooks. If you go the notebook route, just spin up a notebook via [Sagemaker](/docs/Jupyter_notebook.md). If you want to create a private workflow using Nextflow, you will need to migrate your containers to a private Amazon Elastic Container Registry (ECR). 
You can follow [this workshop](https://catalog.us-east-1.prod.workshops.aws/workshops/76d4a4ff-fe6f-436a-a1c2-f7ce44bc5d17/en-US) to learn how that process works. -- If you are using a workflow manager other than WDL, Nextflow, or CWL (e. g. Snakemake), use [AWS Genomics CLI](https://aws.amazon.com/genomics-cli/), which is a wrapper for genomics workflow managers and AWS Batch (serverless computing cluster). See our [docs](/docs/agc.md) on how to set up the AGC CLI for Cloud Lab. You can also just run Snakemake locally within a VM. See our [Pangolin tutorial](/notebooks/pangolin) for one example. -- Finally, one benefit of the cloud is access to GPUs for workflow acceleration. While a lot of focus on GPU implementation will focus on AI/ML workflows, NVIDIA has software called Parabricks that will accelerate genomic workflows for pretty low costs. See the full list of command line options [here](https://docs.nvidia.com/clara/parabricks/3.7.0/index.html) to see if your specific workflow is accelerated. The easiest way to run Parabricks right now is via AWS HealthOmics [Ready2Run workflows](https://docs.aws.amazon.com/omics/latest/dev/service-workflows.html), but to run it via EC2 see our [guide](/docs/parabricks.md). - -**For many of these tutorials, you will need Short Term Access Keys to create and use resources, particularly whenever a tutorial calls for "access key ID" and "secret key." Use [this guide](/docs/Intramural_STAKs.md) for an explanation of how to obtain and use Short Term Access Keys. If you are an NIH-affiliated researcher, in other words, you don't work at the NIH but have a Cloud Lab account, you will not have access to keys. If there is a tutorial you are unable to complete, reach out to us for help at CloudLab@nih.gov** - - **Please also note, GPU machines cost more than most CPU machines, so be sure to shut these machines down after use, or apply an EC2 [lifecycle configuration](/docs/auto-shutdown-instance.md). 
You may also encounter service quotas to protect you from the accidental use of expensive machine types. If that happens, and you still want to use a certain instance type, follow these [instructions](/docs/service_quotas.md).** - -## **Artificial Intelligence** -Machine learning is a subfield of artificial intelligence that focuses on the development of algorithms and models that enable computers to learn from and make predictions or decisions based on data, without being explicitly programmed. Artificial intelligence and machine learning algorithms are being applied to a variety of biomedical research questions, ranging from image classification to genomic variant calling. AWS has a long list of AI/ML tutorials available and we have compiled a list here. Most recent development focuses on generative AI including use cases such as extracting information from text, transforming speech to text, and generating images from text. Sagemaker Studio allows the user to rapidly create, test, and train generative AI models and has ready to use models all contained with [JumpStart](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-jumpstart.html). These models range from foundation models, fine-tunable models, and task-specific solutions. -+ For examples of generative AI, view our [GenAI tutorials](/notebooks/GenAI) that use several AWS products such as [Bedrock](/notebooks/GenAI/AWS_Bedrock_Intro.ipynb) and [Jumpstart](/notebooks/GenAI/AWS_GenAI_Jumpstart.ipynb) and utilizes other tools like [Langchain](/notebooks/GenAI/Pubmed_RAG_chatbot.ipynb) and [Huggingface](/notebooks/GenAI/AWS_GenAI_Huggingface.ipynb) to deploy, train, prompt, and implement techniques like [Retrieval-Augmented Generation (RAG)](/notebooks/GenAI/Pubmed_RAG_chatbot.ipynb) to GenAI models. Also take a look at the [AWS GitHub repo](https://github.com/aws-samples/amazon-sagemaker-generativeai) for more Gen AI tutorials. 
-+ For other AI use cases, we recommend you start with this comprehensive [on-demand workshop](https://catalog.workshops.aws/hcls-aiml/en-US/breast-cancer-classification) on how to use SageMaker Studio for a variety of AI/ML use cases including applying a classifier to RNAseq data, classifying tabular breast cancer data, buiding graph neural nets on HIV data, training a medical imaging model on chest scans, summarize scientific literature using foundation models, MLOps using gene expression data, and finally, performing antibody structure prediction. -+ To learn more about Bedrock check out this [on-demand workshop](https://catalog.us-east-1.prod.workshops.aws/workshops/a4bdb007-5600-4368-81c5-ff5b4154f518/en-US) featuring uses cases for prompt engineering, summarization, Q/A, chatbot, and image, code, and text generation within Bedrock. -+ AWS has a very general tutorial [here](https://aws.amazon.com/getting-started/hands-on/build-train-deploy-machine-learning-model-sagemaker/) on how to build out an AI pipeline on SageMaker. -+ These [general examples](https://github.com/aws/amazon-sagemaker-examples/tree/main/introduction_to_applying_machine_learning) will teach you how to use Sagemaker tools more broadly. -+ You can also submit a training job to SageMaker, and have your final model uploaded to S3 using [PyTorch](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#train-a-model-with-pytorch), [Tensorflow](https://docs.aws.amazon.com/sagemaker/latest/dg/tf.html) or [Apache MXNet](https://docs.aws.amazon.com/sagemaker/latest/dg/mxnet.html). - -## **Clinical Informatics** -Clinical informatics, also known as healthcare informatics or medical informatics, is an interdisciplinary field that applies data science to healthcare data to improve patient care, enhance clinical processes, and facilitate medical research. It often involves integrating diverse data types including electronic health records, demographic, or environmental data. 
AWS offers two on demand workshops that walk you through AWS HealthLake for Population Health data analysis. [This first workshop](https://catalog.us-east-1.prod.workshops.aws/workshops/4849824d-084a-4a64-a237-f05027f54abc/en-US) shows you how to ingest data to HealthLake, query those data using Athena, visualize these data using QuickSight, then join FHIR data with environmental data and visualize the combined dataset. [The second workshop](https://catalog.us-east-1.prod.workshops.aws/workshops/498fdbc5-46e1-4cb0-97a0-f4ec3a30f26a/en-US/activity-streaming-data) also ingests data into HealthLake, then visualizes medical device data, uses AI to summarize clinical notes, and then transcribes clinical audio files and summarizes them. - -## **Download Data From the Sequence Read Archive (SRA)** -Next Generation genetic sequence data is housed in the NCBI Sequence Read Archive (SRA). You can access these data using the SRA Toolkit. We walk you through this using [this notebook](/notebooks/SRADownload), which also walks you through how to set up and search Athena tables to generate an accession list. You can also read [this guide](https://www.ncbi.nlm.nih.gov/sra/docs/sra-aws-download/) for more information on available dataset tables. Additional example notebooks can be found at this [NCBI repo](https://github.com/ncbi/ASHG-Workshop-2021). In particular, we recommend this notebook(https://github.com/ncbi/ASHG-Workshop-2021/blob/main/3_Biology_Example_AWS_Demo.ipynb), which goes into more detail on using Athena to access the results of the SRA Taxonomic Analysis Tool, which often differ from the user input species name due to contamination, error, or due to samples being metagenomic in nature. - -## **Genome Wide Association Studies** -Genome-wide association studies (GWAS) are large-scale investigations that analyze the genomes of many individuals to identify common genetic variants associated with traits, diseases, or other phenotypes. 
-- This [NIH CFDE written tutorial](https://training.nih-cfde.org/en/latest/Bioinformatic-Analyses/GWAS-in-the-cloud -) walks you through running a simple GWAS using EC2. The tutorials asks you to select the Ohio region, make sure you change your region to N. Virginia otherwise you will have network issues. Note that the CFDE page has a few other bioinformatics related tutorials like BLAST and Illumina read simulation. We also converted the GWAS tutorial to a simplified [notebook version](/notebooks/GWAS) if you prefer that format. See our [notebook guide](/docs/Jupyter_notebook.md) for help with setting up a Jupyter environment. - -## **Medical Imaging Analysis** -Medical imaging analysis requires the analysis of large image files and often requires elastic storage and accelerated computing. -- Most medical imaging analyses are done using notebooks, so we would recommend accessing this [Jupyter Notebook](/notebooks/SpleenLiverSegmentation) and cloning it into SageMaker. The tutorial walks through image segmentation. -- [This Sagemaker Studio on-demand workshop](https://catalog.workshops.aws/hcls-aiml/en-US/chest-xrays-object-detection) has a nice section on building a model on medical imaging data. -- You can also view this [AWS blog](https://aws.amazon.com/blogs/machine-learning/annotate-dicom-images-and-build-an-ml-model-using-the-monai-framework-on-amazon-sagemaker/) on how to annotate DICOM images and build a custom AI model with the data. -- You can learn to deidentify medical images following this AWS [tutorial](https://aws.amazon.com/blogs/machine-learning/de-identify-medical-images-with-the-help-of-amazon-comprehend-medical-and-amazon-rekognition/). - -## **RNAseq** -RNA-seq analysis is a high-throughput sequencing method that allows the measurement and characterization of gene expression levels and transcriptome dynamics. Workflows are typically run using workflow managers, and final results can often be visualized in notebooks. 
-- You can run this [Nextflow tutorial](https://nf-co.re/rnaseq/3.7) for RNAseq a variety of ways on AWS. Following the instructions outlined above, you could use EC2, SageMaker, or AWS Batch(/docs/Genomics_Workflows.md). -- [This AWS on-demand workshop](https://catalog.workshops.aws/hcls-aiml/en-US/breast-cancer-classification) shows how to analyze gene expression data using Amazon Sagemaker Studio. -- For a notebook version of a complete RNAseq pipeline from Fastq to Salmon quantification from the King Lab of the University of Maine INBRE use this [notebook](/notebooks/rnaseq-myco-tutorial-main), which we re-wrote to work on AWS. - -## **Single Cell RNAseq** -Single-cell RNA sequencing (scRNA-seq) is a technique that enables the analysis of gene expression at the individual cell level, providing insights into cellular heterogeneity, identifying rare cell types, and revealing cellular dynamics and functional states within complex biological systems. -- This [AWS blog](https://aws.amazon.com/blogs/publicsector/driving-innovation-single-cell-analysis-aws/) lays out a potential method that integrates a lot of the AWS native tools for running an scRNAseq pipeline. It is less of a tutorial, and more of a demo of what is possible. -- This [NVIDIA blog](https://developer.nvidia.com/blog/accelerating-single-cell-genomic-analysis-using-rapids/) details how to run an accelerated scRNAseq pipeline using RAPIDS. You can find a link to the GitHub that has lots of example notebooks [here](https://github.com/clara-parabricks/rapids-single-cell-examples). For each example use case they show some nice benchmarking data with time and cost for each machine type. You will see that most runs cost less than $1.00 with GPU machines. If you want a CPU version that users Scanpy you can use this [notebook](https://github.com/clara-parabricks/rapids-single-cell-examples/blob/master/notebooks/hlca_lung_cpu_analysis.ipynb). 
Pay careful attention to the environment setup as there are a lot of dependencies for these notebooks. Create a conda environment in the terminal, then run the notebook. Consider using [mamba](https://github.com/mamba-org/mamba) to speed up environment creation. We created a [guide](/docs/create_conda_env.md) for conda environment set up as well. - -## **ElasticBLAST** -NCBI BLAST (Basic Local Alignment Search Tool) is a widely used bioinformatics program provided by the National Center for Biotechnology Information (NCBI) that compares nucleotide or protein sequences against a large database to identify similar sequences and infer evolutionary relationships, functional annotations, and structural information. The NCBI team has written a version of BLAST for the cloud called ElasticBLAST, and you can read all about it [here](https://blast.ncbi.nlm.nih.gov/doc/elastic-blast/index.html). Essentially, ElasticBLAST helps you submit BLAST jobs to AWS Batch and write the results back to S3. Feel free to experiment with the example tutorial in Cloud Shell, or try our [notebook version](/notebooks/ElasticBLAST/run_elastic_blast.ipynb). - -## **Protein Folding** -You can run several protein folding algorithms including Alpha Fold on AWS. Because the databases are so large, the setup is normally pretty difficult, but AWS has created a StackFormation stack that automates spinning up all the resources necessary for running Alpha Fold and other protein folding algorithms. You can read about the AWS resources [here](https://aws.amazon.com/solutions/guidance/protein-folding-on-aws/), and view the GitHub page [here](https://github.com/aws-solutions-library-samples/aws-batch-arch-for-protein-folding). To get this to work, you will need to modify your security groups following [these instructions](https://docs.aws.amazon.com/fsx/latest/LustreGuide/limit-access-security-groups.html). 
You will also likely have to [grant additional permissions to the Role](/docs/update_sagemaker_role.md) that CloudFormation is using. If you get stuck, reach out to CloudLab@nih.gov. You can also run ESMFold [using this tutorial](https://catalog.workshops.aws/hcls-aiml/en-US/protein-analysis/esmfold). - -## **Long Read Sequence Analysis** -Long read DNA sequence analysis involves analyzing sequencing reads typically longer than 10 thousand base pairs (bp) in length, compared with short read sequencing where reads are about 150 bp in length. -Oxford Nanopore has a pretty complete offering of notebook tutorials for handling long read data to do a variety of things including variant calling, RNAseq, Sars-Cov-2 analysis and much more. Access the notebooks [here](https://labs.epi2me.io/nbindex/). These notebooks expect you are running locally and accessing the epi2me notebook server. To run them in Cloud Lab, skip the first cell that connects to the server and then the rest of the notebook should run correctly, with a few tweaks. If you are just looking to try out notebooks, don't start with these. If you are interested in long read sequence analysis, then some troubleshooting may be needed to adapt these to the Cloud Lab environment. You may even need to rewrite them in a fresh notebook by adapting the commands. Feel free to reach out to our support team for help. - -## **Drug Discovery** -The [Accelerating Therapeutics for Opportunities in Medicine (ATOM) Consortium](https://atomscience.org/) created a series of [Jupyter notebooks](https://github.com/ATOMScience-org/AMPL/tree/master/atomsci/ddm/examples/tutorials) that walk you through the ATOM approach to Drug Discovery. - -These notebooks were created to run in Google Colab, so if you run them in AWS, you will need to make a few modification. 
First, we recommend you use a [Sagemaker Studio Notebook](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated.html) rather than a User-Managed notebook simply because it will have Tensorflow and other dependencies installed. Be sure to attach a GPU to your instance (T4 is fine). Also, you will need to comment out `%tensorflow_version 2.x` since that is a Colab-specific command. You will also need to `pip install` a few packages as needed. If you get errors with `deepchem`, try running `pip install --pre deepchem[tensorflow]` and/or `pip install --pre deepchem[torch]`. Also, some notebooks will require a Tensorflow kernel, while others require Pytorch. You may also run into a Pandas error, reach out to the ATOM GitHub developers for the best solution, or review their issues. - -## **CryoEM** -Cryo-Electron Microscopy (cryoEM), is a powerful imaging technique used in structural biology to visualize the structures of biological macromolecules, such as proteins, nucleic acids, and large molecular complexes, at near-atomic or even atomic resolution. It has revolutionized the field of structural biology by providing detailed three-dimensional structures of biomolecules, which is crucial for understanding their functions. - -+ AWS created a [hands-on workshop](https://catalog.us-east-1.prod.workshops.aws/workshops/056181d4-c084-47be-954c-20ec22350b02/en-US) for you to stand up a cryoEM environment using RELION. -+ You can also read [this blog on how to set up cryoSPARC](https://aws.amazon.com/blogs/hpc/how-thermo-fisher-scientific-accelerated-cryo-em-using-aws-parallelcluster/), as well as [docs from cryoSPARC](https://guide.cryosparc.com/setup-configuration-and-management/cryosparc-on-aws). - -## **Open Data** -AWS has a lot of public data that you can integrate into your testing or use in your own research. You can access these datasets at the [Registry of Open Data on AWS](https://registry.opendata.aws/). 
There you can click on any of the datasets to view the S3 path to the data, as well as publications that have used those data and tutorials if available. To demonstrate, we can click the [gnomad dataset](https://registry.opendata.aws/broad-gnomad/), then get the S3 path and view the files at the command line by pasting `https://registry.opendata.aws/broad-gnomad/`. diff --git a/tutorials/notebooks/GenAI/Pubmed_RAG_chatbot.ipynb b/tutorials/notebooks/GenAI/Pubmed_RAG_chatbot.ipynb deleted file mode 100644 index 09aa19a..0000000 --- a/tutorials/notebooks/GenAI/Pubmed_RAG_chatbot.ipynb +++ /dev/null @@ -1,868 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "2edc6187-82ae-44e2-852f-2ad2712c93aa", - "metadata": {}, - "source": [ - "# Creating a PubMed Chatbot with Llama2" - ] - }, - { - "cell_type": "markdown", - "id": "3ecea2ad-7c65-4367-87e1-b021167c3a1d", - "metadata": {}, - "source": [ - "For this tutorial we are creating a PubMed chatbot that will answer questions by gathering information from documents we have provided via a index. The model we will be using today is a pretrained Llama2 model from Jumpstart.\n", - "\n", - "This tutorial will go over the following topics:\n", - "- Introduce langchain\n", - "- Explain the differences between zero-shot, one-shot, and few-shot prompting\n", - "- Practice using different document retrievers" - ] - }, - { - "cell_type": "markdown", - "id": "4d01e74b-b5b4-4be9-b16e-ec55419318ef", - "metadata": {}, - "source": [ - "### Deploy the Model" - ] - }, - { - "cell_type": "markdown", - "id": "9dbd13e7-afc9-416b-94dc-418a93e14587", - "metadata": {}, - "source": [ - "Identify which model we want to deploy from Jumpstart, in this case we are using Llama2." 
- ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "6b51bf71-d2e5-4afc-8569-338767b43b9c", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "(\n", - " model_id,\n", - " model_version,\n", - ") = (\n", - " \"meta-textgeneration-llama-2-7b-f\",\n", - " \"*\",\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "624c9bb8-3ce2-4240-b2b8-6b1bb93bb9f2", - "metadata": {}, - "source": [ - "Create an endpoint so that we can communicate with our model, send inputs, and retrieve outputs." - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "bf27747d-443f-47e7-9d2c-a8f5c5c6f3b8", - "metadata": { - "tags": [] - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/home/ec2-user/anaconda3/envs/python3/lib/python3.10/site-packages/pandas/core/computation/expressions.py:21: UserWarning: Pandas requires version '2.8.0' or newer of 'numexpr' (version '2.7.3' currently installed).\n", - " from pandas.core.computation.check import NUMEXPR_INSTALLED\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml\n", - "sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "For forward compatibility, pin to model_version='2.*' in your JumpStartModel or JumpStartEstimator definitions. Note that major version upgrades may have different EULA acceptance terms and input/output signatures.\n", - "For forward compatibility, pin to model_version='2.*' in your JumpStartModel or JumpStartEstimator definitions. Note that major version upgrades may have different EULA acceptance terms and input/output signatures.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "-----------------!" 
- ] - } - ], - "source": [ - "from sagemaker.jumpstart.model import JumpStartModel\n", - "\n", - "model = JumpStartModel(model_id=model_id)\n", - "predictor = model.deploy()" - ] - }, - { - "cell_type": "markdown", - "id": "163fabcb-a267-4279-a5ae-c93d0143c139", - "metadata": {}, - "source": [ - "Next we will print the endpoint name which we will need later to run our chatbot." - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "id": "1ad71f0d-3be5-4b03-9c1c-eb4585721fc8", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "endpoint_id=predictor.endpoint_name" - ] - }, - { - "cell_type": "markdown", - "id": "4f3e3ab1-5f7e-4028-a66f-9619926a2afd", - "metadata": {}, - "source": [ - "### PubMed API vs Kendra Index" - ] - }, - { - "cell_type": "markdown", - "id": "5a820eea-1538-4f40-86c4-eb14fe09e127", - "metadata": {}, - "source": [ - "Our chatbot will rely on documents to answer our questions to do so we are supplying it a **vector index**. A vector index or index is a data structure that enables fast and accurate search and retrieval of vector embeddings from a large dataset of objects. We will be working with two options for our index PubMed API vs Kendra Index." - ] - }, - { - "cell_type": "markdown", - "id": "7314b115-9433-460d-b275-78aa50f0a858", - "metadata": {}, - "source": [ - "**What is the difference?**\n", - "\n", - "The **PubMed API** is provided free by langchain to connect your model to more than **35 million citations** for biomedical literature from MEDLINE, life science journals, and online books. **Kendra index** is a AWS product that allows the user more **security and control** on which documents you wish to supply to your model. \n", - "\n", - "We will be exploring both methods to see which produces the best results!" 
- ] - }, - { - "cell_type": "markdown", - "id": "bcf1690d-e93d-4cd3-89c6-8d06b5a071a8", - "metadata": {}, - "source": [ - "#### Setting up a Kendra Index" - ] - }, - { - "cell_type": "markdown", - "id": "02053f4d-fad7-44ab-a7c3-cfa1c218240f", - "metadata": {}, - "source": [ - "If you choose to use a Kendra index to supply documents to your model follow the instructions below:" - ] - }, - { - "cell_type": "markdown", - "id": "1d1c9de7-4a06-4f85-b9ff-c8c9e51f8c70", - "metadata": {}, - "source": [ - "AWS marketplace has PubMed database named **PubMed CentralĀ® (PMC)** that contains free full-text archive of biomedical and life sciences journal article at the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM). We will be subsetting this database to add documents to our Kendra index. Ensure that you have the correct roles and policies to allow your environment to connect to S3 buckets, SageMaker, and Kendra." - ] - }, - { - "cell_type": "markdown", - "id": "78418da3-f806-4b19-9980-1fa41083a75a", - "metadata": {}, - "source": [ - "The first step will be to create a bucket that we will later use as our data source for our index." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "99d49432-cf03-4f19-aa82-ef7f8bad5bde", - "metadata": {}, - "outputs": [], - "source": [ - "#make bucket\n", - "bucket = 'pubmed-chat-docs'\n", - "!aws s3 mb s3://{bucket}" - ] - }, - { - "cell_type": "markdown", - "id": "b6ad30ba-cee8-47f9-bc1e-ece8961ac66a", - "metadata": {}, - "source": [ - "We will then download the metadata file from the PMC bucket, this will list all of the articles within the PMC bucket and their bucket paths we will use this to subset the database into our own bucket." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7b395e34-062d-4f77-afee-3601d471954a", - "metadata": {}, - "outputs": [], - "source": [ - "#download the metadata file\n", - "!aws s3 cp s3://pmc-oa-opendata/oa_comm/txt/metadata/txt/oa_comm.filelist.txt . --sse" - ] - }, - { - "cell_type": "markdown", - "id": "93a8595a-767f-4cad-9273-62d8e2cf60d1", - "metadata": {}, - "source": [ - "We only want the metadata of the first 100 files." - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "c26b0f29-2b07-43a6-800d-4aa5e957fe52", - "metadata": { - "tags": [] - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/home/ec2-user/anaconda3/envs/python3/lib/python3.10/site-packages/pandas/core/computation/expressions.py:21: UserWarning: Pandas requires version '2.8.0' or newer of 'numexpr' (version '2.7.3' currently installed).\n", - " from pandas.core.computation.check import NUMEXPR_INSTALLED\n" - ] - } - ], - "source": [ - "#import the file as a dataframe\n", - "import pandas as pd\n", - "import os\n", - "df = pd.read_csv('oa_comm.filelist.csv')\n", - "#first 100 files\n", - "first_100=df[0:101]\n", - "#save new metadata\n", - "first_100.to_csv('oa_comm.filelist_100.csv', index=False)" - ] - }, - { - "cell_type": "markdown", - "id": "abd1ae93-450e-4c79-83cc-ea46a1b507c1", - "metadata": {}, - "source": [ - "Lets look at our metadata! We can see that the bucket path to the files are under the **Key** column this is what we will use to loop through the PMC bucket and copy the first 100 files to our bucket." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ff77b2aa-ed1b-4d27-8163-fdaa7a304582", - "metadata": {}, - "outputs": [], - "source": [ - "first_100" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7d63a7e2-dbf1-49ec-bc84-b8c2c8bde62d", - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "#gather path to files in bucket\n", - "for i in first_100['Key']:\n", - " os.system(f'aws s3 cp s3://pmc-oa-opendata/{i} s3://{bucket}/docs/ --sse')" - ] - }, - { - "cell_type": "markdown", - "id": "c1b396c8-baa9-44d6-948c-2326dc514839", - "metadata": {}, - "source": [ - "We will also save our new metadata file to our bucket to help Kendra index our files." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "79e7076a-65c6-4e06-b84b-795ee7d4de00", - "metadata": {}, - "outputs": [], - "source": [ - "!aws s3 cp oa_comm.filelist_100.csv s3://{bucket}/docs/" - ] - }, - { - "cell_type": "markdown", - "id": "79fb7fe8-4904-4b19-a411-95bca27ea87d", - "metadata": {}, - "source": [ - "Now we can create our Kendra index, use the following instructions to create a index via the console or command line [Creating a Kendra index](https://docs.aws.amazon.com/kendra/latest/dg/create-index.html). To connect our bucket as a data source follow the instructions provided [here](https://docs.aws.amazon.com/kendra/latest/dg/data-source-s3.html) to do so via the console or AWS Python SDK." - ] - }, - { - "cell_type": "markdown", - "id": "07b3bc6b-8c43-476f-a662-abda830dc2da", - "metadata": { - "tags": [] - }, - "source": [ - "### Creating a Inference Script " - ] - }, - { - "cell_type": "markdown", - "id": "3ba2291e-109e-4120-ad10-5dbfd341a07b", - "metadata": {}, - "source": [ - "Inorder for us to fluidly send input and receive outputs from our chatbot we need to create a **inference script** that will format inputs in a way that the chatbot can understand and format outputs in a way we can understand. 
We will also be supplying instructions to the chatbot through the script.\n", - "\n", - "Our script will utilize **langchain** tools and packages to enable our model to:\n", - "- **Connect to sources of context** (e.g. providing our model with tasks and examples)\n", - "- **Rely on reason** (e.g. instruct our model how to answer based on provided context)\n", - "\n", - "**Warning**: The overall inference script must be run in the terminal via the command `python YOUR_SCRIPT.py`." - ] - }, - { - "cell_type": "markdown", - "id": "538f42ee-a502-4e56-9e85-4f6e3726ef7a", - "metadata": {}, - "source": [ - "**Warning:** The following tools must be installed via your terminal" - ] - }, - { - "cell_type": "markdown", - "id": "58582571-eae0-4440-afda-be1af6750e11", - "metadata": { - "jupyter": { - "outputs_hidden": true - }, - "tags": [] - }, - "source": [ - "`pip install \"langchain\" \"xmltodict\"`" - ] - }, - { - "cell_type": "markdown", - "id": "ad374085-c4b1-4083-85a5-90cba35846d6", - "metadata": {}, - "source": [ - "The first part of our script will be to list all the tools that are required. 
\n", - "- **PubMedRetriever:** Utilizes the langchain retriever tool to specifically retrieve PubMed documents from the PubMed API.\n", - "- **AmazonKendraRetriever:** Utilizes the langchain retriever tool to specifically retrieve documents stored in your Kendra index.\n", - "- **ConversationalRetrievalChain:** Allows the user to construct a conversation with the model and retrieves the outputs while sending inputs to the model.\n", - "- **PromptTemplate:** Allows the user to give the model instructions through a prompt; this is the best method for zero-shot and few-shot prompting.\n", - "- **LLMContentHandler:** Handles the content to and from the model by transforming the input to a format that the model can accept and transforming the output from the model to the string that the LLM class expects.\n", - "- **SagemakerEndpoint**: Connects to our endpoint in SageMaker and allows all of the tools listed above to connect to our model.\n" - ] - }, - { - "cell_type": "markdown", - "id": "6f0ad48d-c6c8-421a-a48b-88e979d15b57", - "metadata": { - "tags": [] - }, - "source": [ - "```python\n", - "from langchain.retrievers import PubMedRetriever\n", - "from langchain.retrievers import AmazonKendraRetriever\n", - "from langchain.llms import SagemakerEndpoint\n", - "from langchain.chains import ConversationalRetrievalChain\n", - "from langchain.prompts import PromptTemplate\n", - "from langchain.llms.sagemaker_endpoint import LLMContentHandler\n", - "import sys\n", - "import json\n", - "import os\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "900f4c31-71cd-4f39-8bfc-de098bdbaafc", - "metadata": {}, - "source": [ - "Second, we will start building the pieces we need to send inputs to and retrieve outputs from our model. We begin with a small class that defines colors for our text conversation with the chatbot, which we will use later." 
- ] - }, - { - "cell_type": "markdown", - "id": "decbb901-f811-4b8e-a956-4c8c7f914ae2", - "metadata": { - "tags": [] - }, - "source": [ - "```python\n", - "class bcolors:\n", - " HEADER = '\\033[95m'\n", - " OKBLUE = '\\033[94m'\n", - " OKCYAN = '\\033[96m'\n", - " OKGREEN = '\\033[92m'\n", - " WARNING = '\\033[93m'\n", - " FAIL = '\\033[91m'\n", - " ENDC = '\\033[0m'\n", - " BOLD = '\\033[1m'\n", - " UNDERLINE = '\\033[4m'\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "ba36d057-5189-4075-a243-18996c6fc932", - "metadata": {}, - "source": [ - "Next, we create a function that gathers the information needed to connect to our model:\n", - "- Location\n", - "- Kendra Index ID **(only if you are using Kendra instead of the PubMed API)**\n", - "- Endpoint_name or ID" - ] - }, - { - "cell_type": "markdown", - "id": "3f7a244a-7e71-40d3-ae78-8e166dd3c7ee", - "metadata": {}, - "source": [ - "```python\n", - "def build_chain():\n", - " region = os.environ[\"AWS_REGION\"]\n", - " kendra_index_id = os.environ[\"KENDRA_INDEX_ID\"] #only needed if using a Kendra index instead of the PubMed API\n", - " endpoint_name = os.environ[\"LLAMA_2_ENDPOINT\"]\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "4e681908-df7a-4cfd-924d-5350eb06e770", - "metadata": {}, - "source": [ - "Next we will create a class named **ContentHandler** that transforms our inputs and outputs to and from JSON. 
For Llama2 to understand and accept our inputs, we need to structure them in a specific manner; this is done by the **transform_input** function:\n", - "```\n", - "{\n", - "\"inputs\": \n", - " [[\n", - " {\"role\": \"user\", \"content\": prompt},\n", - " ]],\n", - " **model_kwargs\n", - " }\n", - "```\n", - "Where `prompt` will be our instructions to our model (what the model is expected to do with our input) and `**model_kwargs` is where we provide our parameters.\n", - "\n", - "Our input is then encoded as **UTF-8** to convert our string into bytes. \n", - "\n", - "The next function in our class, **transform_output**, takes the outputs sent from our model and decodes them from bytes back into a string." - ] - }, - { - "cell_type": "markdown", - "id": "d63f9b4c-202a-4170-9988-369bbe52fe6b", - "metadata": { - "tags": [] - }, - "source": [ - "```python\n", - "class ContentHandler(LLMContentHandler):\n", - " content_type = \"application/json\"\n", - " accepts = \"application/json\"\n", - "\n", - " def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:\n", - " input_str = json.dumps({\"inputs\": \n", - " [[\n", - " {\"role\": \"user\", \"content\": prompt},\n", - " ]],\n", - " **model_kwargs\n", - " })\n", - " \n", - " return input_str.encode('utf-8')\n", - " \n", - " def transform_output(self, output: bytes) -> str:\n", - " response_json = json.loads(output.read().decode(\"utf-8\"))\n", - " \n", - " return response_json[0]['generation']['content']\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "dab1012f-ed20-47b9-9162-924e03e836d5", - "metadata": {}, - "source": [ - "Now that we have a class that handles our inputs and outputs in a format that our model can understand, we use the **SagemakerEndpoint** tool to connect to the endpoint that we made in SageMaker. 
\n", - "\n", - "Notice that we set our class as the **content_handler** and specified **model_kwargs** to control the temperature, top p, and maximum number of new tokens the model generates.\n" - ] - }, - { - "cell_type": "markdown", - "id": "8cadb1af-2c46-4ab1-92f9-6e0861f83324", - "metadata": { - "tags": [] - }, - "source": [ - "```python\n", - "content_handler = ContentHandler()\n", - "\n", - " llm=SagemakerEndpoint(\n", - " endpoint_name=endpoint_name, \n", - " region_name=region, \n", - " model_kwargs={\"parameters\": {\"max_new_tokens\": 1000, \"top_p\": 0.9,\"temperature\":0.6}},\n", - " endpoint_kwargs={\"CustomAttributes\":\"accept_eula=true\"},\n", - " content_handler=content_handler,\n", - " )\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "c44b4f91-0c64-459b-a6e9-8a955c0797c7", - "metadata": {}, - "source": [ - "Next we specify our retriever. Both the PubMed and Kendra retrievers are listed below; please add only one per script." - ] - }, - { - "cell_type": "markdown", - "id": "21c61724-23d3-4b49-8c72-cbd208bdb5df", - "metadata": { - "tags": [] - }, - "source": [ - "```python\n", - "retriever= PubMedRetriever()\n", - "retriever = AmazonKendraRetriever(index_id=kendra_index_id,region_name=region) #only use if using Kendra Index\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "ec8e464a-0931-444a-aa58-09ee0c4c9884", - "metadata": {}, - "source": [ - "Here we construct our **prompt_template**; this is where we can try zero-shot or few-shot prompting. Only add one method per script." - ] - }, - { - "cell_type": "markdown", - "id": "4431051e-0e84-408e-9821-f50a9b88c9c1", - "metadata": {}, - "source": [ - "#### Zero-shot prompting\n", - "\n", - "Zero-shot prompting does not require any additional training; instead, it gives a pre-trained language model a task or query and asks it to generate text (our output). 
The model relies on its general language understanding and the patterns it has learned during its training to produce relevant output. In our script we have connected our model to a **retriever** to make sure it gathers information from that retriever (this can be the PubMed API or Kendra). \n", - "\n", - "Note below that the task reads like a set of instructions: it tells our model that it will be asked questions, which it should answer based on the scientific documents supplied by the retriever (this can be the PubMed API or a Kendra index). All of this information is established as a **prompt template** for our model to receive." - ] - }, - { - "cell_type": "markdown", - "id": "c0316dc5-6274-4a5e-92e4-3d266ed6a4df", - "metadata": { - "tags": [] - }, - "source": [ - "```python\n", - "prompt_template = \"\"\"\n", - " Ignore everything before.\n", - " \n", - " Instructions:\n", - " I will provide you with research papers on a specific topic in English, and you will create a cumulative summary. \n", - " The summary should be concise and should accurately and objectively communicate the takeaway of the papers related to the topic. \n", - " You should not include any personal opinions or interpretations in your summary, but rather focus on objectively presenting the information from the papers. \n", - " Your summary should be written in your own words and ensure that your summary is clear, concise, and accurately reflects the content of the original papers. First, provide a concise summary then citations at the end.\n", - " \n", - " {question} Answer \"don't know\" if not present in the document. 
\n", - " {context}\n", - " Solution:\"\"\"\n", - " PROMPT = PromptTemplate(\n", - " template=prompt_template, input_variables=[\"context\", \"question\"],\n", - " )\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "edbe7032-8507-4d07-baab-1b3bf0e92074", - "metadata": {}, - "source": [ - "#### One-shot and Few-shot Prompting" - ] - }, - { - "cell_type": "markdown", - "id": "5614ea04-e1f8-4941-ae16-4359f718f98f", - "metadata": {}, - "source": [ - "One-shot and few-shot prompting are similar to zero-shot prompting: in addition to giving our model a task as before, we also supply one or more examples of how the model should structure its output.\n", - "\n", - "See below that we have implemented one-shot prompting in our script. " - ] - }, - { - "cell_type": "markdown", - "id": "5ffb9669-5b77-4d9b-9f4e-a0d3a18b0fae", - "metadata": {}, - "source": [ - "```python\n", - "prompt_template = \"\"\"\n", - " Instructions:\n", - " I will provide you with research papers on a specific topic in English, and you will create a cumulative summary. \n", - " The summary should be concise and should accurately and objectively communicate the takeaway of the papers related to the topic. \n", - " You should not include any personal opinions or interpretations in your summary, but rather focus on objectively presenting the information from the papers. \n", - " Your summary should be written in your own words and ensure that your summary is clear, concise, and accurately reflects the content of the original papers. First, provide a concise summary then citations at the end. \n", - " Examples:\n", - " Question: What is a cell?\n", - " Answer: '''\n", - " Cell, in biology, the basic membrane-bound unit that contains the fundamental molecules of life and of which all living things are composed. \n", - " Sources: \n", - " Chow, Christopher , Laskey, Ronald A. , Cooper, John A. , Alberts, Bruce M. , Staehelin, L. Andrew , \n", - " Stein, Wilfred D. , Bernfield, Merton R. , Lodish, Harvey F. 
, Cuffe, Michael and Slack, Jonathan M.W.. \n", - " \"cell\". Encyclopedia Britannica, 26 Sep. 2023, https://www.britannica.com/science/cell-biology. Accessed 9 November 2023.\n", - " '''\n", - " \n", - " {question} Answer \"don't know\" if not present in the document. \n", - " {context}\n", - " \n", - "\n", - " \n", - " Solution:\"\"\"\n", - " PROMPT = PromptTemplate(\n", - " template=prompt_template, input_variables=[\"context\", \"question\"],\n", - " )\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "82c66d53-97b2-46dc-a466-70a3d3bee4a7", - "metadata": {}, - "source": [ - "The following set of commands controls the chat history, essentially telling the model to expect another question after it finishes answering the previous one. Follow-up questions can contain references to past chat history, so the **ConversationalRetrievalChain** combines the chat history and the follow-up question into a standalone question, then looks up relevant documents from the retriever, and finally passes those documents and the question to a question-answering chain to return a response.\n", - "\n", - "All of these pieces, such as our conversational chain, prompt, and chat history, are passed through a function called **run_chain** so that our model can return its response. We have also set the length of our chat history to one, meaning that our model can only refer to the previous exchange as a reference." 
- ] - }, - { - "cell_type": "markdown", - "id": "fda4d33b-60f2-4462-a8e6-bbce7f8a7b07", - "metadata": {}, - "source": [ - "```python\n", - "condense_qa_template = \"\"\"\n", - " Chat History:\n", - " {chat_history}\n", - " Here is a new question for you: {question}\n", - " Standalone question:\"\"\"\n", - " standalone_question_prompt = PromptTemplate.from_template(condense_qa_template)\n", - " \n", - " qa = ConversationalRetrievalChain.from_llm(\n", - " llm=llm, \n", - " retriever=retriever, \n", - " condense_question_prompt=standalone_question_prompt, \n", - " return_source_documents=True, \n", - " combine_docs_chain_kwargs={\"prompt\":PROMPT},\n", - " )\n", - " return qa\n", - "\n", - "def run_chain(chain, prompt: str, history=[]):\n", - " print(prompt)\n", - " return chain({\"question\": prompt, \"chat_history\": history})\n", - "\n", - "MAX_HISTORY_LENGTH = 1 #increase to refer to more previous chats\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "b8f1ef8d-66fe-4f84-933b-af2d730bd114", - "metadata": {}, - "source": [ - "The final part of our script uses our color class to add a bit of flair to our conversation with our model. When first initialized, the script should greet the user with **\"Hello! How can I help you?\"** and then instruct the user to ask a question or exit the session: **\"Ask a question, start a New search: or CTRL-D to exit.\"**. Prefixing a question with **New search:** clears the chat history; otherwise, once the history reaches its maximum length, the oldest entry is dropped. We then run the run_chain function to get the model's response and add that response to the **chat history**. " - ] - }, - { - "cell_type": "markdown", - "id": "1aa6ef65-ced4-445e-875c-7fee3483b81d", - "metadata": {}, - "source": [ - "```python\n", - "if __name__ == \"__main__\":\n", - " chat_history = []\n", - " qa = build_chain()\n", - " print(bcolors.OKBLUE + \"Hello! 
How can I help you?\" + bcolors.ENDC)\n", - " print(bcolors.OKCYAN + \"Ask a question, start a New search: or CTRL-D to exit.\" + bcolors.ENDC)\n", - " print(\">\", end=\" \", flush=True)\n", - " for query in sys.stdin:\n", - " if (query.strip().lower().startswith(\"new search:\")):\n", - " query = query.strip().lower().replace(\"new search:\",\"\")\n", - " chat_history = []\n", - " elif (len(chat_history) == MAX_HISTORY_LENGTH):\n", - " chat_history.pop(0)\n", - " result = run_chain(qa, query, chat_history)\n", - " chat_history.append((query, result[\"answer\"]))\n", - " print(bcolors.OKGREEN + result['answer'] + bcolors.ENDC)\n", - " ###this if statement is not needed for PubMed Retriever users\n", - " if 'source_documents' in result: \n", - " print(bcolors.OKGREEN + 'Sources:')\n", - " for d in result['source_documents']:\n", - " print(d.metadata['source'])\n", - " ###\n", - " print(bcolors.ENDC)\n", - " print(bcolors.OKCYAN + \"Ask a question, start a New search: or CTRL-D to exit.\" + bcolors.ENDC)\n", - " print(\">\", end=\" \", flush=True)\n", - " print(bcolors.OKBLUE + \"Bye\" + bcolors.ENDC)\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "1abcbd48-bb84-4310-b8eb-ad87850a8649", - "metadata": {}, - "source": [ - "Running our script in the terminal requires us to first export the following environment variables and then run our Python script with the command `python NAME_OF_YOUR_SCRIPT.py`. For more guidance, take a look at our **example inference scripts** for the [PubMed API](/example_scripts/langchain_chat_llama_2_zeroshot.py) and [Kendra](/example_scripts/kendra_chat_llama_2.py)." 
- ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "ba97df23-6893-438d-8a67-cb7dbf83e407", - "metadata": { - "tags": [] - }, - "outputs": [ - { - "data": { - "text/plain": [ - "'meta-textgeneration-llama-2-7b-f-2023-11-21-20-18-40-341'" - ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "#retrieve our endpoint id\n", - "endpoint_id" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7eab00a3-54ff-4873-8d25-eaf8bd18a2e6", - "metadata": {}, - "outputs": [], - "source": [ - "#enter these environment variables in your terminal\n", - "export AWS_REGION=''\n", - "export LLAMA_2_ENDPOINT=''\n", - "export KENDRA_INDEX_ID=''" - ] - }, - { - "cell_type": "markdown", - "id": "bbe127e6-c0b1-4e07-ad56-38c30a9bf858", - "metadata": { - "tags": [] - }, - "source": [ - "You should see similar results in the terminal. In this example we ask the chatbot to explain brain cancer!" - ] - }, - { - "cell_type": "markdown", - "id": "80c8fb4b-e74f-4e8d-892b-0f913eff747d", - "metadata": {}, - "source": [ - "![PubMed Chatbot Results](../../../docs/images/PubMed_chatbot_results.png)" - ] - }, - { - "cell_type": "markdown", - "id": "a178c1c6-368a-48c5-8beb-278443b685a2", - "metadata": {}, - "source": [ - "### Clean Up" - ] - }, - { - "cell_type": "markdown", - "id": "7ec06a34-dc47-453f-b519-424804fa2748", - "metadata": {}, - "source": [ - "**Warning:** Don't forget to delete the resources we just made to avoid accruing additional costs!" 
- ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "c307bb17-757a-4579-a0d8-698eb1bb3f2e", - "metadata": {}, - "outputs": [ - { - "ename": "NameError", - "evalue": "name 'endpoint' is not defined", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", - "Cell \u001b[0;32mIn[2], line 3\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;66;03m#Delete model and endpoint\u001b[39;00m\n\u001b[1;32m 2\u001b[0m \u001b[38;5;66;03m#model.delete()\u001b[39;00m\n\u001b[0;32m----> 3\u001b[0m \u001b[43mendpoint\u001b[49m\u001b[38;5;241m.\u001b[39mdelete()\n", - "\u001b[0;31mNameError\u001b[0m: name 'endpoint' is not defined" - ] - } - ], - "source": [ - "#Delete model and endpoint\n", - "model.delete()\n", - "endpoint.delete()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "280cea0a-a8fc-494e-8ce4-afb65847a222", - "metadata": {}, - "outputs": [], - "source": [ - "#Delete bucket\n", - "!aws s3 rb s3://$bucket --force " - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "conda_python3", - "language": "python", - "name": "conda_python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.13" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -}