diff --git a/docs/Install_AWSParallelCluster.md b/docs/Install_AWSParallelCluster.md
index 88b166b..7f06cc9 100644
--- a/docs/Install_AWSParallelCluster.md
+++ b/docs/Install_AWSParallelCluster.md
@@ -2,43 +2,34 @@
 ## To install an instance of the AWS ParallelCluster UI choose an AWS Cloud Formation quick-create link for the AWS region you want the cluster in.
 
- The AWS ParallelCluster UI is a web-based user interface that mirrors the AWS ParallelCluster pcluster CLI, while providing a console-like experience. You install and access the AWS ParallelCluster UI in your AWS account. When you run it, the AWS ParallelCluster UI accesses an instance of the AWS ParallelCluster API hosted on Amazon API Gateway in your AWS account.
+The AWS ParallelCluster UI is a web-based user interface that mirrors the AWS ParallelCluster pcluster CLI, while providing a console-like experience. You install and access the AWS ParallelCluster UI in your AWS account. When you run it, the AWS ParallelCluster UI accesses an instance of the AWS ParallelCluster API hosted on Amazon API Gateway in your AWS account.
 
- Follow the instructions [here](https://docs.aws.amazon.com/parallelcluster/latest/ug/install-pcui-v3.html), once UI is deployed the link should direct you to the UI and should look like :
+The AWS documentation for installing the PCUI can be found [here](https://docs.aws.amazon.com/parallelcluster/latest/ug/install-pcui-v3.html).
 
- ![parallel cluster UI image](/docs/images/ParallelclusterUI.PNG)
+To install an instance of the AWS ParallelCluster UI (PCUI), you choose an AWS CloudFormation quick-create link for the AWS Region that you create clusters in. The quick-create URL takes you to a Create Stack Wizard where you provide quick-create stack template inputs and deploy the stack.
 
- ### Create and Configure ParallelCluster using the UI without pre-made templates
-
- 1. In the AWS ParallelCluster UI Clusters view, choose Create cluster, Step by step.
-
- 2. In Cluster, Name, enter a name for your cluster.
-
- 3. Choose a VPC and a subnet for your cluster, and choose Next.
-
- 4. In Head node, choose Add SSM session, and choose Next.
-
- 5. In Queues, Compute resources, choose 1 for Static nodes.
-
- 6. Choose and instance type, and choose Next.
-
- 7. In Storage, select desired shared storage type or stay with default options and choose Next.
-
- 8. In Cluster configuration, review the cluster configuration YAML and choose Dry run to validate it.
-
- 9. Choose Create to create your cluster, based on the validated configuration.
-
- 10. After a few seconds, the AWS ParallelCluster UI automatically navigates you back to Clusters, where you can monitor the cluster create status and Stack events.
-
- 11. Choose Details to see cluster details, such as the version and status.
+The quick link for creating this stack within the us-east-1 region can be found [here](https://us-east-1.console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/create/review?stackName=parallelcluster-ui&templateURL=https://parallelcluster-ui-release-artifacts-us-east-1.s3.us-east-1.amazonaws.com/parallelcluster-ui.yaml).
 
- 12. Choose Instances to see the list of EC2 instances and status.
+**Use an AWS CloudFormation quick-create link to deploy a PCUI stack with nested Amazon Cognito, API Gateway, and Amazon EC2 Systems Manager stacks.**
 
- 13. Choose Stack events to view cluster stack events, and a AWS Management Console link to the CloudFormation stack that creates the cluster.
+1. Sign in to the AWS Management Console.
+2. Deploy the PCUI by choosing the quick-create link above for your AWS Region. This takes you to the CloudFormation **Create Stack Wizard** in the console.
+3. Enter a valid email address for **Admin's Email.**
+4. Enter an **ImageBuilder Custom VPC**. You can find this in the VPC Dashboard of the AWS Console.
+5. After deployment completes successfully, the PCUI sends a temporary password to this email address. You use the temporary password to access the PCUI. If you delete the email before you save or use the temporary password, you must delete the stack and reinstall the PCUI.
+6. Keep the rest of the form blank, or enter values for the optional parameters to customize the PCUI build.
+7. Note the stack name for use in later steps.
+8. Navigate to **Capabilities** and agree to the CloudFormation capabilities.
+9. Choose **Create.** It takes about 15 minutes to complete the AWS ParallelCluster API and PCUI deployment.
+10. View the stack details as the stack is created.
+11. After the deployment completes, open the admin email that was sent to the address you entered. It contains a temporary password that you use to access the PCUI. If you permanently delete the email and you haven't yet logged in to the PCUI, you must delete the PCUI stack you created and reinstall the PCUI.
+12. In the AWS CloudFormation console list of stacks, choose the link to the stack name that you noted in a previous step.
+13. In **Stack details**, choose **Outputs** and select the link for the key named **StacknameURL** to open the PCUI. *Stackname* is the name that you noted in a previous step.
+14. Enter the temporary password. Follow the steps to create your own password and log in.
+15. You are now on the home page of the PCUI in the AWS Region that you selected.
 
- 14. In Details, after cluster creation completes, choose View YAML to view or download the cluster configuration YAML file.
 
+ ![parallel cluster UI image](images/ParallelclusterUI.PNG)
 
- 15. After cluster creation completes, choose Shell to access the cluster head node.
 
 
 # Installing the AWS ParallelCluster CLI:
@@ -132,7 +123,7 @@ export PATH="$HOME/mambaforge/bin:$PATH"
 Output should display version number.
 ## 5. Move on to configuring and connecting to your cluster
- Follow the [next instructions](/docs/Configure_AWSParallelCluster.md) on how to configure your cluster.
+ Follow the [next instructions](/docs/Configure_AWSParallelCluster.md) on how to configure your cluster via the command line. Otherwise, check out our tutorial for using ParallelCluster via the console to run Snakemake workflows [here](https://github.com/STRIDES/NIHCloudLabAWS/blob/aws-parallel-cluster/notebooks/Snakemake/AWS-ParallelCluster.ipynb).
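+
+**Tip:** if you would like to confirm from a terminal that the PCUI stack finished deploying, you can query CloudFormation with the AWS CLI. This is an optional check, and it assumes you kept the default stack name `parallelcluster-ui`:
+
+```bash
+aws cloudformation describe-stacks \
+  --stack-name parallelcluster-ui \
+  --query "Stacks[0].StackStatus" \
+  --output text
+# Prints CREATE_COMPLETE once deployment has succeeded
+```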
diff --git a/docs/images/create-cluster.png b/docs/images/create-cluster.png
new file mode 100644
index 0000000..23efd88
Binary files /dev/null and b/docs/images/create-cluster.png differ
diff --git a/docs/images/nav-url-from-output.png b/docs/images/nav-url-from-output.png
new file mode 100644
index 0000000..97da228
Binary files /dev/null and b/docs/images/nav-url-from-output.png differ
diff --git a/docs/images/pcui.png b/docs/images/pcui.png
new file mode 100644
index 0000000..32744b7
Binary files /dev/null and b/docs/images/pcui.png differ
diff --git a/notebooks/Snakemake/AWS-ParallelCluster.ipynb b/notebooks/Snakemake/AWS-ParallelCluster.ipynb
new file mode 100644
index 0000000..3901403
--- /dev/null
+++ b/notebooks/Snakemake/AWS-ParallelCluster.ipynb
@@ -0,0 +1,842 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "1b5b1446",
+   "metadata": {},
+   "source": [
+    "# Snakemake on AWS ParallelCluster \n",
+    "\n",
+    "**Skill Level: Beginner**\n",
+    "\n",
+    "## Overview\n",
+    "\n",
+    "### AWS ParallelCluster \n",
+    "AWS ParallelCluster is a fully managed tool that simplifies the creation, management, and deployment of High-Performance Computing (HPC) clusters on the AWS cloud. It automates the setup and running of HPC clusters. Users can define their cluster configurations, including instance types, storage options, install scripts, and advanced networking settings like VPC, subnets, and security groups.\n",
+    "\n",
+    "AWS ParallelCluster integrates with HPC schedulers like Slurm, offering robust job scheduling and resource management. \n",
+    "\n",
+    "### Snakemake and the `pcluster-slurm` plugin\n",
+    "\n",
+    "Snakemake is a workflow manager that simplifies the process of creating and executing complex data analysis pipelines. It uses a Python-based language to define workflows and automate the execution of tasks. \n",
+    "\n",
+    "Snakemake workflows can be deployed seamlessly on AWS ParallelCluster using the `pcluster-slurm` plugin. This plugin enables the use of Slurm via AWS ParallelCluster as an executor for Snakemake workflows. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "10304f5e",
+   "metadata": {},
+   "source": [
+    "## Learning Objectives \n",
+    "\n",
+    "We hope that by the end of this tutorial, you will:\n",
+    "1. Understand the concepts of AWS ParallelCluster for High-Performance Computing.\n",
+    "2. Know how to set up and configure an AWS ParallelCluster environment.\n",
+    "3. Gain a basic understanding of Snakemake as a workflow manager and learn how to define and execute workflows using Snakemake's syntax.\n",
+    "4. Discover how to integrate Snakemake workflows with AWS ParallelCluster using the `pcluster-slurm` plugin."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0970e82b",
+   "metadata": {},
+   "source": [
+    "## Prerequisites\n",
+    "\n",
+    "* Access to a VPC network through your AWS account\n",
+    "* Access to the AWS ParallelCluster UI \n",
+    "\n",
+    "Please follow the installation instructions for the ParallelCluster UI provided [here](https://github.com/STRIDES/NIHCloudLabAWS/blob/main/docs/Install_AWSParallelCluster.md). These instructions will guide you through the necessary steps to create a CloudFormation stack through which you can access the AWS ParallelCluster UI. \n",
+    "\n",
+    "Additionally, we urge you to check out the documents within the `docs/` folder of the repository for more bioinformatics and Gen AI tutorials.\n",
+    "\n",
+    "Once you have created the CloudFormation stack for the PCUI, navigate to the user interface URL. It will look like this: ![alt text](../../docs/images/pcui.png)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "60153044-61d4-4672-8a98-73f87a3f64cf",
+   "metadata": {},
+   "source": [
+    "## Pricing \n",
+    "\n",
+    "When using AWS ParallelCluster, you only pay for the services utilized when creating and operating clusters, including computing, storage, and CloudFormation costs. A full list of services that AWS ParallelCluster may use can be found [here](https://docs.aws.amazon.com/parallelcluster/latest/ug/aws-services-v3.html).\n",
+    "\n",
+    "The PCUI is built on a serverless architecture, and in most cases its usage falls within the free tier. For more information on the costs associated with the PCUI, please refer to the documentation found [here](https://docs.aws.amazon.com/parallelcluster/latest/ug/install-pcui-costs-v3.html)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ca5bcc55",
+   "metadata": {},
+   "source": [
+    "## Get Started\n",
+    "\n",
+    "### Create a Cluster \n",
+    "Let's create a cluster within the ParallelCluster environment.\n",
+    "\n",
+    "![alt text](../../docs/images/create-cluster.png)\n",
+    "\n",
+    "1. In the PCUI Clusters view, choose **Create cluster** > **Step by step**.\n",
+    "2. In Cluster, **Name**, enter a name for your cluster.\n",
+    "3. Choose a **VPC** from the available options and choose Next. CloudLab users will have access to pre-configured VPC networks.\n",
+    "4. In **Head node**, choose **Add SSM session**. This will allow you to access the head node through the **`Shell`** button. Change the instance type of your head node to **t2.xlarge**. \n",
+    "5. In **Queues**, provide a name and subnet for your queue.\n",
+    "6. In **Compute resources**, choose 1 for **Static nodes** and select **c5n.large** as the instance type for your compute resources. \n",
+    "7. In Storage, choose Next.\n",
+    "8. In Cluster configuration, review the cluster configuration YAML and choose **Dry run** to \n",
+    "validate it.\n",
+    "9. Choose **Create** to create your cluster, based on the validated configuration.\n",
+    "10. After a few seconds, the PCUI automatically navigates you back to Clusters, where you can\n",
+    "monitor the cluster creation status and Stack events.\n",
+    "11. Choose **Details** to see cluster details, such as the version and status.\n",
+    "12. Choose **Instances** to see the list of Amazon EC2 instances and status.\n",
+    "13. Choose **Stack events** to view cluster stack events, and an AWS Management Console link to the\n",
+    "CloudFormation stack that creates the cluster.\n",
+    "14. In Details, after cluster creation completes, choose **View YAML** to view or download the cluster configuration YAML file.\n",
+    "\n",
+    "This is the YAML file for the cluster described above: "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9f2138e1",
+   "metadata": {},
+   "source": [
+    "```yaml\n",
+    "Imds:\n",
+    "  ImdsSupport: v2.0\n",
+    "HeadNode:\n",
+    "  InstanceType: t2.xlarge\n",
+    "  Imds:\n",
+    "    Secured: true\n",
+    "  LocalStorage:\n",
+    "    RootVolume:\n",
+    "      VolumeType: gp3\n",
+    "      Size: 50\n",
+    "  Networking:\n",
+    "    SubnetId: subnet-0be4c22d8137b8085\n",
+    "  Iam:\n",
+    "    AdditionalIamPolicies:\n",
+    "      - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore\n",
+    "  Ssh:\n",
+    "    KeyName: Snakemake-cluster-key-pair\n",
+    "Scheduling:\n",
+    "  Scheduler: slurm\n",
+    "  SlurmQueues:\n",
+    "    - Name: queue-1\n",
+    "      AllocationStrategy: lowest-price\n",
+    "      ComputeResources:\n",
+    "        - Name: queue-1-cr-1\n",
+    "          Instances:\n",
+    "            - InstanceType: c5n.large\n",
+    "          MaxCount: 1\n",
+    "          MinCount: 1\n",
+    "      ComputeSettings:\n",
+    "        LocalStorage:\n",
+    "          RootVolume:\n",
+    "            VolumeType: gp3\n",
+    "      Networking:\n",
+    "        SubnetIds:\n",
+    "          - subnet-0be4c22d8137b8085\n",
+    "  SlurmSettings: {}\n",
+    "Region: us-east-1\n",
+    "Image:\n",
+    "  Os: alinux2\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "71b66c2c",
+   "metadata": {},
+   "source": [
+    "### Set up your head node\n",
+    "\n",
+    "After cluster creation completes, click on the **Shell** button to access the cluster head node. We will be working in the ParallelCluster shell throughout the rest of the tutorial; all commands that follow are entered there.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0954a6c4",
+   "metadata": {},
+   "source": [
+    "1. Switch to the `ssm-user` and create a working directory"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2220e0f7-1b79-41b7-87c5-87f64f51fa51",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#Enter in the ParallelCluster shell \n",
+    "\n",
+    "sudo -su ssm-user\n",
+    "cd ~\n",
+    "mkdir workdir"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "73692fbe",
+   "metadata": {},
+   "source": [
+    "2. Install mamba. This is required as we will be executing Snakemake using mamba. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4649fe73-e777-4ca7-9c40-d97555282da0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#Enter in the ParallelCluster shell \n",
+    "\n",
+    "wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh\n",
+    "bash Miniconda3-latest-Linux-x86_64.sh\n",
+    "conda install mamba -c conda-forge\n",
+    "mamba --version"
+   ]
+  },
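+  {
+   "cell_type": "markdown",
+   "id": "5a1b2c3d",
+   "metadata": {},
+   "source": [
+    "Optionally, you can sanity-check the scheduler before moving on. `sinfo` and `squeue` are standard Slurm commands; with the cluster configuration above, `sinfo` should report one static compute node in your queue."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5e6f7a8b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#Enter in the ParallelCluster shell\n",
+    "\n",
+    "#List partitions and node states; the queue should show 1 node\n",
+    "sinfo\n",
+    "\n",
+    "#List running and pending jobs (empty on a fresh cluster)\n",
+    "squeue"
+   ]
+  },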
+  {
+   "cell_type": "markdown",
+   "id": "6c90d562",
+   "metadata": {},
+   "source": [
+    "3. Install Snakemake and the Snakemake ParallelCluster plugin. \n",
+    "\n",
+    "**Note:** the `pcluster-slurm` plugin requires Snakemake > 8.0.0"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fb52b883-9334-4674-adf6-1c99d237a5cc",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#Enter in the ParallelCluster shell \n",
+    "pip3 install snakemake==8.25.5\n",
+    "pip3 install snakemake-executor-plugin-pcluster-slurm"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5302b6ba",
+   "metadata": {},
+   "source": [
+    "Alternatively, you may use conda to install Snakemake using the following command: "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "488cba78-cd8a-4eba-9d09-5982b02cd42b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "conda install bioconda::snakemake==8.25.5"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fc10a99d",
+   "metadata": {},
+   "source": [
+    "### Submitting a \"Hello World\" job to the Slurm cluster using `sbatch`\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dc8187f0",
+   "metadata": {},
+   "source": [
+    "You can submit jobs to the Slurm cluster using a Slurm script and the `sbatch` command. \n",
+    "\n",
+    "1. Create a Slurm script. This example prints \"Hello, World!\" to an output file, then appends the name of the compute node the task ran on. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d4a39461-f53e-49db-9354-862f7fd8d20f",
+   "metadata": {},
+   "source": [
+    "To create files in the ParallelCluster shell, we use the text editor Vim."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0024d33b-b1de-4eb7-bef2-8d8a117a12b5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#To create the file use\n",
+    "vim hello-world.slurm"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "33ad0a63",
+   "metadata": {},
+   "source": [
+    "```bash \n",
+    "#!/bin/bash\n",
+    "#hello-world.slurm\n",
+    "#SBATCH --job-name=hello-world\n",
+    "#SBATCH --output=hello-world.out\n",
+    "#SBATCH --error=hello-world.err\n",
+    "#SBATCH --ntasks=1\n",
+    "#SBATCH --time=00:01:00\n",
+    "\n",
+    "echo \"Hello, World!\" > ~/workdir/hello-world.out\n",
+    "\n",
+    "## Print the hostname of the node the job ran on\n",
+    "echo \"This job ran on node: $(hostname)\" >> ~/workdir/hello-world.out\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a37ecabc-fbba-4d5d-9d93-57c3e045c4f6",
+   "metadata": {},
+   "source": [
+    "Once you are done editing your file, enter `:wq` and press Enter to save all your changes."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "907420c9",
+   "metadata": {},
+   "source": [
+    "2. Submit the job using the `sbatch` command "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "eaf133a7-deb1-4917-b8c1-9649754074f6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sbatch hello-world.slurm"
+   ]
+  },
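+  {
+   "cell_type": "markdown",
+   "id": "6a1b2c3d",
+   "metadata": {},
+   "source": [
+    "After submitting, you can follow the job and inspect its result. `squeue` lists the job while it is pending or running, and the output path below matches the one written by the script above."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6e6f7a8b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#Enter in the ParallelCluster shell\n",
+    "\n",
+    "#The hello-world job appears here until it finishes\n",
+    "squeue\n",
+    "\n",
+    "#View the job output once it completes\n",
+    "cat ~/workdir/hello-world.out"
+   ]
+  },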
+  {
+   "cell_type": "markdown",
+   "id": "5e062fca",
+   "metadata": {},
+   "source": [
+    "### Snakemake workflow files\n",
+    "\n",
+    "When running a Snakemake workflow, it is common to organize the workflow dependencies in the following structure: \n",
+    "\n",
+    "```bash\n",
+    "Project Folder\n",
+    "│\n",
+    "├── Snakefile\n",
+    "│\n",
+    "├── config.yaml\n",
+    "│\n",
+    "├── environment.yml\n",
+    "│\n",
+    "├── data\n",
+    "│   ├── file 1\n",
+    "│   └── file 2...\n",
+    "```\n",
+    "**Snakefile:** A Snakefile is the main file used in Snakemake to define a workflow. The commands to be executed, the input and output files, and the dependencies of each step are defined as rules in this file. This file must be present in the working directory; if it is named Snakefile, Snakemake will automatically recognize it as the workflow definition file. If it is named differently, you must use the `-s` flag to specify the file. \n",
+    "\n",
+    "**config.yaml:** The config.yaml file is used to store configuration parameters that can be easily accessed and utilized throughout the workflow. This file allows you to define various settings, paths, parameters, and other variables that your Snakemake rules might need.\n",
+    "\n",
+    "**environment.yml:** The environment.yml file defines the software environment required to run the Snakemake workflow, including package names and versions. \n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a57db5db",
+   "metadata": {},
+   "source": [
+    "### Snakefile Structure\n",
+    "\n",
+    "The required sections of a Snakefile include the rule definitions and the `rule all` target. \n",
+    "\n",
+    "**Rules:** \n",
+    "\n",
+    "* Rules are the building blocks of a Snakemake workflow. Each rule describes how to create one or more output files from input files. \n",
+    "* The commands that must be run, any scripts, and environment parameters can be defined in the rule definition.\n",
+    "\n",
+    "**Rule all:**\n",
+    "\n",
+    "* This rule tells Snakemake what the end products should be when the workflow is complete. It is usually the first rule that is defined. \n",
+    "\n",
+    "**Input and Output:** \n",
+    "\n",
+    "* These keywords specify the files that are inputs to and outputs from a rule. \n",
+    "* The input keyword lists the files needed to run the rule, and the output keyword lists the files that will be produced by the rule. \n",
+    "* The order of execution of rules is determined by the input and output fields. \n",
+    "\n",
+    "Example: \n",
+    "\n",
+    "```python\n",
+    "rule rule_1:\n",
+    "    input: \n",
+    "        \"input_1.txt\"\n",
+    "    output:\n",
+    "        \"output_1.txt\"\n",
+    "\n",
+    "rule rule_2: \n",
+    "    input: \n",
+    "        \"output_1.txt\"\n",
+    "    output: \n",
+    "        \"output_2.txt\" \n",
+    "```\n",
+    "\n",
+    "**Shell Command:** \n",
+    "\n",
+    "The shell keyword is used to specify the shell command that will be executed to produce the output files.\n",
+    "\n",
+    "**Snakefile Example** \n",
+    "\n",
+    "A Snakefile can look like this: \n",
+    "\n",
+    "```python\n",
+    "# Snakefile\n",
+    "\n",
+    "# Rule all: specifies the final target of the workflow\n",
+    "rule all:\n",
+    "    input:\n",
+    "        \"output_2.txt\"\n",
+    "\n",
+    "# Rule 1: processes input_1.txt to produce output_1.txt\n",
+    "rule rule_1:\n",
+    "    input:\n",
+    "        \"input_1.txt\"\n",
+    "    output:\n",
+    "        \"output_1.txt\"\n",
+    "    shell:\n",
+    "        \"cat {input} > {output}\"\n",
+    "\n",
+    "# Rule 2: processes output_1.txt to produce output_2.txt\n",
+    "rule rule_2:\n",
+    "    input:\n",
+    "        \"output_1.txt\"\n",
+    "    output:\n",
+    "        \"output_2.txt\"\n",
+    "    shell:\n",
+    "        \"wc -l {input} > {output}\"\n",
+    "\n",
+    "```"
+   ]
+  },
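+  {
+   "cell_type": "markdown",
+   "id": "7a1b2c3d",
+   "metadata": {},
+   "source": [
+    "**Tip:** once a Snakefile is in place, you can preview what Snakemake would do without executing anything by performing a dry run. The `-n` (`--dry-run`) flag lists the jobs that would be scheduled, and `-p` prints the shell command each rule would run."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7e6f7a8b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#Enter in the ParallelCluster shell, from the directory containing your Snakefile\n",
+    "snakemake -n -p"
+   ]
+  },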
+  {
+   "cell_type": "markdown",
+   "id": "9942bfc7",
+   "metadata": {},
+   "source": [
+    "### Submitting a \"Hello World\" script using Snakemake, Slurm and the `pcluster-slurm` plugin\n",
+    "\n",
+    "Now that we've covered the basics of writing a script in Snakemake, let's create and run our first workflow! \n",
+    "\n",
+    "#### Create and Run the workflow\n",
+    "1. Create a Snakefile within a project directory. \n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "edff7ec2-a39e-4416-86f3-7f143cc969f7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "mkdir hello-world-snakemake\n",
+    "cd hello-world-snakemake\n",
+    "vim Snakefile"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fd9086db",
+   "metadata": {},
+   "source": [
+    "Add the following content to the file: \n",
+    "\n",
+    "```python\n",
+    "#Snakefile\n",
+    "rule all:\n",
+    "    input:\n",
+    "        \"output.txt\"\n",
+    "\n",
+    "rule example_rule:\n",
+    "    output:\n",
+    "        \"output.txt\"\n",
+    "    shell:\n",
+    "        \"\"\"\n",
+    "        echo 'Hello, World!' > {output}\n",
+    "        \"\"\"\n",
+    "```\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "42d66604-feea-4940-b610-1a00675976dd",
+   "metadata": {},
+   "source": [
+    "Once you are done, be sure to type `:wq` and press Enter to save your changes."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8eb6c0a5-de84-4fb2-90ed-d8c47408de4f",
+   "metadata": {},
+   "source": [
+    "**Snakefile Breakdown** \n",
+    "* This Snakemake workflow defines a rule all that looks for a file called \"output.txt\". \n",
+    "* In the rule definition, the shell keyword is used to echo \"Hello, World!\" into the output file. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4590ade3",
+   "metadata": {},
+   "source": [
+    "2. Execute the workflow using the `snakemake` command, specifying `pcluster-slurm` as the executor.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e6b21d7c",
+   "metadata": {},
+   "source": [
+    "When submitting a Snakemake workflow to the Slurm scheduler integrated with AWS ParallelCluster, you can submit the workflow by using the `--executor` flag and specifying the `pcluster-slurm` plugin. The `-j` flag sets how many jobs may run in parallel."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b181935f-e90c-4208-b643-448243da8791",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "snakemake --executor pcluster-slurm -j 1"
+   ]
+  },
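+  {
+   "cell_type": "markdown",
+   "id": "8a1b2c3d",
+   "metadata": {},
+   "source": [
+    "When the run completes, the rule's output lands in the project directory. A quick check that the workflow produced what `rule all` asked for:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8e6f7a8b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#Enter in the ParallelCluster shell\n",
+    "cat output.txt"
+   ]
+  },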
+  {
+   "cell_type": "markdown",
+   "id": "fbf47ae7",
+   "metadata": {},
+   "source": [
+    "### Submitting a bioinformatics Snakemake workflow to the Slurm cluster\n",
+    "\n",
+    "In this example, we will use Snakemake and the `pcluster-slurm` plugin to run a bioinformatics pipeline. \n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0d0d944f",
+   "metadata": {},
+   "source": [
+    "#### Download the input data\n",
+    "\n",
+    "The input data consists of raw fastq files. Use the curl command to download the data from a public NIGMS Google Cloud Storage bucket."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f3bc502c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#Run in the ParallelCluster shell\n",
+    "\n",
+    "#Navigate to the working directory \n",
+    "mkdir ~/workdir/bioinfo-workflow\n",
+    "cd ~/workdir/bioinfo-workflow\n",
+    "\n",
+    "#Make a data directory \n",
+    "mkdir ./data \n",
+    "cd data\n",
+    "\n",
+    "#Run curl commands \n",
+    "curl -o SRR13349122_1.fastq https://storage.googleapis.com/nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/raw_fastqSub/SRR13349122_1.fastq\n",
+    "curl -o SRR13349122_2.fastq https://storage.googleapis.com/nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/raw_fastqSub/SRR13349122_2.fastq\n",
+    "curl -o SRR13349123_1.fastq https://storage.googleapis.com/nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/raw_fastqSub/SRR13349123_1.fastq\n",
+    "curl -o SRR13349123_2.fastq https://storage.googleapis.com/nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/raw_fastqSub/SRR13349123_2.fastq\n",
+    "curl -o SRR13349128_1.fastq https://storage.googleapis.com/nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/raw_fastqSub/SRR13349128_1.fastq\n",
+    "curl -o SRR13349128_2.fastq https://storage.googleapis.com/nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/raw_fastqSub/SRR13349128_2.fastq\n",
+    "\n",
+    "#Return to the project directory \n",
+    "cd .."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f431a966",
+   "metadata": {},
+   "source": [
+    "#### Create an `environment.yml` file"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "de0c673e-4b9b-4249-add7-0b2c0654ca55",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#Create the file\n",
+    "vi environment.yml"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2d7754cc-74a4-4476-9523-183079aeccd7",
+   "metadata": {},
+   "source": [
+    "Add the following contents to the environment file:"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f64db158",
+   "metadata": {},
+   "source": [
+    "```yaml\n",
+    "name: bioinformatics-test\n",
+    "channels:\n",
+    "  - bioconda\n",
+    "  - conda-forge\n",
+    "  - defaults\n",
+    "dependencies:\n",
+    "  - bwa\n",
+    "  - samtools\n",
+    "  - bcftools\n",
+    "  - matplotlib\n",
+    "  - pandas\n",
+    "  - pysam\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "648dfc27",
+   "metadata": {},
+   "source": [
+    "#### Create a `config.yaml` file"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9b4369c1-a0b1-4bf5-a5e9-d92105c81087",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "vi config.yaml"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2025eced-08f6-4da7-89a0-e8917cdeca37",
+   "metadata": {},
+   "source": [
+    "Add the following contents to the config file:"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e27bd7f5",
+   "metadata": {},
+   "source": [
+    "```yaml\n",
+    "conda_env: \"environment.yml\"\n",
+    "```"
+   ]
+  },
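+  {
+   "cell_type": "markdown",
+   "id": "9a1b2c3d",
+   "metadata": {},
+   "source": [
+    "Before writing the Snakefile, it is worth confirming that the project layout matches what the workflow expects: the fastq files under `data/`, and the two YAML files at the project root."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9e6f7a8b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#Enter in the ParallelCluster shell\n",
+    "\n",
+    "#Should list: config.yaml  data  environment.yml\n",
+    "ls\n",
+    "\n",
+    "#Should list the six downloaded fastq files\n",
+    "ls data"
+   ]
+  },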
+  {
+   "cell_type": "markdown",
+   "id": "c5ae0487",
+   "metadata": {},
+   "source": [
+    "#### Snakefile Breakdown:\n",
+    "\n",
+    "In this section, we'll break down what the Snakemake workflow will accomplish by looking at the sections of the Snakefile. \n",
+    "\n",
+    "**Overview:** \n",
+    "The Snakefile maps a set of fastq files to a reference genome, sorts and indexes the mapped reads, and finally runs variant calling on the mapped reads.\n",
+    "\n",
+    "**Configuration:** \n",
+    "* `configfile: \"config.yaml\"` specifies the configuration file you have created.\n",
+    "* `SAMPLES = [\"A\", \"B\"]` defines the samples to be processed.\n",
+    "\n",
+    "**Workflow:**\n",
+    "* **all:** Specifies the final output files required to complete the workflow.\n",
+    "* Bioinformatics rules \n",
+    "    * **bwa_index:** Indexes the reference genome file (data/genome.fa) for alignment.\n",
+    "    * **bwa_map:** Maps the sequencing reads (data/samples/{sample}.fastq) to the indexed genome and converts the output to BAM format.\n",
+    "    * **samtools_sort:** Sorts the BAM files generated from the mapping step.\n",
+    "    * **samtools_index:** Indexes the sorted BAM files for faster access.\n",
+    "    * **bcftools_call:** Calls genetic variants from the sorted and indexed BAM files."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "977f99c6",
+   "metadata": {},
+   "source": [
+    "#### Make the Snakefile \n",
+    "\n",
+    "Create a Snakefile and add the contents described below to it."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e6ee453c-435a-4c21-bf7d-051c918f93f8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#Create the Snakefile\n",
+    "vi Snakefile"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0c5e54b4",
+   "metadata": {},
+   "source": [
+    "```python\n",
+    "#Snakefile\n",
+    "configfile: \"config.yaml\"\n",
+    "\n",
+    "SAMPLES = [\"A\", \"B\"]\n",
+    "\n",
+    "rule all:\n",
+    "    input:\n",
+    "        expand(\"mapped_reads/{sample}.bam\", sample=SAMPLES),\n",
+    "        expand(\"sorted_reads/{sample}.bam.bai\", sample=SAMPLES),\n",
+    "        \"calls/all.vcf\"\n",
+    "\n",
+    "rule bwa_index:\n",
+    "    input:\n",
+    "        \"data/genome.fa\"\n",
+    "    output:\n",
+    "        \"data/genome.fa.bwt\"\n",
+    "    conda:\n",
+    "        config[\"conda_env\"]\n",
+    "    shell:\n",
+    "        \"\"\"\n",
+    "        bwa index {input}\n",
+    "        \"\"\"\n",
+    "\n",
+    "rule bwa_map:\n",
+    "    input:\n",
+    "        genome=\"data/genome.fa\",\n",
+    "        fastq=\"data/samples/{sample}.fastq\",\n",
+    "        index=\"data/genome.fa.bwt\"\n",
+    "    output:\n",
+    "        \"mapped_reads/{sample}.bam\"\n",
+    "    conda:\n",
+    "        config[\"conda_env\"]\n",
+    "    shell:\n",
+    "        \"\"\"\n",
+    "        bwa mem {input.genome} {input.fastq} > mapped_reads/{wildcards.sample}.sam\n",
+    "        samtools view -Sb mapped_reads/{wildcards.sample}.sam > {output}\n",
+    "        rm mapped_reads/{wildcards.sample}.sam\n",
+    "        \"\"\"\n",
+    "\n",
+    "rule samtools_sort:\n",
+    "    input:\n",
+    "        \"mapped_reads/{sample}.bam\"\n",
+    "    output:\n",
+    "        \"sorted_reads/{sample}.bam\"\n",
+    "    conda:\n",
+    "        config[\"conda_env\"]\n",
+    "    shell:\n",
+    "        \"samtools sort -T sorted_reads/{wildcards.sample} -O bam {input} > {output}\"\n",
+    "\n",
+    "rule samtools_index:\n",
+    "    input:\n",
+    "        \"sorted_reads/{sample}.bam\"\n",
+    "    output:\n",
+    "        \"sorted_reads/{sample}.bam.bai\"\n",
+    "    conda:\n",
+    "        config[\"conda_env\"]\n",
+    "    shell:\n",
+    "        \"samtools index {input}\"\n",
+    "\n",
+    "rule bcftools_call:\n",
+    "    input:\n",
+    "        fa=\"data/genome.fa\",\n",
+    "        bam=expand(\"sorted_reads/{sample}.bam\", sample=SAMPLES),\n",
+    "        bai=expand(\"sorted_reads/{sample}.bam.bai\", sample=SAMPLES)\n",
+    "    output:\n",
+    "        \"calls/all.vcf\"\n",
+    "    conda:\n",
+    "        config[\"conda_env\"]\n",
+    "    shell:\n",
+    "        \"bcftools mpileup -f {input.fa} {input.bam} | bcftools call -mv - > {output}\"\n",
+    "\n",
+    "```"
+   ]
+  },
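+  {
+   "cell_type": "markdown",
+   "id": "10a1b2c3",
+   "metadata": {},
+   "source": [
+    "Note that this Snakefile expects a reference genome at `data/genome.fa` and per-sample reads at `data/samples/{sample}.fastq` for samples `A` and `B`. The downloaded SRR fastq files do not follow that naming, so arrange your inputs to match before running. A minimal sketch, assuming you reuse two of the downloaded files as samples `A` and `B` and supply your own reference FASTA:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "10e6f7a8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#Enter in the ParallelCluster shell\n",
+    "mkdir -p data/samples\n",
+    "\n",
+    "#Hypothetical mapping of the downloaded reads onto the sample names used in the Snakefile\n",
+    "cp data/SRR13349122_1.fastq data/samples/A.fastq\n",
+    "cp data/SRR13349123_1.fastq data/samples/B.fastq\n",
+    "\n",
+    "#Provide a reference genome of your choice at data/genome.fa (the path below is a placeholder)\n",
+    "#cp /path/to/your/reference.fa data/genome.fa"
+   ]
+  },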
+  {
+   "cell_type": "markdown",
+   "id": "2f186b6f",
+   "metadata": {},
+   "source": [
+    "#### Execute the workflow \n",
+    "\n",
+    "Execute the workflow using the `snakemake` command, specifying **pcluster-slurm** as the executor and **conda** as the environment management system. Here we are adding two new flags:\n",
+    "\n",
+    "- **--use-conda --conda-frontend mamba**: `--use-conda` tells Snakemake to manage each rule's dependencies in a conda environment, and `--conda-frontend mamba` uses mamba to build those environments. Snakemake will look for the environment files specified in the workflow rules (here, via `conda: config[\"conda_env\"]`) and create the environments accordingly. \n",
+    "\n",
+    "- **-j**: This flag specifies the maximum number of jobs to run in parallel."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "936e689c-2c80-4f3a-8e40-32ba6a140ebe",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#Enter in the ParallelCluster shell\n",
+    "snakemake --executor pcluster-slurm --use-conda --conda-frontend mamba -j 5"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9b2213dd",
+   "metadata": {},
+   "source": [
+    "## Conclusion \n",
+    "\n",
+    "Congratulations! You have successfully learned how to set up and manage High-Performance Computing (HPC) clusters on AWS using AWS ParallelCluster, and you've run a Snakemake workflow on this cluster. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1a5cf6fa",
+   "metadata": {},
+   "source": [
+    "## Clean Up: \n",
+    "- To stop all compute nodes, navigate to the PCUI and select **Stop fleet**.\n",
+    "- To clean up, in the Clusters view, select the cluster and choose **Actions** > **Delete cluster**.\n",
+    "\n",
+    "## References: \n",
+    "* [AWS ParallelCluster](https://docs.aws.amazon.com/parallelcluster/)\n",
+    "* [Snakemake Documentation](https://snakemake.readthedocs.io/en/stable/)\n",
+    "* [Snakemake `pcluster-slurm` plugin](https://snakemake.github.io/snakemake-plugin-catalog/plugins/executor/pcluster-slurm.html)\n"
+   ]
+  }
+ ],
+ "metadata": {},
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/notebooks/Snakemake/aws-parallel-cluster-files/bioinformatics-example/Snakefile b/notebooks/Snakemake/aws-parallel-cluster-files/bioinformatics-example/Snakefile
new file mode 100644
index 0000000..b40b642
--- /dev/null
+++ b/notebooks/Snakemake/aws-parallel-cluster-files/bioinformatics-example/Snakefile
@@ -0,0 +1,70 @@
+#Snakefile
+configfile: "config.yaml"
+
+SAMPLES = ["A", "B"]
+
+rule all:
+    input:
+        expand("mapped_reads/{sample}.bam", sample=SAMPLES),
+        expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES),
+        "calls/all.vcf"
+
+rule bwa_index:
+    input:
+        "data/genome.fa"
+    output:
+        "data/genome.fa.bwt"
+    conda:
+        config["conda_env"]
+    shell:
+        """
+        bwa index {input}
+        """
+
+rule bwa_map:
+    input:
+        genome="data/genome.fa",
+        fastq="data/samples/{sample}.fastq",
+        index="data/genome.fa.bwt"
+    output:
+        "mapped_reads/{sample}.bam"
+    conda:
+        config["conda_env"]
+    shell:
+        """
+        bwa mem {input.genome} {input.fastq} > mapped_reads/{wildcards.sample}.sam
+        samtools view -Sb mapped_reads/{wildcards.sample}.sam > {output}
+        rm mapped_reads/{wildcards.sample}.sam
+        """
+
+rule samtools_sort:
+    input:
+        "mapped_reads/{sample}.bam"
+    output:
+        "sorted_reads/{sample}.bam"
+    conda:
+        config["conda_env"]
+    shell:
+        "samtools sort -T sorted_reads/{wildcards.sample} -O bam {input} > {output}"
+
+rule samtools_index:
+    input:
+        "sorted_reads/{sample}.bam"
+    output:
+        "sorted_reads/{sample}.bam.bai"
+    conda:
+        config["conda_env"]
+    shell:
+        "samtools index {input}"
+
+rule bcftools_call:
+    input:
+        fa="data/genome.fa",
+        bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
+        bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
+    output:
+        "calls/all.vcf"
+    conda:
+        config["conda_env"]
+    shell:
+        "bcftools mpileup -f {input.fa} {input.bam} | bcftools call -mv - > {output}"
diff --git a/notebooks/Snakemake/aws-parallel-cluster-files/bioinformatics-example/config.yml b/notebooks/Snakemake/aws-parallel-cluster-files/bioinformatics-example/config.yml
new file mode 100644
index 0000000..aa7dd98
--- /dev/null
+++ b/notebooks/Snakemake/aws-parallel-cluster-files/bioinformatics-example/config.yml
@@ -0,0 +1 @@
+conda_env: "environment.yml"
diff --git a/notebooks/Snakemake/aws-parallel-cluster-files/bioinformatics-example/environment.yml b/notebooks/Snakemake/aws-parallel-cluster-files/bioinformatics-example/environment.yml
new file mode 100644
index 0000000..fd002e8
--- /dev/null
+++ b/notebooks/Snakemake/aws-parallel-cluster-files/bioinformatics-example/environment.yml
@@ -0,0 +1,12 @@
+name: bioinformatics-test
+channels:
+  - bioconda
+  - conda-forge
+  - defaults
+dependencies:
+  - bwa
+  - samtools
+  - bcftools
+  - matplotlib
+  - pandas
+  - pysam
diff --git a/notebooks/Snakemake/aws-parallel-cluster-files/hello-world-snakemake/Snakefile b/notebooks/Snakemake/aws-parallel-cluster-files/hello-world-snakemake/Snakefile
new file mode 100644
index 0000000..8d9a0b4
--- /dev/null
+++ b/notebooks/Snakemake/aws-parallel-cluster-files/hello-world-snakemake/Snakefile
@@ -0,0 +1,12 @@
+#Snakefile
+rule all:
+    input:
+        "output.txt"
+
+rule example_rule:
+    output:
+        "output.txt"
+    shell:
+        """
+        echo 'Hello, World!' > {output}
+        """
diff --git a/notebooks/Snakemake/aws-parallel-cluster-files/hello-world.slurm b/notebooks/Snakemake/aws-parallel-cluster-files/hello-world.slurm
new file mode 100644
index 0000000..84afd98
--- /dev/null
+++ b/notebooks/Snakemake/aws-parallel-cluster-files/hello-world.slurm
@@ -0,0 +1,9 @@
+#!/bin/bash
+#SBATCH --job-name=hello-world
+#SBATCH --output=hello-world.out
+#SBATCH --error=hello-world.err
+#SBATCH --ntasks=1
+#SBATCH --time=00:01:00
+
+echo "Hello, World!" > ~/workdir/hello-world.out
+echo "This job ran on node: $(hostname)" >> ~/workdir/hello-world.out