
Commit a1b23cf

update README
1 parent d1242ee commit a1b23cf

File tree

1 file changed (+59, -110 lines)

README.md

Lines changed: 59 additions & 110 deletions
@@ -11,21 +11,16 @@ Graph database representing IPD-IMGT/HLA sequence data as GFE.
 - [Table of Contents](#table-of-contents)
 - [Project Structure](#project-structure)
 - [Description](#description)
-- [Build Service](#build-service)
-- [Load Service](#load-service)
-- [Database Service](#database-service)
-- [CloudFormation Templates](#cloudformation-templates)
+- [Services](#services)
+- [Infrastructure](#infrastructure)
+- [Database](#database)
+- [Pipeline](#pipeline)
 - [Installation](#installation)
 - [Prerequisites](#prerequisites)
-- [Usage](#usage)
-- [Deployment using Makefile](#deployment-using-makefile)
-- [Local Development](#local-development)
-- [Creating a Python Virtual Environment](#creating-a-python-virtual-environment)
-- [Environment Variables](#environment-variables)
-- [Run Neo4j Docker](#run-neo4j-docker)
-- [Load the dataset into Neo4j](#load-the-dataset-into-neo4j)
-- [Memory Management](#memory-management)
-- [Clean Up](#clean-up)
+- [AWS Configuration](#aws-configuration)
+- [Environment Variables](#environment-variables)
+- [Deployment](#deployment)
+- [Troubleshooting](#troubleshooting)
 - [Authors](#authors)
 - [References & Links](#references--links)

@@ -57,91 +52,76 @@ Graph database representing IPD-IMGT/HLA sequence data as GFE.
 ```
 
 ## Description
-The `gfe-db` represents IPD-IMGT/HLA sequence data as GFE nodes and relationships in a Neo4j graph database. The architecture to run and update `gfe-db` contains 3 basic components:
-- Build Service
-- Load Service
-- Database Service
+The `gfe-db` represents IPD-IMGT/HLA sequence data as GFE nodes and relationships in a Neo4j graph database. Running this application will set up the following services in AWS:
+- VPC and subnet
+- Neo4j database server
+- Update pipeline and trigger
 
-This project is meant to be deployed and run on AWS.
+## Services
+The project organizes its resources by service. Deployments are decoupled using Makefiles, and shared configuration is handled through SSM Parameter Store and Secrets Manager.
 
-### Build Service
-The build service is triggered when a new IMGT/HLA version is released. AWS Batch is used to deploy a container to an EC2 instance which will run the build script and generate a dataset of CSVs. These are uploaded to S3 where they can be accessed by the load service. This service is located inside the `build/` directory.
+### Infrastructure
+The infrastructure service deploys a VPC, public subnet, and initial SSM parameters and secrets for the other services to use.
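Once deployed, the other services can read this shared configuration at runtime. A minimal sketch using the AWS CLI; the parameter name and secret ID are assumptions, not taken from the repository:

```bash
# Read a shared parameter published by the infrastructure service (name is hypothetical)
aws ssm get-parameter \
  --name "/gfe-db/DataBucketName" \
  --query "Parameter.Value" \
  --output text

# Read a shared secret, e.g. Neo4j credentials (secret ID is hypothetical)
aws secretsmanager get-secret-value \
  --secret-id "gfe-db/Neo4jCredentials" \
  --query "SecretString" \
  --output text
```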
 
-### Load Service
-The load service runs once the build service completes. For each CSV file generated by the build service, a pre-signed URL is created and inserted into the `LOAD CSV FROM ...` statement within the Cypher script. Each statement in the script is sent to the Neo4j server using the HTTP API. The load service runs until Neo4j is done loading. This service is located inside the `load/` directory.
+### Database
+The database service deploys an EC2 instance hosting a Neo4j Docker container into a public subnet so that it can be accessed through a browser.
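The Neo4j Browser is served on port 7474 of the instance's public DNS. A minimal sketch for locating the host with the AWS CLI; the `Name` tag value is an assumption:

```bash
# Look up the public DNS of the running database instance (tag value is hypothetical)
NEO4J_HOST=$(aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=gfe-db" "Name=instance-state-name,Values=running" \
  --query "Reservations[0].Instances[0].PublicDnsName" \
  --output text)

# Neo4j Browser listens on port 7474 (macOS `open`; use xdg-open on Linux)
open "http://${NEO4J_HOST}:7474"
```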
 
-### Database Service
-Neo4j is deployed within a Docker container to an EC2 instance. Indexes and constraints are set to expedite transactions. The browser can be accessed at port 7474 of the public DNS server found in the EC2 console. This service is located inside the `neo4j/` directory.
-
-### CloudFormation Templates
-CloudFormation templates define the architecture that is deployed to AWS. The basic resources include:
-- VPC with public subnets
-- S3 bucket for templates, data, backups and logs
-- IAM permissions
-- AWS Batch job definitions, queues and compute environments for build and load services
-- StepFunctions state machine to orchestrate the build and load service
-- ECR repositories to host the container images used for the build and load services
-- EC2 Launch Template for deploying Neo4j
-
-```bash
-.
-└── cfn
-    ├── database-stack.yml # Provisions a VPC and database to EC2
-    ├── master-stack.yml # Deploys the database and update pipeline stacks
-    ├── setup.yml # Provisions the S3 bucket used for template artifacts, secrets, data and logs
-    └── update-pipeline-stack.yml # Provisions the StepFunctions workflow and AWS Batch resources
-```
-
-<!-- ## To Do's
-- [ ] Use Fargate with AWS Batch for the load service instead of EC2 to save cost
-- [x] Create nested cloudformation templates
-- [ ] Add CI/CD for Docker images
-- [ ] Add trigger for when a new IMGT/HLA version is released
-- [x] Load script can be optimized
-- [x] Better logging
-- [x] Clean up the Cypher script to avoid Neo4j errors
-- [ ] Add a Makefile
-- [ ] Add SSL policy for Neo4j to use HTTPS
-- [ ] Deploy Neo4j with the APOC and Data Science plugins
-- [ ] Update the userdata in the database.yml template to use the current version
-- [ ] Add structured logs to the scripts
-- [ ] Remove logs, CSVs from S3 prefix before a new build begins
-- [ ] Update the Neo4j configuration to set users and roles for security
-- [ ] Add architecture diagram to documentation
-- [ ] Add constraints to cfn parameters
-- [ ] Add logic to allow building and loading locally -->
+### Pipeline
+The pipeline service automates database updates. A scheduled trigger Lambda watches the source data repository and starts the pipeline when a new IMGT/HLA version is released. The pipeline uses a StepFunctions state machine to orchestrate the build and load steps on AWS Batch.
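The state machine can also be started by hand, for example to backfill a specific release. A minimal sketch; the ARN is an assumption and the input fields are modeled on the old `make run release=3450 limit=1000` target:

```bash
# Manually start the build/load pipeline (ARN and input schema are hypothetical)
aws stepfunctions start-execution \
  --state-machine-arn "arn:aws:states:us-east-1:123456789012:stateMachine:gfe-db-update-pipeline" \
  --input '{"release": "3450", "limit": ""}'
```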
 
 ## Installation
-Follow the steps to set up a local development environment.
+Follow the steps to set up the deployment environment.
 
 ### Prerequisites
 * Python 3.8
 * GNU Make 3.81
-* Docker
 * AWS CLI
+* SAM CLI
+* Docker
 * jq
 
-## Usage
+### AWS Configuration
+Valid AWS credentials must be available to the AWS CLI and SAM CLI. The easiest way to do this is to run `aws configure`, or to add them to `~/.aws/credentials` and export the `AWS_PROFILE` variable to the environment.
+
+For more information, visit the documentation page:
+[Configuration and credential file settings](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html)
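A minimal sketch of the credentials-file approach; the `gfe-db` profile name is an assumption:

```bash
# ~/.aws/credentials (profile name is hypothetical)
# [gfe-db]
# aws_access_key_id = <value>
# aws_secret_access_key = <value>

# Point the AWS CLI and SAM CLI at that profile
export AWS_PROFILE=gfe-db
```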
+
+### Environment Variables
+Non-sensitive environment variables are handled by the root Makefile. Sensitive environment variables containing secrets like passwords and API keys must be exported to the environment first.
 
-### Deployment using Makefile
-Make sure to update your AWS credentials in `~/.aws/credentials`.
+Create a `.env` file in the project root.
+```bash
+NEO4J_USERNAME=<value>
+NEO4J_PASSWORD=<value>
+GITHUB_PERSONAL_ACCESS_TOKEN=<value>
+```
 
+Source the variables to the environment.
 ```bash
-# Get an overview of make arguments and variables
-make
+set -a
+source .env
+set +a
+```
+**Important:** Always use a `.env` file, AWS SSM Parameter Store, or Secrets Manager for sensitive variables like credentials and API keys. Never hard-code them, including when developing. AWS will quarantine an account if any credentials are accidentally exposed, and this will cause problems. **Make sure `.env` is listed in `.gitignore`.**
 
-# Deploy infrastructure to AWS
+## Deployment
+Once an AWS profile is configured and environment variables are exported, the application can be deployed using `make`.
+```bash
 make deploy
+```
+It is also possible to deploy or update individual services.
+```bash
+# Deploy/update only the infrastructure service
+make deploy.infrastructure
 
-# Build and load 1000 alleles from IMGT/HLA release version 3450
-# Leave limit blank to load all alleles
-make run release=3450 limit=1000
+# Deploy/update only the database service
+make deploy.database
 
-# Delete all data and tear down the architecture
-make delete
+# Deploy/update only the pipeline service
+make deploy.pipeline
 ```
 
-## Local Development
+<!-- ## Local Development
 
 ### Creating a Python Virtual Environment
 When developing locally, you will need to create individual virtual environments inside the `build/` and `load/` directories, since they require different dependencies:
@@ -153,31 +133,6 @@ pip install -U pip
 pip install -r requirements.txt
 ```
 
-### Environment Variables
-Add a `.env` file with the following variables.
-```bash
-GFE_BUCKET=<value>
-RELEASES=<value>
-ALIGN=<value>
-KIR=<value>
-MEM_PROFILE=<value>
-LIMIT=<value>
-NEO4J_HOST=<value>
-NEO4J_USERNAME=<value>
-NEO4J_PASSWORD=<value>
-```
-Run the command to export environment variables to the shell.
-```bash
-set -a
-source .env
-set +a
-```
-
-*Important:* *Always use a `.env` file or AWS SSM Parameter Store or Secrets Manager for sensitive variables like credentials and API keys. Never hard-code them, including when developing. AWS will quarantine an account if any credentials get accidentally exposed and this will cause problems. **MAKE SURE `.env` IS LISTED IN `.gitignore`.**
-
-<!-- ## Usage
-Follow these steps in sequence to build and load `gfe-db` locally. Make sure that your environment variables are set correctly before proceeding. -->
-
 ### Run Neo4j Docker
 Build the Docker image as defined in the Dockerfile. See [Configuring Neo4j in Dockerfile](#Configuring-Neo4j-in-Dockerfile) for important configuration settings.
 ```
@@ -205,7 +160,7 @@ docker stop gfe-db
 
 # Start container
 docker start gfe-db
-```
+``` -->
 
 <!-- ### Build GFE dataset
 Run the command to build the container for the build service.
@@ -266,7 +221,7 @@ Development notebook for refactoring `gfe-db` and the `build/src/build_gfedb.py`
 
 <!-- ## Running Tests -->
 
-## Configuring Neo4j in Dockerfile
+<!-- ## Configuring Neo4j in Dockerfile
 Configuration settings for Neo4j are passed through environment variables in the Dockerfile.
 
 ### Username & Password
@@ -282,7 +237,7 @@ Optimal memory for Neo4j depends on available RAM. Loading and querying a larger
 # Dockerfile; Rebuild the image after updating these
 ENV NEO4J_dbms_memory_heap_initial__size=2G
 ENV NEO4J_dbms_memory_heap_max__size=2G
-```
+``` -->
 
 <!-- ## Deployment
 `gfe-db` is deployed using Docker to an EC2 instance. Automated builds and loading of `gfe-db` on AWS is orchestrated using AWS Batch and StepFunctions. The infrastructure is defined using CloudFormation templates.
@@ -346,12 +301,6 @@ jupyter kernelspec uninstall gfe-db
 ## Troubleshooting
 * Check your AWS credentials in `~/.aws/credentials`
 * Check that the environment variables have been exported
-* Sometimes Neo4j sets permissions on mounted volumes. To get around this run this from the project root:
-```bash
-sudo chmod -R 777 .
-```
-* Check that the virtual environment is activated: `source .venv/bin/activate`
-* Check that requirements are installed: `pip install -r requirements.txt`
 * Check that Python 3.8 is being used
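The remaining checks can be scripted; a minimal sketch (the `NEO4J` pattern assumes the variable names from the `.env` example above):

```bash
# Confirm the active credentials resolve to the expected AWS account
aws sts get-caller-identity

# Confirm sensitive variables were exported (values will be printed)
printenv | grep NEO4J

# Confirm the Python version in use
python3 --version  # should report 3.8.x
```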
 
 ## Authors
