
Commit a1b23cf

update README
1 parent d1242ee commit a1b23cf

File tree

1 file changed (+59, -110 lines)

README.md

Lines changed: 59 additions & 110 deletions
@@ -11,21 +11,16 @@ Graph database representing IPD-IMGT/HLA sequence data as GFE.
 - [Table of Contents](#table-of-contents)
 - [Project Structure](#project-structure)
 - [Description](#description)
-- [Build Service](#build-service)
-- [Load Service](#load-service)
-- [Database Service](#database-service)
-- [CloudFormation Templates](#cloudformation-templates)
+- [Services](#services)
+- [Infrastructure](#infrastructure)
+- [Database](#database)
+- [Pipeline](#pipeline)
 - [Installation](#installation)
 - [Prerequisites](#prerequisites)
-- [Usage](#usage)
-- [Deployment using Makefile](#deployment-using-makefile)
-- [Local Development](#local-development)
-- [Creating a Python Virtual Environment](#creating-a-python-virtual-environment)
-- [Environment Variables](#environment-variables)
-- [Run Neo4j Docker](#run-neo4j-docker)
-- [Load the dataset into Neo4j](#load-the-dataset-into-neo4j)
-- [Memory Management](#memory-management)
-- [Clean Up](#clean-up)
+- [AWS Configuration](#aws-configuration)
+- [Environment Variables](#environment-variables)
+- [Deployment](#deployment)
+- [Troubleshooting](#troubleshooting)
 - [Authors](#authors)
 - [References & Links](#references--links)

@@ -57,91 +52,76 @@ Graph database representing IPD-IMGT/HLA sequence data as GFE.
 ```
 
 ## Description
-The `gfe-db` represents IPD-IMGT/HLA sequence data as GFE nodes and relationships in a Neo4j graph database. The architecture to run and update `gfe-db` contains 3 basic components:
-- Build Service
-- Load Service
-- Database Service
+The `gfe-db` represents IPD-IMGT/HLA sequence data as GFE nodes and relationships in a Neo4j graph database. Running this application will set up the following services in AWS:
+- VPC and subnet
+- Neo4j database server
+- Update pipeline and trigger
 
-This project is meant to be deployed and run on AWS.
+## Services
+The project organizes its resources by service. Deployments are decoupled using Makefiles, and shared configuration is handled through SSM Parameter Store and Secrets Manager.
 
-### Build Service
-The build service is triggered when a new IMGT/HLA version is released. AWS Batch is used to deploy a container to an EC2 instance which will run the build script and generate a dataset of CSVs. These are uploaded to S3 where they can be accessed by the load service. This service is located inside the `build/` directory.
+### Infrastructure
+The infrastructure service deploys a VPC, public subnet, and initial SSM parameters and secrets for the other services to use.
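Once deployed, the other services can read this shared configuration at runtime. A minimal sketch using the AWS CLI; the parameter name and secret ID are assumptions, not taken from the repository:

```bash
# Read a shared parameter published by the infrastructure service (name is hypothetical)
aws ssm get-parameter \
  --name "/gfe-db/DataBucketName" \
  --query "Parameter.Value" \
  --output text

# Read a shared secret, e.g. Neo4j credentials (secret ID is hypothetical)
aws secretsmanager get-secret-value \
  --secret-id "gfe-db/Neo4jCredentials" \
  --query "SecretString" \
  --output text
```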
 
-### Load Service
-The load service runs once the build service completes. For each CSV file generated by the build service, a pre-signed URL is created and inserted into the `LOAD CSV FROM ...` statement within the Cypher script. Each statement in the script is sent to the Neo4j server using the HTTP API. The load service runs until Neo4j is done loading. This service is located inside the `load/` directory.
+### Database
+The database service deploys an EC2 instance hosting a Neo4j Docker container into a public subnet so that it can be accessed through a browser.
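The Neo4j Browser is served on port 7474 of the instance's public DNS. A minimal sketch for locating the host with the AWS CLI; the `Name` tag value is an assumption:

```bash
# Look up the public DNS of the running database instance (tag value is hypothetical)
NEO4J_HOST=$(aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=gfe-db" "Name=instance-state-name,Values=running" \
  --query "Reservations[0].Instances[0].PublicDnsName" \
  --output text)

# Neo4j Browser listens on port 7474 (macOS `open`; use xdg-open on Linux)
open "http://${NEO4J_HOST}:7474"
```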
 
-### Database Service
-Neo4j is deployed within a Docker container to an EC2 instance. Indexes and constraints are set to expedite transactions. The browser can be accessed at port 7474 of the public DNS server found in the EC2 console. This service is located inside the `neo4j/` directory.
-
-### CloudFormation Templates
-CloudFormation templates define the architecture that is deployed to AWS. The basic resources include:
-- VPC with public subnets
-- S3 bucket for templates, data, backups and logs
-- IAM permissions
-- AWS Batch job definitions, queues and compute environments for build and load services
-- StepFunctions state machine to orchestrate the build and load service
-- ECR repositories to host the container images used for the build and load services
-- EC2 Launch Template for deploying Neo4j
-
-```bash
-.
-└── cfn
-    ├── database-stack.yml # Provisions a VPC and database to EC2
-    ├── master-stack.yml # Deploys the database and update pipeline stacks
-    ├── setup.yml # Provisions the S3 bucket used for template artifacts, secrets, data and logs
-    └── update-pipeline-stack.yml # Provisions the StepFunctions workflow and AWS Batch resources
-```
-
-<!-- ## To Do's
-- [ ] Use Fargate with AWS Batch for the load service instead of EC2 to save cost
-- [x] Create nested cloudformation templates
-- [ ] Add CI/CD for Docker images
-- [ ] Add trigger for when a new IMGT/HLA version is released
-- [x] Load script can be optimized
-- [x] Better logging
-- [x] Clean up the Cypher script to avoid Neo4j errors
-- [ ] Add a Makefile
-- [ ] Add SSL policy for Neo4j to use HTTPS
-- [ ] Deploy Neo4j with the APOC and Data Science plugins
-- [ ] Update the userdata in the database.yml template to use the current version
-- [ ] Add structured logs to the scripts
-- [ ] Remove logs, CSVs from S3 prefix before a new build begins
-- [ ] Update the Neo4j configuration to set users and roles for security
-- [ ] Add architecture diagram to documentation
-- [ ] Add constraints to cfn parameters
-- [ ] Add logic to allow building and loading locally -->
+### Pipeline
+The pipeline service automates database updates. A scheduled trigger Lambda watches the source data repository and starts the pipeline when a new IMGT/HLA version is released. The pipeline uses a StepFunctions state machine to orchestrate the build and load steps on AWS Batch.
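The state machine can also be started by hand, for example to backfill a specific release. A minimal sketch; the ARN is an assumption and the input fields are modeled on the old `make run release=3450 limit=1000` target:

```bash
# Manually start the build/load pipeline (ARN and input schema are hypothetical)
aws stepfunctions start-execution \
  --state-machine-arn "arn:aws:states:us-east-1:123456789012:stateMachine:gfe-db-update-pipeline" \
  --input '{"release": "3450", "limit": ""}'
```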
 
 ## Installation
-Follow the steps to set up a local development environment.
+Follow the steps to set up the deployment environment.
 
 ### Prerequisites
 * Python 3.8
 * GNU Make 3.81
-* Docker
 * AWS CLI
+* SAM CLI
+* Docker
 * jq
 
-## Usage
+### AWS Configuration
+Valid AWS credentials must be available to the AWS CLI and SAM CLI. The easiest way to do this is to run `aws configure`, or to add them to `~/.aws/credentials` and export the `AWS_PROFILE` variable to the environment.
+
+For more information, visit the documentation page:
+[Configuration and credential file settings](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html)
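A minimal sketch of the credentials-file approach; the `gfe-db` profile name is an assumption:

```bash
# ~/.aws/credentials (profile name is hypothetical)
# [gfe-db]
# aws_access_key_id = <value>
# aws_secret_access_key = <value>

# Point the AWS CLI and SAM CLI at that profile
export AWS_PROFILE=gfe-db
```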
+
+### Environment Variables
+Non-sensitive environment variables are handled by the root Makefile. Sensitive environment variables containing secrets like passwords and API keys must be exported to the environment first.
 
-### Deployment using Makefile
-Make sure to update your AWS credentials in `~/.aws/credentials`.
+Create a `.env` file in the project root.
+```bash
+NEO4J_USERNAME=<value>
+NEO4J_PASSWORD=<value>
+GITHUB_PERSONAL_ACCESS_TOKEN=<value>
+```
 
+Source the variables to the environment.
 ```bash
-# Get an overview of make arguments and variables
-make
+set -a
+source .env
+set +a
+```
+**Important:** Always use a `.env` file, AWS SSM Parameter Store, or Secrets Manager for sensitive variables like credentials and API keys. Never hard-code them, including when developing. AWS will quarantine an account if any credentials are accidentally exposed, and this will cause problems. **Make sure `.env` is listed in `.gitignore`.**
 
-# Deploy infrastructure to AWS
+## Deployment
+Once an AWS profile is configured and environment variables are exported, the application can be deployed using `make`.
+```bash
 make deploy
+```
+It is also possible to deploy or update individual services.
+```bash
+# Deploy/update only the infrastructure service
+make deploy.infrastructure
 
-# Build and load 1000 alleles from IMGT/HLA release version 3450
-# Leave limit blank to load all alleles
-make run release=3450 limit=1000
+# Deploy/update only the database service
+make deploy.database
 
-# Delete all data and tear down the architecture
-make delete
+# Deploy/update only the pipeline service
+make deploy.pipeline
 ```
 
-## Local Development
+<!-- ## Local Development
 
 ### Creating a Python Virtual Environment
 When developing locally, you will need to create individual virtual environments inside the `build/` and `load/` directories, since they require different dependencies:
@@ -153,31 +133,6 @@ pip install -U pip
 pip install -r requirements.txt
 ```
 
-### Environment Variables
-Add a `.env` file with the following variables.
-```bash
-GFE_BUCKET=<value>
-RELEASES=<value>
-ALIGN=<value>
-KIR=<value>
-MEM_PROFILE=<value>
-LIMIT=<value>
-NEO4J_HOST=<value>
-NEO4J_USERNAME=<value>
-NEO4J_PASSWORD=<value>
-```
-Run the command to export environment variables to the shell.
-```bash
-set -a
-source .env
-set +a
-```
-
-*Important:* *Always use a `.env` file or AWS SSM Parameter Store or Secrets Manager for sensitive variables like credentials and API keys. Never hard-code them, including when developing. AWS will quarantine an account if any credentials get accidentally exposed and this will cause problems. **MAKE SURE `.env` IS LISTED IN `.gitignore`.**
-
-<!-- ## Usage
-Follow these steps in sequence to build and load `gfe-db` locally. Make sure that your environment variables are set correctly before proceeding. -->
-
 ### Run Neo4j Docker
 Build the Docker image as defined in the Dockerfile. See [Configuring Neo4j in Dockerfile](#Configuring-Neo4j-in-Dockerfile) for important configuration settings.
 ```
@@ -205,7 +160,7 @@ docker stop gfe-db
 
 # Start container
 docker start gfe-db
-```
+``` -->
 
 <!-- ### Build GFE dataset
 Run the command to build the container for the build service.
@@ -266,7 +221,7 @@ Development notebook for refactoring `gfe-db` and the `build/src/build_gfedb.py`
 
 <!-- ## Running Tests -->
 
-## Configuring Neo4j in Dockerfile
+<!-- ## Configuring Neo4j in Dockerfile
 Configuration settings for Neo4j are passed through environment variables in the Dockerfile.
 
 ### Username & Password
@@ -282,7 +237,7 @@ Optimal memory for Neo4j depends on available RAM. Loading and querying a larger
 # Dockerfile; Rebuild the image after updating these
 ENV NEO4J_dbms_memory_heap_initial__size=2G
 ENV NEO4J_dbms_memory_heap_max__size=2G
-```
+``` -->
 
 <!-- ## Deployment
 `gfe-db` is deployed using Docker to an EC2 instance. Automated builds and loading of `gfe-db` on AWS is orchestrated using AWS Batch and StepFunctions. The infrastructure is defined using CloudFormation templates.
@@ -346,12 +301,6 @@ jupyter kernelspec uninstall gfe-db
 ## Troubleshooting
 * Check your AWS credentials in `~/.aws/credentials`
 * Check that the environment variables have been exported
-* Sometimes Neo4j sets permissions on mounted volumes. To get around this run this from the project root:
-```bash
-sudo chmod -R 777 .
-```
-* Check that the virtual environment is activated: `source .venv/bin/activate`
-* Check that requirements are installed: `pip install -r requirements.txt`
 * Check that Python 3.8 is being used
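The remaining checks can be scripted; a minimal sketch (the `NEO4J` pattern assumes the variable names from the `.env` example above):

```bash
# Confirm the active credentials resolve to the expected AWS account
aws sts get-caller-identity

# Confirm sensitive variables were exported (values will be printed)
printenv | grep NEO4J

# Confirm the Python version in use
python3 --version  # should report 3.8.x
```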
 
 ## Authors
