- [AWS Configuration](#aws-configuration)
- [Environment Variables](#environment-variables)
- [Deployment](#deployment)
- [Troubleshooting](#troubleshooting)
- [Authors](#authors)
- [References & Links](#references--links)
Graph database representing IPD-IMGT/HLA sequence data as GFE.

## Description
The `gfe-db` represents IPD-IMGT/HLA sequence data as GFE nodes and relationships in a Neo4j graph database. Running this application will set up the following services in AWS:

- VPC and subnet
- Neo4j database server
- Update pipeline and trigger
## Services

The project organizes its resources by service. Deployments are decoupled using Makefiles, and shared configuration is managed through SSM Parameter Store and Secrets Manager.
### Infrastructure

The infrastructure service deploys a VPC, public subnet, and initial SSM parameters and secrets for the other services to use.
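As a quick sanity check after deployment, a value published by this service can be read back with the AWS CLI. The parameter path below is only an illustrative placeholder, not the project's actual naming scheme.

```bash
# Read back a shared SSM parameter (path is a hypothetical placeholder)
aws ssm get-parameter \
  --name "/gfe-db/DataBucketName" \
  --query "Parameter.Value" \
  --output text
```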
### Database

The database service deploys an EC2 instance hosting a Neo4j Docker container into a public subnet so that it can be accessed through a browser.
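Neo4j serves its browser and HTTP API on port 7474 by default, so reachability can be checked against the instance's public DNS name from the EC2 console. The hostname below is a placeholder.

```bash
# Probe the Neo4j HTTP endpoint; it responds with a small JSON discovery document
curl -s http://ec2-XX-XXX-XX-XX.compute-1.amazonaws.com:7474 | jq .
```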
### Pipeline

The pipeline service automates updates of the database. A scheduled trigger Lambda watches the source data repository and starts the pipeline when a new IMGT/HLA version is released. The pipeline uses a StepFunctions state machine to orchestrate the build and load steps on AWS Batch.
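For reference, an execution could also be started by hand with the AWS CLI. The state machine ARN and input fields below are assumptions for illustration, not the pipeline's actual contract.

```bash
# Hypothetical manual start of the update pipeline's state machine
aws stepfunctions start-execution \
  --state-machine-arn "arn:aws:states:<region>:<account-id>:stateMachine:<name>" \
  --input '{"release": "3450", "limit": "1000"}'
```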
## Installation
Follow the steps below to set up the deployment environment.
### Prerequisites
* Python 3.8
* GNU Make 3.81
* AWS CLI
* SAM CLI
* Docker
* jq
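A quick way to confirm the tooling is available before proceeding:

```bash
# Print the versions of the required tools
python3 --version
make --version
aws --version
sam --version
docker --version
jq --version
```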
### AWS Configuration

Valid AWS credentials must be available to the AWS CLI and SAM CLI. The easiest way to provide them is to run `aws configure`, or to add them to `~/.aws/credentials` and export the `AWS_PROFILE` variable to the environment.
For more information visit the documentation page:
[Configuration and credential file settings](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html)
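For example, credentials can be attached to a named profile and exported for the current shell (the profile name `gfe-db` is only illustrative):

```bash
# Configure a named profile interactively, then make it active for this shell
aws configure --profile gfe-db
export AWS_PROFILE=gfe-db
```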
### Environment Variables

Non-sensitive environment variables are handled by the root Makefile. Sensitive environment variables containing secrets like passwords and API keys must be exported to the environment first.
Create a `.env` file in the project root.
```bash
NEO4J_USERNAME=<value>
NEO4J_PASSWORD=<value>
GITHUB_PERSONAL_ACCESS_TOKEN=<value>
```
Source the variables into the environment.
```bash
set -a
source .env
set +a
```
**Important:** Always use a `.env` file, AWS SSM Parameter Store, or Secrets Manager for sensitive variables like credentials and API keys. Never hard-code them, including during development. AWS will quarantine an account if credentials are accidentally exposed, and this will cause problems. **MAKE SURE `.env` IS LISTED IN `.gitignore`.**
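One way to guard against this is a check that appends `.env` to `.gitignore` only when it is missing:

```bash
# Add .env to .gitignore if it is not already listed
grep -qxF '.env' .gitignore 2>/dev/null || echo '.env' >> .gitignore
```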
## Deployment

Once an AWS profile is configured and environment variables are exported, the application can be deployed using `make`.
```bash
make deploy
```
It is also possible to deploy or update individual services.
```bash
# Deploy/update only the infrastructure service
make deploy.infrastructure

# Deploy/update only the database service
make deploy.database

# Deploy/update only the pipeline service
make deploy.pipeline
```
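To verify that a deployment finished cleanly, the stack status can be queried. The stack name below is an assumption; substitute the name reported by `make deploy`.

```bash
# Check the CloudFormation stack status (stack name is an assumption)
aws cloudformation describe-stacks \
  --stack-name gfe-db \
  --query "Stacks[0].StackStatus" \
  --output text
```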
<!--## Local Development

### Creating a Python Virtual Environment
When developing locally, you will need to create individual virtual environments inside the `build/` and `load/` directories, since they require different dependencies:
```bash
pip install -U pip
pip install -r requirements.txt
```

### Run Neo4j Docker
Build the Docker image as defined in the Dockerfile. See [Configuring Neo4j in Dockerfile](#Configuring-Neo4j-in-Dockerfile) for important configuration settings.
```
docker stop gfe-db

# Start container
docker start gfe-db
```-->
<!-- ### Build GFE dataset
Run the command to build the container for the build service.
<!-- ## Running Tests -->
<!--## Configuring Neo4j in Dockerfile
Configuration settings for Neo4j are passed through environment variables in the Dockerfile.

### Username & Password
### Memory Management
Optimal memory for Neo4j depends on available RAM.
```
# Dockerfile; Rebuild the image after updating these
ENV NEO4J_dbms_memory_heap_initial__size=2G
ENV NEO4J_dbms_memory_heap_max__size=2G
```-->
<!-- ## Deployment
`gfe-db` is deployed using Docker to an EC2 instance. Automated builds and loading of `gfe-db` on AWS are orchestrated using AWS Batch and StepFunctions. The infrastructure is defined using CloudFormation templates.