A Flask-based web service providing genomic region search, based on regulomedb.org.
Installation Requirements:
To run this application locally you need Docker. To download the machine learning models you need Python 3; the models are required for indexing, but the tests can be run without them.
In a Python 3 virtual environment, install boto3:
pip install boto3
Download machine learning models:
python utils/download_files.py
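The script uses boto3 to pull the model files down from S3; the gist is a few download_file calls, roughly like this sketch (the bucket and key names here are hypothetical; the real ones live in utils/download_files.py):
import boto3

# Hypothetical bucket and keys for illustration only; see
# utils/download_files.py for the actual locations.
s3 = boto3.client('s3')
for key in ('models/model_a.pkl', 'models/model_b.pkl'):
    s3.download_file('genomic-ml-models', key, key.split('/')[-1])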
Using the compose file suitable for your machine (M1 or Intel):
docker-compose --file docker-compose-index-m1/intel.yml build
docker-compose --file docker-compose-index-m1/intel.yml up
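While the indexing containers are up, you can optionally watch the indexes being built. This assumes the compose file publishes Elasticsearch on its default port 9200:
import requests

# List the indexes and their document counts via the ES cat API
# (assumes Elasticsearch is reachable on localhost:9200).
print(requests.get('http://localhost:9200/_cat/indices?v').text)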
After indexing has finished (it takes about 5 minutes), tear down:
docker-compose --file docker-compose-index-m1/intel.yml down --remove-orphans
These commands index the Elasticsearch (ES) database, creating a directory esdata
where the indexes are stored. This directory is reusable by the app (see the instructions for running below).
Using the compose file suitable for your machine:
docker-compose --file docker-compose-m1/intel.yml build
docker-compose --file docker-compose-m1/intel.yml up
The application is available at localhost:80.
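Once it is up, you can send a region query from Python. The endpoint and parameters below follow the regulomedb.org-style search and are an assumption; check the app's routes for the exact interface:
import requests

# Hypothetical query; adjust the path and parameters to the routes your
# version of the service actually exposes.
resp = requests.get(
    'http://localhost/search',
    params={'regions': 'chr1:39492461-39492462', 'genome': 'GRCh38', 'format': 'json'},
)
print(resp.status_code)
print(resp.json())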
Tear down:
docker-compose --file docker-compose-m1/intel.yml down --remove-orphans
Run tests using the compose file suitable for your machine:
docker-compose --file docker-compose-test-m1/intel.yml --env-file ./docker_compose/test.env up --build
Tear down:
docker-compose --file docker-compose-test-m1/intel.yml down -v --remove-orphans
This repo includes configuration for pre-commit hooks. To use them, install pre-commit and activate the hooks:
pip install pre-commit==2.17.0
pre-commit install
Now every time you run git commit, the automatic checks run against the changes you made.
A production-grade data services deployment consists of three machines:
- Main machine that runs the Flask app and sends requests to the ES machines.
- Regulome search ES
- ENCODED region-search ES
The instances have EC2 Instance Connect installed. To connect to them you need the EC2 Instance Connect CLI (which provides the mssh command) installed locally. Assuming the instance-id of the instance you want to connect to is i-foobarbaz123, you would connect with:
mssh ubuntu@i-foobarbaz123 --profile regulome --region us-west-2
Make sure you have activated the virtual environment created above. If you need a demo deployment for Regulome or Encoded region search, first set the environment variable DEMO_INDEXER_PASSWORD; the deploy script will use it as the indexer password. Then run the command below, which launches a single machine running both the GDS Flask app and the Elasticsearch server:
python deploy/deploy.py --demo
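The deploy script reads the password from the environment; conceptually it relies on something like the following (a hypothetical sketch, not the actual code in deploy/deploy.py):
import os

# Fail fast if the indexer password was not exported before deploying
# (hypothetical sketch of what the deploy script relies on).
password = os.environ.get('DEMO_INDEXER_PASSWORD')
if not password:
    raise SystemExit('Set DEMO_INDEXER_PASSWORD before running a demo deploy')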
Start indexing on the machine. For RegulomeDB:
cd /home/ubuntu/genomic-data-service
source genomic-venv/bin/activate
python genomic_data_service/region_indexer.py
Or for Encode region search:
cd /home/ubuntu/genomic-data-service
source genomic-venv/bin/activate
python genomic_data_service/region_indexer_encode.py
You can monitor the indexing progress using the Flower dashboard at <public IP of the machine>/indexer. For demo purposes, the username and password for the indexer are already set in the deploy script.
The command below will deploy three machines: the GDS main machine, the Regulome ES machine, and the Encoded ES machine:
python deploy/deploy.py
On each ES machine create a password for accessing the indexer:
sudo mkdir -p /etc/apache2
sudo htpasswd -c /etc/apache2/.htpasswd <your-user-name>
You will use this login/password to access the Flower dashboard on the machines. The dashboard is accessible at <public IP of the ES machine>/indexer. It is exposed to the internet, so be prudent when choosing the login/password (admin is a bad username; it is easy to guess).
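A quick way to confirm the dashboard is protected is to hit it with and without credentials; the IP and credentials below are placeholders:
import requests

ip = '203.0.113.10'  # placeholder public IP of the ES machine
# Anonymous access should be rejected; authenticated access should succeed.
print(requests.get(f'http://{ip}/indexer').status_code)  # expect 401
print(requests.get(f'http://{ip}/indexer', auth=('myuser', 'mypassword')).status_code)  # expect 200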
On the main machine, add the IP addresses of the ES machines to /home/ubuntu/genomic-data-service/config/production.cfg. Set the value of REGULOME_ES_HOSTS to the private IP address of the regulome data service machine, and the value of REGION_SEARCH_ES_HOSTS to the private IP address of the region search data service machine (note that in the normal case these values are lists with one item).
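For example, the relevant lines in production.cfg might look like this (the private IPs are placeholders):
# Placeholders; substitute the actual private IPs of your ES machines.
REGULOME_ES_HOSTS = ['10.0.0.12']
REGION_SEARCH_ES_HOSTS = ['10.0.0.34']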
Start each service on the main machine:
sudo systemctl daemon-reload
sudo systemctl enable --now genomic.socket
sudo systemctl enable genomic.service
sudo systemctl enable nginx.service
sudo systemctl start genomic
sudo systemctl start nginx
Start regulome region indexer on the regulome ES machine:
cd /home/ubuntu/genomic-data-service
source genomic-venv/bin/activate
python genomic_data_service/region_indexer.py
Start encoded region indexer on the encoded ES machine:
cd /home/ubuntu/genomic-data-service
source genomic-venv/bin/activate
python genomic_data_service/region_indexer_encode.py
You can monitor the indexing progress using the Flower dashboard. After indexing has finished (the region-search machine indexes in a few hours; the regulome machine takes a couple of days) the machines can be downsized. A good size for the regulome machine is t3a.2xlarge, and for the region-search machine t2.xlarge is sufficient. Do not forget to restart the services after resizing.
To deploy a regulome demo that uses your new deployment as its backend, edit https://github.com/ENCODE-DCC/regulome-encoded/blob/dev/ini-templates/production-template.ini and change genomic_data_service_url to point to the instance running the Flask app.
To deploy an encoded demo that uses your new deployment as the region-search backend, edit https://github.com/ENCODE-DCC/encoded/blob/dev/conf/pyramid/demo.ini and change genomic_data_service to point to the instance running the Flask app.
If you just want to deploy an Elasticsearch server only, for RegulomeDB:
python deploy/deploy.py --es regulome
For Encode:
python deploy/deploy.py --es encode
Follow the instructions in the production-grade deployment section above to create a password for accessing the indexer, and start the indexer on this ES machine.