A Flask-based web service providing genomic region search, based on regulomedb.org.
Installation Requirements:
To run this application locally you need Docker. To download the machine learning models you need Python 3; the models are required for indexing, but the tests can be run without them.
In a Python 3 virtual environment, install boto3:
pip install boto3
Download machine learning models:
python utils/download_files.py
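The script uses boto3 to pull the model files down from S3; the gist is a few download_file calls, roughly like this sketch (the bucket and key names here are hypothetical; the real ones live in utils/download_files.py):
import boto3

# Hypothetical bucket and keys for illustration only; see
# utils/download_files.py for the actual locations.
s3 = boto3.client('s3')
for key in ('models/model_a.pkl', 'models/model_b.pkl'):
    s3.download_file('genomic-ml-models', key, key.split('/')[-1])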
Using the compose file suitable for your machine (M1 or Intel):
docker-compose --file docker-compose-index-m1/intel.yml build
docker-compose --file docker-compose-index-m1/intel.yml up
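While the indexing containers are up, you can optionally watch the indexes being built. This assumes the compose file publishes Elasticsearch on its default port 9200:
import requests

# List the indexes and their document counts via the ES cat API
# (assumes Elasticsearch is reachable on localhost:9200).
print(requests.get('http://localhost:9200/_cat/indices?v').text)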
After indexing has finished (it takes about 5 minutes), tear down:
docker-compose --file docker-compose-index-m1/intel.yml down --remove-orphans
These commands index the Elasticsearch (ES) database, creating a directory esdata
where the indexes are stored. This directory is reusable by the app (see the instructions for running below).
Using the compose file suitable for your machine:
docker-compose --file docker-compose-m1/intel.yml build
docker-compose --file docker-compose-m1/intel.yml up
The application is available at localhost:80.
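Once it is up, you can send a region query from Python. The endpoint and parameters below follow the regulomedb.org-style search and are an assumption; check the app's routes for the exact interface:
import requests

# Hypothetical query; adjust the path and parameters to the routes your
# version of the service actually exposes.
resp = requests.get(
    'http://localhost/search',
    params={'regions': 'chr1:39492461-39492462', 'genome': 'GRCh38', 'format': 'json'},
)
print(resp.status_code)
print(resp.json())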
Tear down:
docker-compose --file docker-compose-m1/intel.yml down --remove-orphans
Run tests using the compose file suitable for your machine:
docker-compose --file docker-compose-test-m1/intel.yml --env-file ./docker_compose/test.env up --build
Tear down:
docker-compose --file docker-compose-test-m1/intel.yml down -v --remove-orphans
This repo includes configuration for pre-commit hooks. To use them, install pre-commit and activate the hooks:
pip install pre-commit==2.17.0
pre-commit install
Now every time you run git commit, the automatic checks run against the changes you made.
A production-grade data services deployment consists of three machines:
- Main machine that runs the Flask app and sends requests to the ES machines.
- Regulome search ES
- ENCODED region-search ES
The instances have EC2 Instance Connect installed. To connect to them you need the EC2 Instance Connect CLI (which provides the mssh command) installed locally. Assuming the instance-id of the instance you want to connect to is i-foobarbaz123, you would connect with:
mssh ubuntu@i-foobarbaz123 --profile regulome --region us-west-2
Make sure you have activated the virtual environment created above. If you need a demo deployment for Regulome or Encoded region search, first set the environment variable DEMO_INDEXER_PASSWORD; the deploy script will use it as the indexer password. Then run the command below, which launches a single machine running both the GDS Flask app and the Elasticsearch server:
python deploy/deploy.py --demo
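The deploy script reads the password from the environment; conceptually it relies on something like the following (a hypothetical sketch, not the actual code in deploy/deploy.py):
import os

# Fail fast if the indexer password was not exported before deploying
# (hypothetical sketch of what the deploy script relies on).
password = os.environ.get('DEMO_INDEXER_PASSWORD')
if not password:
    raise SystemExit('Set DEMO_INDEXER_PASSWORD before running a demo deploy')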
Start indexing on the machine. For RegulomeDB:
cd /home/ubuntu/genomic-data-service
source genomic-venv/bin/activate
python genomic_data_service/region_indexer.py
Or for Encode region search:
cd /home/ubuntu/genomic-data-service
source genomic-venv/bin/activate
python genomic_data_service/region_indexer_encode.py
You can monitor the indexing progress using the Flower dashboard at <public IP of the machine>/indexer. For demo purposes, the username and password for the indexer are already set in the deploy script.
The command below will deploy three machines: the GDS main machine, the Regulome ES machine, and the Encoded ES machine:
python deploy/deploy.py
On each ES machine create a password for accessing the indexer:
sudo mkdir -p /etc/apache2
sudo htpasswd -c /etc/apache2/.htpasswd <your-user-name>
You will use this login/password to access the Flower dashboard on the machines. The dashboard is accessible at <public IP of the ES machine>/indexer. It is exposed to the internet, so be prudent when choosing the login/password (admin is a bad username; it is easy to guess).
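A quick way to confirm the dashboard is protected is to hit it with and without credentials; the IP and credentials below are placeholders:
import requests

ip = '203.0.113.10'  # placeholder public IP of the ES machine
# Anonymous access should be rejected; authenticated access should succeed.
print(requests.get(f'http://{ip}/indexer').status_code)  # expect 401
print(requests.get(f'http://{ip}/indexer', auth=('myuser', 'mypassword')).status_code)  # expect 200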
On the main machine, add the IP addresses of the ES machines to /home/ubuntu/genomic-data-service/config/production.cfg. Set the value of REGULOME_ES_HOSTS to the private IP address of the regulome data service machine, and the value of REGION_SEARCH_ES_HOSTS to the private IP address of the region search data service machine (note that in the normal case these values are lists with one item).
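For example, the relevant lines in production.cfg might look like this (the private IPs are placeholders):
# Placeholders; substitute the actual private IPs of your ES machines.
REGULOME_ES_HOSTS = ['10.0.0.12']
REGION_SEARCH_ES_HOSTS = ['10.0.0.34']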
Start each service on the main machine:
sudo systemctl daemon-reload
sudo systemctl enable --now genomic.socket
sudo systemctl enable genomic.service
sudo systemctl enable nginx.service
sudo systemctl start genomic
sudo systemctl start nginx
Start regulome region indexer on the regulome ES machine:
cd /home/ubuntu/genomic-data-service
source genomic-venv/bin/activate
python genomic_data_service/region_indexer.py
Start encoded region indexer on the encoded ES machine:
cd /home/ubuntu/genomic-data-service
source genomic-venv/bin/activate
python genomic_data_service/region_indexer_encode.py
You can monitor the indexing progress using the Flower dashboard. After indexing has finished (the region-search machine indexes in a few hours; the regulome machine takes a couple of days) the machines can be downsized. A good size for the regulome machine is t3a.2xlarge, and for the region-search machine t2.xlarge is sufficient. Do not forget to restart the services after resizing.
To deploy a regulome demo that uses your new deployment as its backend, edit https://github.com/ENCODE-DCC/regulome-encoded/blob/dev/ini-templates/production-template.ini and change genomic_data_service_url to point to the instance running the Flask app.
To deploy an encoded demo that uses your new deployment as the region-search backend, edit https://github.com/ENCODE-DCC/encoded/blob/dev/conf/pyramid/demo.ini and change genomic_data_service to point to the instance running the Flask app.
If you just want to deploy an Elasticsearch server only, for RegulomeDB:
python deploy/deploy.py --es regulome
For Encode:
python deploy/deploy.py --es encode
Follow the instructions in the production-grade deployment section above to create a password for accessing the indexer, and start the indexer on this ES machine.