Open Targets internal deployment guide

Overview

Using project: open-targets-genetics-dev.

Currently the genetics team provides input files in a GCP bucket gs://genetics-portal-dev-staging (staging). Some of these files are static, others are annotated with a date (variously YYMMDD and DDMMYY).
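
Because the date tags mix YYMMDD and DDMMYY, a six-digit tag can be ambiguous. The following is a minimal sketch of how the possible readings could be listed; `interpret_date_tag` and its `not_after` cut-off are hypothetical and not part of genetics-backend or genetics-pipe.

```python
from datetime import date, datetime
from typing import Dict, Optional

# The two formats come from this guide; the helper itself is hypothetical.
FORMATS = {"YYMMDD": "%y%m%d", "DDMMYY": "%d%m%y"}


def interpret_date_tag(tag: str, not_after: Optional[date] = None) -> Dict[str, date]:
    """Return each valid reading of a six-digit date tag, keyed by format name."""
    readings: Dict[str, date] = {}
    if len(tag) != 6 or not tag.isdigit():
        return readings
    for name, fmt in FORMATS.items():
        try:
            parsed = datetime.strptime(tag, fmt).date()
        except ValueError:
            continue  # not a real calendar date in this format
        # Optionally discard readings later than the cut-off (e.g. future dates).
        if not_after is None or parsed <= not_after:
            readings[name] = parsed
    return readings


if __name__ == "__main__":
    print(interpret_date_tag("220128"))
    print(interpret_date_tag("220128", not_after=date(2022, 12, 31)))
```

A tag with more than one remaining reading should be confirmed with the genetics team rather than guessed.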

A subset of these files is then manually copied by the BE team to gs://genetics-portal-dev-data (dev), into a directory corresponding to the release.

The files in dev are used to run the pipeline, typically using Dataproc.

| Configuration field | Likely staging location | Standard dev location |
| --- | --- | --- |
| variant-index.raw | provided by data team | /variant-annotation//variant-annotation.parquet |
| ensembl.lut | generated by BE | /lut/homo_sapiens_core_105_38_genes.json.gz |
| vep.homo-sapiens-cons-scores | should be in staging bucket | /lut/vep_consequences.tsv |
| interval.path | v2g/interval/* | /v2g/interval/*/*/<date>/data.parquet |
| qtl.path | v2g/qtl/<date>/ | v2g/qtl/<date> |
| variant-disease.studies | v2d/<date>/studies.parquet | v2d/studies.parquet |
| variant-disease.toploci | v2d/<date>/toploci.parquet | v2d/toploci.parquet |
| variant-disease.finemapping | v2d/<date>/finemapping.parquet/ | v2d/finemapping.parquet |
| variant-disease.ld | v2d/<date>/ld.parquet/ | v2d/ld.parquet |
| variant-disease.overlapping | v2d/<date>/locus_overlap.parquet | v2d/locus_overlap.parquet |
| variant-disease.coloc | coloc/<date>/coloc_processed_w_betas.parquet/ | v2d/coloc_processed_w_betas.parquet |
| variant-disease.trait_efo | v2d/<date>/trait_efo-2021-11-16.parquet | v2d/trait_efo.parquet |
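
As an illustration of how the table maps staging paths to dev paths, here is a minimal sketch that builds source and destination pairs for the variant-disease files under v2d. It is not the logic of scripts/prepare_inputs.sh; the function name is hypothetical, and the `<release>/inputs` prefix is an assumption inferred from the gsutil example used elsewhere in this guide.

```python
from typing import List, Tuple

STAGING_BUCKET = "gs://genetics-portal-dev-staging"
DEV_BUCKET = "gs://genetics-portal-dev-data"

# File names taken from the variant-disease rows of the table above.
V2D_FILES = [
    "studies.parquet",
    "toploci.parquet",
    "finemapping.parquet",
    "ld.parquet",
    "locus_overlap.parquet",
]


def v2d_copy_plan(release: str, staging_date: str) -> List[Tuple[str, str]]:
    """Return (source, destination) pairs for the v2d inputs of one release."""
    # ASSUMPTION: dev inputs live under <DEV_BUCKET>/<release>/inputs/.
    dev_root = f"{DEV_BUCKET}/{release}/inputs"
    return [
        (f"{STAGING_BUCKET}/v2d/{staging_date}/{name}", f"{dev_root}/v2d/{name}")
        for name in V2D_FILES
    ]


if __name__ == "__main__":
    for source, destination in v2d_copy_plan("22.03", "220212"):
        print(source, "->", destination)
```

Printing the plan before copying anything gives a list that can be shared with the genetics team for confirmation.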

Variant index section

The variant index is delivered as Parquet by the data team after filtering the latest gnomAD release.

If there is no new update, keep using the most recent one. Currently, the variant annotation is version 190129.

Ensembl

The genetics team does not provide the Ensembl file: we have to download it ourselves and generate the input.

This configuration field points to the latest reference gene table from Ensembl. To generate the file, follow the instructions in the create_genes_dictionary.py script. For example, the command I use is:

python create_genes_dictionary.py -o "./" -z -n homo_sapiens_core_104_38

The above example uses Ensembl 104. The most recent version is 105. If the version has not changed since the previous release, feel free to copy the input file from the previous release's input directory.
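
The naming pattern can be sketched as follows. The helper names are hypothetical; the `-o`, `-z` and `-n` arguments are copied from the example command above, and the output file name pattern is an assumption inferred from the dev location of ensembl.lut rather than taken from the script itself.

```python
from typing import List


def ensembl_core_name(version: int, assembly: int = 38) -> str:
    """Build the core database name, e.g. homo_sapiens_core_104_38."""
    return f"homo_sapiens_core_{version}_{assembly}"


def lut_command(version: int) -> List[str]:
    """Argument list mirroring the example command for a given Ensembl version."""
    return ["python", "create_genes_dictionary.py", "-o", "./", "-z", "-n",
            ensembl_core_name(version)]


def expected_lut_file(version: int) -> str:
    # ASSUMPTION: the output is named <core name>_genes.json.gz, as in the table.
    return f"{ensembl_core_name(version)}_genes.json.gz"


if __name__ == "__main__":
    print(" ".join(lut_command(105)))
    print(expected_lut_file(105))
```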

VEP consequences

The TSV file is provided by the genetics team. If the file is not present in the staging bucket, ask the genetics team for the most recent version.

Interval

Provided by the genetics team: these are mainly static and haven't been updated for years. They are in a nested file structure which must be preserved because the ETL uses the file path as an input.
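
To illustrate what preserving the nested structure means when copying, the sketch below keeps everything after the staging prefix unchanged and only swaps the prefix. The function and the example path components are hypothetical.

```python
def dev_destination(staging_object: str, staging_prefix: str, dev_prefix: str) -> str:
    """Map a staging object to its dev path, keeping the nested relative path."""
    prefix = staging_prefix.rstrip("/") + "/"
    if not staging_object.startswith(prefix):
        raise ValueError(f"{staging_object} is not under {staging_prefix}")
    relative = staging_object[len(prefix):]  # nested part the ETL relies on
    return dev_prefix.rstrip("/") + "/" + relative


if __name__ == "__main__":
    # Hypothetical object name; only the prefixes come from this guide.
    print(dev_destination(
        "gs://genetics-portal-dev-staging/v2g/interval/typeA/sourceB/190129/data.parquet",
        "gs://genetics-portal-dev-staging/v2g/interval",
        "gs://genetics-portal-dev-data/22.03/inputs/v2g/interval",
    ))
```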

QTL

Provided by the genetics team: these are updated on a regular basis.

Recipe: set up deployment machine

We need a VM to run deployments from. Typically this only needs to be done once and then we can use the machine for future releases.

# install dependencies
sudo apt-get install -y apt-transport-https ca-certificates dirmngr
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 8919F6BD2B48D754

echo "deb https://packages.clickhouse.com/deb stable main" | sudo tee \
    /etc/apt/sources.list.d/clickhouse.list
sudo apt-get update

sudo apt-get install -y clickhouse-client

# general utilities and system libraries
sudo apt install -y git \
tmux tree wget htop \
libgl1-mesa-glx libegl1-mesa libxrandr2 libxss1 libxcursor1 libxcomposite1 libasound2 libxi6 libxtst6

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
bash ~/miniconda.sh -p $HOME/miniconda
source ~/.bashrc

# get repositories
git clone https://github.com/opentargets/genetics-backend.git
git clone https://github.com/opentargets/genetics-pipe.git

# set up conda environments
cd genetics-backend && conda env create -f environment.yaml
conda activate backend-genetics
# add elastic-search loader
# https://github.com/moshe/elasticsearch_loader
pip install elasticsearch-loader

cd loaders/clickhouse

Recipe: get all inputs and run the ot-geckopipe

Use the VM called gp-deploy in the open-targets-genetics-dev project. The VM is preconfigured with the necessary utilities to run a release. The commands below use the instance name jb-release; substitute the name of the VM you are using.

  • start deployment machine: gcloud compute instances start "jb-release" --project "open-targets-genetics-dev" --zone "europe-west4-a"
  • SSH into deployment machine: gcloud compute ssh --zone "europe-west4-a" "jb-release" --tunnel-through-iap --project "open-targets-genetics-dev"
  • If not already done, clone required repository: git clone git@github.com:opentargets/genetics-backend.git
  • set up environment: conda activate backend-genetics
  • update Ensembl version (latest 106 Apr 22) and run script from genetics-backend/makeLUTs:
    • python create_genes_dictionary.py -o "./" -z -n homo_sapiens_core_106_38
  • add the Ensembl file to the bucket: gsutil cp -n homo_sapiens* gs://genetics-portal-dev-data/22.03/inputs/lut/
  • update variables in bash script in /scripts/prepare_inputs.sh (input script)
  • run input script in VM to move files from staging to dev buckets
    • Most of the inputs are used for the pipeline, but there are two static datasets which are copied, sumstats (sa) and v2d_credset.
    • It's useful to pipe the STDOUT of the script to a file, which can be provided to the genetics/data team to confirm that the correct files were used: ./scripts/prepare_inputs.sh >> genetics_input_log.txt
  • create a configuration file for release in config:
    • cp src/main/resources/application.conf config/<release>.conf and update as necessary.
  • Run genetics-pipe. There are two options: a Dataproc workflow (requires Scala) or bash scripts. The former is easier.
    • Workflow option: Open the worksheet scripts/dataproc-workflow.sc, update top level variables (should only be the input and output directories) and run. You can terminate the worksheet on your local machine once it has started since Dataproc will run in the background. The advantage of using the workflow is that Dataproc will create the specified cluster, run the steps in the right order, then destroy the cluster without the need for any manual intervention.
    • Script option:
      • update top level variables in scripts/run_cluster.sh: release and config should be the only changes necessary.
      • run script scripts/run_cluster.sh from root directory. This script builds a jar file, pushes it to GS storage, starts a cluster and runs all steps. Some of the jobs will fail because of missing dependencies. Consult documentation/step_dependencies for the correct order.
        • In general run in the following phases (some steps can be run concurrently):
          • variant-index (30m), variant-gene (180min)
          • dictionaries, variant-disease (2min), variant-disease-coloc (2min)
          • disease-variant-gene (25min)
          • scored datasets (130min)
          • manhattan (25min) (Run this after the following steps)
  • inform genetics team that the outputs are ready, and they will run the ML pipeline to generate the l2g outputs. The file we need for the final step (manhattan) is typically found under genetics-portal-dev-staging/l2g/<date>/predictions/l2g.full.220128.parquet in the staging area.
  • Copy L2G file from the staging area to the development area (updating dates as necessary): gsutil -m cp -r gs://genetics-portal-dev-staging/l2g/220212/predictions/l2g.full.220212.parquet/part-* gs://genetics-portal-dev-data/22.03/outputs/l2g/
  • Run the manhattan step using either the scripts or the workflow scripts/dataproc-workflow-manhattan.sc. Note that the workflow assumes all prior steps have been completed and the inputs are available.
  • Check all the expected output directories are present using the ammonite script amm scripts/check_outputs.sc.
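
The phase timings listed above can be combined into a rough estimate of the pipeline run time. This is a minimal sketch under two assumptions: steps within a phase either all run concurrently or all run one after another, and the dictionaries step (no duration given) contributes nothing. The time spent waiting for the L2G predictions before the manhattan step is not included.

```python
from typing import Dict, List, Optional

# Durations in minutes, copied from the phases above; None means not stated.
PHASES: List[Dict[str, Optional[int]]] = [
    {"variant-index": 30, "variant-gene": 180},
    {"dictionaries": None, "variant-disease": 2, "variant-disease-coloc": 2},
    {"disease-variant-gene": 25},
    {"scored datasets": 130},
    {"manhattan": 25},
]


def estimated_minutes(phases: List[Dict[str, Optional[int]]], concurrent: bool = True) -> int:
    """Sum phase durations; a concurrent phase costs as much as its longest step."""
    total = 0
    for phase in phases:
        known = [minutes for minutes in phase.values() if minutes is not None]
        if not known:
            continue
        total += max(known) if concurrent else sum(known)
    return total


if __name__ == "__main__":
    print("concurrent within phases:", estimated_minutes(PHASES), "min")
    print("fully sequential:", estimated_minutes(PHASES, concurrent=False), "min")
```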

Recipe: create infrastructure

  • Using the genetics backend project start two VMs: one each for ES and Clickhouse using the helper scripts: infrastructure/gcp/genetics/create-clickhouse-node.sh and infrastructure/gcp/genetics/create-elasticsearch-node.sh
  • export variables for the two created VMs (bind the internal GCP IP address; this assumes you're in a GCP VM yourself):
    • export ES_HOST=$(gcloud compute instances list | grep -i run | grep elasticsearch | awk '{ print $4 }' | tail -1)
    • export CLICKHOUSE_HOST=$(gcloud compute instances list | grep -i run | grep clickhouse | awk '{ print $4 }' | tail -1)
  • activate the correct python environment: conda activate backend-genetics
  • run the script loaders/clickhouse/create_and_load_everything_from_scratch.sh in the genetics-backend repository, providing a link to the input files.
    • There can be a short delay while the instances start up and complete their installations of ES and CH. You can test if they are ready by running curl $ES_HOST:9200 and curl $CLICKHOUSE_HOST:8123 which should both return a non-error response.
    • Note this process is slow: ~17 hours!
    • ./create_and_load_everything_from_scratch.sh gs://genetics-portal-dev-data/22.01.2/outputs
  • Once loading is complete, 'bake' the instances so that we can deploy the images using Terraform.
    • Find the name of the latest running instance (replace <elasticsearch|clickhouse> with one of the two names): gcloud compute instances list --project=open-targets-genetics-dev | grep -i run | grep <elasticsearch|clickhouse> | awk '{ print $1 }' | tail -1
    • Bake the image using the scripts in genetics-backend/gcp/bake_[es|ch]_node.sh with the instance found above. These create disk images which we can deploy using the Terraform defined in the genetics terraform repo.
      • For example:
        • ./bake_ch_node.sh $(gcloud compute instances list --project=open-targets-genetics-dev | grep -i run | grep clickhouse | awk '{ print $1 }' | tail -1)
        • ./bake_es_node.sh $(gcloud compute instances list --project=open-targets-genetics-dev | grep -i run | grep elasticsearch | awk '{ print $1 }' | tail -1)
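
The grep/awk/tail pipelines used in this recipe select one whitespace-separated column from the last matching line of the instance listing. The sketch below reproduces that selection in Python so the behaviour is easier to see; `last_running_field` and the sample listing are hypothetical, and real `gcloud compute instances list` output may have different columns.

```python
from typing import Optional


def last_running_field(listing: str, keyword: str, field: int) -> Optional[str]:
    """Mimic: grep -i run | grep <keyword> | awk '{ print $<field> }' | tail -1."""
    value = None
    for line in listing.splitlines():
        # grep -i run matches "run" anywhere in the line, not only in the status.
        if "run" in line.lower() and keyword in line:
            columns = line.split()
            # awk prints an empty string when the field does not exist.
            value = columns[field - 1] if len(columns) >= field else ""
    return value  # None when no line matched


if __name__ == "__main__":
    # Hypothetical listing: instance names and addresses are made up.
    sample = "\n".join([
        "NAME ZONE MACHINE_TYPE INTERNAL_IP STATUS",
        "example-elasticsearch-1 europe-west1-c n1-standard-8 10.0.0.5 RUNNING",
        "example-clickhouse-1 europe-west1-c n1-standard-8 10.0.0.6 RUNNING",
    ])
    print(last_running_field(sample, "clickhouse", 1))
    print(last_running_field(sample, "clickhouse", 4))
```

Because the column is chosen by position, an empty or extra column in the listing shifts the result, so it is worth echoing the exported variables before running the loaders.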

Sanity checks

  • You should check the size of the images and counts in ES and Clickhouse to get an idea of whether there were any problems in loading the data.
  • Clickhouse:
    • SSH into the instance: gcloud compute ssh --zone "europe-west1-c" "devgen2202-ch-11-clickhouse-gc34" --tunnel-through-iap --project "open-targets-genetics-dev" -- -L 8123:localhost:8123
    • Execute the following command (using either clickhouse-client or another DB manager) to get counts:
SELECT table,
       sum(rows) AS rows,
       formatReadableSize(sum(bytes)) AS size
FROM system.parts
WHERE active
GROUP BY table;

The database in the 22.02 release shows:

┌─table──────────────────┬───────rows─┬─size───────┐
│ genes                  │      19569 │ 3.59 MiB   │
│ studies                │      50719 │ 2.08 MiB   │
│ variants               │   72858944 │ 5.41 GiB   │
│ v2d_by_stchr           │   20488888 │ 323.71 MiB │
│ v2d_sa_gwas            │  582828390 │ 29.51 GiB  │
│ v2g_structure          │          9 │ 3.40 KiB   │
│ v2d_coloc              │    4458533 │ 306.86 MiB │
│ l2g_by_gsl             │    3580861 │ 155.29 MiB │
│ v2d_credset            │   38834105 │ 1.34 GiB   │
│ v2d_by_chrpos          │   20488888 │ 414.83 MiB │
│ manhattan              │     279116 │ 44.22 MiB  │
│ v2g_scored             │ 1030927072 │ 20.09 GiB  │
│ d2v2g_scored           │ 1658712886 │ 41.05 GiB  │
│ studies_overlap        │   14570115 │ 154.52 MiB │
│ l2g_by_slg             │    3580861 │ 168.87 MiB │
│ v2d_sa_molecular_trait │  442006706 │ 14.63 GiB  │
└────────────────────────┴────────────┴────────────┘

As far as I know, we would not expect order-of-magnitude changes in these counts between releases.
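
A minimal sketch of how the comparison could be automated, assuming the row counts of the new release have already been collected into a dictionary (for example from the query above). The function name and the threshold of one order of magnitude are illustrative choices; the baseline values are the 22.02 counts shown in the table.

```python
import math
from typing import Dict, Optional

# Row counts from the 22.02 release, copied from the table above.
BASELINE_22_02: Dict[str, int] = {
    "genes": 19569,
    "studies": 50719,
    "variants": 72858944,
    "v2d_by_stchr": 20488888,
    "v2d_sa_gwas": 582828390,
    "v2g_structure": 9,
    "v2d_coloc": 4458533,
    "l2g_by_gsl": 3580861,
    "v2d_credset": 38834105,
    "v2d_by_chrpos": 20488888,
    "manhattan": 279116,
    "v2g_scored": 1030927072,
    "d2v2g_scored": 1658712886,
    "studies_overlap": 14570115,
    "l2g_by_slg": 3580861,
    "v2d_sa_molecular_trait": 442006706,
}


def order_of_magnitude_changes(
    baseline: Dict[str, int], current: Dict[str, int], threshold: float = 1.0
) -> Dict[str, Optional[float]]:
    """Flag tables whose row count moved by at least `threshold` powers of ten.

    Missing or empty tables are flagged with None.
    """
    flagged: Dict[str, Optional[float]] = {}
    for table, old_rows in baseline.items():
        new_rows = current.get(table)
        if not new_rows or not old_rows:
            flagged[table] = None
            continue
        shift = math.log10(new_rows / old_rows)
        if abs(shift) >= threshold:
            flagged[table] = round(shift, 2)
    return flagged


if __name__ == "__main__":
    # Hypothetical new counts: studies has lost most of its rows.
    new_counts = dict(BASELINE_22_02, studies=5000)
    print(order_of_magnitude_changes(BASELINE_22_02, new_counts))
```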

Updating Terraform XYZ

Using the genetics terraform repository:

For this, use the master branch, and remember to pull changes from the remote before making your own.

  • Create a new profile which will define the deployment.
    • cp profiles/deployment_context.devgen2111 profiles/deployment_context.devgen<release>
    • Replace <release> with the new release tag, and replace 2111 with the most recent release number so that the copied profile needs as few changes as possible.
  • Update the configuration in the devgen file created above. To see which fields are often changed, look at the difference between previous releases with the command diff deployment_context.devgen2111 deployment_context.devgen2106. Fields that typically need updating:
    • config_release_name: matches the context file name suffix
    • config_dns_subdomain_prefix: same as config_release_name
    • config_vm_elastic_search_image: Image you baked earlier
    • config_vm_clickhouse_image: Image you baked earlier
    • config_vm_api_image_version: latest API. From the API repository run git checkout master && git pull && git tag --list to see options. It's typically the last one.
    • config_vm_webapp_release: this will be the latest tagged version of the web app
    • DEVOPS_CONTEXT_PLATFORM_APP_CONFIG_API_URL: update URL to include config_release_name.
  • Activate xyz profile
    • make tfactivate profile=xyz
  • Set remote backend (so multiple users can share state)
    • make tfbackendremote
  • Activate the deployment context you configured earlier.
    • make depactivate profile=devgen<release>
  • Download all dependencies
    • make tfinit
  • Check for existing Terraform state (things that are already deployed)
    • terraform state list. If this is the first time running these commands nothing will be displayed. After you have deployed the infrastructure running this command will show you what is currently available.
  • Inspect the plan: make tfplan. This will show you what Terraform plans to do
  • Execute the plan: make tfapply. Terraform will ask for confirmation of the changes.
  • Push your deployed changes to GitHub so others can use them if necessary: git add profiles/deployment_context.devgen<release> && git commit -m "Deployment configuration for <release>" && git push
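
As a complement to the diff command mentioned above, the sketch below lists which settings differ between two deployment context files. It assumes the files consist of simple `key=value` lines with `#` comments, which is an assumption about the profile format and not something this guide confirms; the function names and sample values are hypothetical.

```python
from typing import Dict, List


def parse_context(text: str) -> Dict[str, str]:
    """Parse key=value lines, skipping blanks and '#' comments (assumed format)."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def changed_keys(old_text: str, new_text: str) -> List[str]:
    """Return the sorted keys whose values differ, or that exist in only one file."""
    old, new = parse_context(old_text), parse_context(new_text)
    return sorted(key for key in old.keys() | new.keys() if old.get(key) != new.get(key))


if __name__ == "__main__":
    # Hypothetical file contents; only the key names come from this guide.
    previous = "config_release_name=devgen2111\nconfig_dns_subdomain_prefix=devgen2111\n"
    updated = "config_release_name=devgen2202\nconfig_dns_subdomain_prefix=devgen2111\n"
    print(changed_keys(previous, updated))
```

Comparing the result with the list of fields above is a quick way to spot a setting that was left at the previous release's value.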

Recipe: Big Query

This step assumes that you have generated/collected all of the data as specified in the "get all inputs and run the ot-geckopipe" recipe.

  • If you don't have it already, clone the genetics output support repository
  • Update the variables under heading Variables for sync data in the config.tfvars file.
  • Run the shell command make bigquerydev

| Configuration field | Staging location (raw data from point of view of data joining) | Dev location for running data-joining | Notes |
| --- | --- | --- | --- |
| variant-index.raw | provided by data team | /variant-annotation//variant-annotation.parquet | Copied from release to release, not updated since 2019 |
| ensembl.lut | generated by BE | /lut/homo_sapiens_core_105_38_genes.json.gz | This will be deprecated once we can use the Target Index from the ETL |
| vep.homo-sapiens-cons-scores | recycled from previous release | /lut/vep_consequences.tsv | Copied from previous release |
| interval.path | v2g/interval/* | /v2g/interval/*/*/*/data.parquet | Effectively static as we don't regenerate it. This has one of those annoying name.parquet components, but it's heavily nested and we can read it from a number of levels higher up with a wildcard. |
| qtl.path | v2g/qtl/YYMMDD/ | v2g/qtl/ | |
| variant-gene.weights | carried over from previous release | lut/v2g_scoring_source_weights.date.json | Copied from previous release: will be moved into ETL config in future |
| variant-disease.studies | v2d/YYMMDD/studies.parquet | v2d/studies.parquet | Single file |
| variant-disease.toploci | v2d/YYMMDD/toploci.parquet | v2d/toploci.parquet | Single file |
| variant-disease.finemapping | v2d/YYMMDD/finemapping.parquet/ | v2d/finemapping | We want the input renamed to get rid of the '.parquet' component |
| variant-disease.ld | v2d/YYMMDD/ld.parquet/ | v2d/ld.parquet | We want the input renamed to get rid of the '.parquet' component |
| variant-disease.overlapping | v2d/YYMMDD/locus_overlap.parquet | v2d/locus_overlap.parquet | Single file |
| variant-disease.coloc | coloc/YYMMDD/coloc_processed_w_betas.parquet/ | v2d/coloc_processed_w_betas.parquet | |
| variant-disease.trait_efo | v2d/YYMMDD/trait_efo-2021-11-16.parquet | v2d/trait_efo.parquet | We want 'trait_efo' to not have the embedded date, as that is in the file path |
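
The renaming wishes in the Notes column can be sketched as two small helpers. Both function names are hypothetical, and whether to apply them to a given input is a decision for the release owner; the sketch only shows the string transformations.

```python
import re

# Matches an embedded -YYYY-MM-DD immediately before a final ".parquet".
_EMBEDDED_DATE = re.compile(r"-\d{4}-\d{2}-\d{2}(?=\.parquet$)")


def drop_embedded_date(name: str) -> str:
    """trait_efo-2021-11-16.parquet -> trait_efo.parquet (trailing '/' removed)."""
    return _EMBEDDED_DATE.sub("", name.rstrip("/"))


def drop_parquet_suffix(name: str) -> str:
    """finemapping.parquet/ -> finemapping (names without the suffix are kept)."""
    stripped = name.rstrip("/")
    suffix = ".parquet"
    return stripped[: -len(suffix)] if stripped.endswith(suffix) else stripped


if __name__ == "__main__":
    print(drop_embedded_date("trait_efo-2021-11-16.parquet"))
    print(drop_parquet_suffix("finemapping.parquet/"))
```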