Lifelike staging az deployment becomes master (#2213)

* Convert `get_genes_to_organisms` to AQL * Convert `get_proteins_to_organisms` to AQL * Convert `get_global_inclusions_count` to AQL * Convert `get_global_inclusions` to AQL * Convert `get_global_inclusions_paginated` to AQL * Convert `get_docs_by_ids` to AQL * Convert `get_mesh_by_ids` to AQL * Convert `get_node_labels_and_relationship` to AQL * Convert `global_inclusions_by_type` queries to AQL * Convert `_global_annotation_exists_in_kg` to AQL * Update goc queries for globals * Fix linting * Skip annotations pytests This is a temporary measure to allow CI to pass without needing to refactor every annotations test which previously used Neo4j. * Fix bug in `get_global_inclusions_by_type_query` * Remove unnecessary graph services from annotations * Update ET to use Arango * Resolve mypy & pycodestyle issues * Update organism search to use AQL * Update visualizer search to use AQL * Update synonym search to use AQL * Remove `SearchService` No longer needed as it isn't used anywhere. * Update viz expansion/snippets queries to use AQL * Fix bug in batch uri request api * Ignore failing visualizer tests Just silencing these until after the arango integration is complete. * Fix lint issues * Update id properties to use `IdType` This should help if in the future we decide to go back to using numbers, we won't have to change every single instance of a type definition. * Fix bad type definition in sidenav-type view * Move import to correct group * Add `verify_override` argument to ArangoClient * Remove KgService and Pathway Browser Since this feature is mostly a prototype and we're moving away from Neo4j anyway, doesn't really make sense to keep it. * Remove remaining references to neo4j package Removes (almost) all remaining references to the neo4j python driver in the appserver. There are still a few references in the pytests, these can be ironed out when the tests themselves are updated in the near future. * Add arango driver to stats-enrichment pipfile * Update SE to use arango * Remove neo4j & py2neo from SE pipfile * Fix pagination bug in visualizer search * Fix bug in visualizer expansion query * Update cache-invalidator to use Arango * Add correct sorting to visualizer search query * Fix sorting in misc. visualizer snippet queries * Remove possibly unnecessary `exec $@` call * Add sanity check log after dbs have started * Remove n4j container & update startup w/ arango * Test ansible workflow changes (add verbosity) * Re-format relevant visualizer JS files * Update expand to bulk create reference tables * Fix input/output errors in visualization cmp * Fix direction bug in reference table creation * Improve visualizer expand timing Consolidates the expand/reference table requests into a single one. Also adds some small performance improvements on the client. * Add improved association matching to viz queries * Add very slight perf improvement to snippets query * Add starts-with phrase search to viz search * Add perf update to node pair snippets query * Fix domain labels not appearing in viz search * Fix mypy errors * Remove accidental workflow changes * Fix ChEBI typo in constants * Fix pagination issue in viz search * Clean up a few files * Update arango conftests + update initial tests * Update manual annotations tests * Update neo4j api tests * Update database annotations tests * Use camel_to_snake_dict instead of recent change * Update remaining visualizer tests * Add comment to redis queue tests * Fix appserver linting checks * Fix client lint checks * Update enrichment queries to loop over inputs We noticed that when using an `IN` clause, the results were being sorted in seemingly alphabetical order. To avoid this, we changed the queries to iterate over the inputs to return the results in input order. * Add perf. improvement to stats-enrichment query * Add perf. improvement to anno fallback queries * Fix typo in genes to organism query * Add another date conversion case for arango data * Fix bad error handling in global creation * Generalize arango date format checks * Remove debug comments * Fix linter issues * Fix appserver issues * Fix incorrect property name in go term query * Update deployment submodule (add AQL vars to qa) * Update deployment submodule (rm bad pip installs) * Update deployment submodule (demo vars) * Increase ansible log verbosity for debugging * Fix typo * Update deployment submodule (rm apm vars) * Update deployment submodule (switch to gpt branch) * Update deployment submodule (add gpt key to demo) * Fix pycodestyle errors * Fix pytest * Add auto lint changes for pytest update * Fix prettier & black warnings * Fix additional black/prettier errors * Fix code style issues with Black * Fix code style issues with Prettier * Remove old `util.py` in favor of new files * Rename `neo4j_test.py` to `arango_test.py` * Fix flake8 issue * Fix silent bugs in visualizer search * Fix typo in associated type snippet count query * Fix merge conflict in visualizer search records * Change deployment submodule ref * Remove unnecessary secret from az workflows * Update deployment submodule (fix bad value) * Update deployment submodule (add docker-az role) * Update deployment submodule (fix bad file location) * Update deployment submodule (re-add JWKS vars) * Update deployment submodule (+ openai vars to env) * Update deployment submodule (add networks) * Update deployment submodule (edit JWKS vars) * Update deployment submodule (remove vals from JWKS) * Add empty string defaults for JWKS flask config * Fix badly merged changes for LL-5300 * Add synonym field to organism query result * Update deployment to latest lifelike-public-temp-fix * Update deployment submodule (update frontend host) * Update deployment submodule (update stg app version) * Update deployment submodule to latest * Update deployment submodule (rm var from env) * Update deployment submodule (update flask env) * Fix code style issues with Prettier --------- Co-authored-by: Ethan Sanchez <ethan.dsanch@gmail.com> Co-authored-by: Lint Action <lint-action@samuelmeuli.com>
biosustain · Jan 25, 2024 · 6511992 · 6511992
1 parent e04c648
commit 6511992
Show file tree

Hide file tree

Showing 166 changed files with 5,808 additions and 916,431 deletions.
diff --git a/.github/workflows/deployment-az-public.yml b/.github/workflows/deployment-az-public.yml
@@ -18,5 +18,4 @@ jobs:
       SSH_KEY: ${{ secrets.ANSIBLE_PRIVATE_SSH_KEY }}
       CONTAINER_REGISTRY_USERNAME: ${{ secrets.AZURE_CR_USERNAME }}
       CONTAINER_REGISTRY_PASSWORD: ${{ secrets.AZURE_CR_PASSWORD }}
-      GCP_CREDENTIALS: ${{ secrets.GCE_SA_KEY }}
       INFRA_PAT: ${{ secrets.INFRA_PAT }}
diff --git a/.github/workflows/deployment-az.yml b/.github/workflows/deployment-az.yml
@@ -26,8 +26,6 @@ on:
         required: true
       SSH_KEY:
         required: true
-      GCP_CREDENTIALS:
-        required: true
       INFRA_PAT:
         required: true
 

diff --git a/.github/workflows/deployment-gcp.yml b/.github/workflows/deployment-gcp.yml
@@ -156,4 +156,4 @@ jobs:
             --extra-vars github_run_id=${{ github.run_id }}
             --extra-vars postgres_host=${{ steps.database-host.outputs.ip_address }}
             --user ansible
-            --verbose
+            -vvvv
diff --git a/appserver/Dockerfile b/appserver/Dockerfile
@@ -33,6 +33,7 @@ COPY --chown=1000:1000 . .
 ENV PYTHONUNBUFFERED 1
 ENV PYTHONPATH $N4J_HOME
 
+# Set Python3 as the default when running "python"
 RUN echo 'alias python=python3' >> ~/.bashrc && source ~/.bashrc
 
 USER $N4J_USER

diff --git a/appserver/bin/startup.sh b/appserver/bin/startup.sh
@@ -9,7 +9,7 @@ if [ "${FLASK_ENV}" = "development" ] && [ "${FLASK_APP_CONFIG}" = "Development"
     # wait for postgres
     timeout 300 ${__dir__}/wait-for-postgres
     # wait for neo4j
-    timeout 300 ${__dir__}/wait-for-neo4j
+    timeout 300 ${__dir__}/wait-for-arango
     #wait for elastic
     timeout 300 ${__dir__}/wait-for-elastic
     # setup db

diff --git a/appserver/bin/wait-for-arango b/appserver/bin/wait-for-arango
@@ -0,0 +1,16 @@
+#!/bin/bash
+
+echo "Waiting for Arango"
+
+ARANGO_STATUS="000"
+
+until [ "$ARANGO_STATUS" = "200" ]
+do
+    ARANGO_STATUS=`curl -s -o /dev/null -I -w "%{http_code}" --basic --user "${ARANGO_USERNAME}:${ARANGO_PASSWORD}" -X GET ${ARANGO_HOST}/_api/endpoint`
+    echo "Status of Arango: $ARANGO_STATUS"
+    sleep 2
+done
+
+# Run command | https://docs.docker.com/compose/startup-order/
+>&2 echo "Arango started - executing command"
+exec $@
diff --git a/appserver/config.py b/appserver/config.py
@@ -18,10 +18,10 @@ class Base:
     APP_VERSION = os.environ.get('APP_VERSION', 'undefined')
     LOGGING_LEVEL = os.environ.get('LOGGING_LEVEL', logging.INFO)
 
-    JWKS_URL = os.environ.get('JWKS_URL', None)
-    JWT_SECRET = os.environ.get('JWT_SECRET', 'secrets')
-    JWT_AUDIENCE = os.environ.get('JWT_AUDIENCE', None)
-    JWT_ALGORITHM = os.environ.get('JWT_ALGORITHM', 'HS256')
+    JWKS_URL = os.environ.get('JWKS_URL', None) or None
+    JWT_SECRET = os.environ.get('JWT_SECRET', 'secrets') or 'secrets'
+    JWT_AUDIENCE = os.environ.get('JWT_AUDIENCE', None) or None
+    JWT_ALGORITHM = os.environ.get('JWT_ALGORITHM', 'HS256') or 'HS256'
 
     POSTGRES_HOST = os.environ.get('POSTGRES_HOST')
     POSTGRES_PORT = os.environ.get('POSTGRES_PORT')

diff --git a/appserver/migrations/utils.py b/appserver/migrations/utils.py
@@ -1,9 +1,14 @@
+from arango.client import ArangoClient
 import multiprocessing as mp
+from typing import Dict, List
+
+from neo4japp.database import get_or_create_arango_client
 
 # flake8: noqa: OIG001 # It is legacy file with imports from appserver which we decided to not fix
 from neo4japp.models import Files
-from neo4japp.services.annotations.initializer import get_annotation_graph_service
 from neo4japp.services.annotations.constants import EntityType
+from neo4japp.services.annotations.utils.graph_queries import get_docs_by_ids_query
+from neo4japp.services.arangodb import execute_arango_query, get_db
 
 
 def window_chunk(q, windowsize=100):
@@ -20,6 +25,34 @@ def window_chunk(q, windowsize=100):
         yield chunk
 
 
+# NOTE DEPRECATED: just used in old migration
+def _get_mesh_by_ids_query():
+    return """
+    FOR doc IN mesh
+        FILTER 'TopicalDescriptor' IN doc.labels
+        FILTER doc.eid IN @ids
+        RETURN {'mesh_id': doc.eid, 'mesh_name': doc.name}
+    """
+
+
+def _get_mesh_from_mesh_ids(
+    arango_client: ArangoClient, mesh_ids: List[str]
+) -> Dict[str, str]:
+    result = execute_arango_query(
+        db=get_db(arango_client), query=_get_mesh_by_ids_query(), ids=mesh_ids
+    )
+    return {row['mesh_id']: row['mesh_name'] for row in result}
+
+
+def _get_nodes_from_node_ids(
+    arango_client: ArangoClient, entity_type: str, node_ids: List[str]
+) -> Dict[str, str]:
+    result = execute_arango_query(
+        db=get_db(arango_client), query=get_docs_by_ids_query(entity_type), ids=node_ids
+    )
+    return {row['entity_id']: row['entity_name'] for row in result}
+
+
 def get_primary_names(annotations):
     """Copied from AnnotationService.add_primary_name"""
     chemical_ids = set()
@@ -30,7 +63,7 @@ def get_primary_names(annotations):
     organism_ids = set()
     mesh_ids = set()
 
-    neo4j = get_annotation_graph_service()
+    arango_client = get_or_create_arango_client()
     updated_annotations = []
 
     # Note: We need to split the ids by colon because
@@ -77,25 +110,25 @@ def get_primary_names(annotations):
                 organism_ids.add(meta_id)
 
     try:
-        chemical_names = neo4j.get_nodes_from_node_ids(
-            EntityType.CHEMICAL.value, list(chemical_ids)
-        )  # noqa
-        compound_names = neo4j.get_nodes_from_node_ids(
-            EntityType.COMPOUND.value, list(compound_ids)
-        )  # noqa
-        disease_names = neo4j.get_nodes_from_node_ids(
-            EntityType.DISEASE.value, list(disease_ids)
+        chemical_names = _get_nodes_from_node_ids(
+            arango_client, EntityType.CHEMICAL.value, list(chemical_ids)
+        )
+        compound_names = _get_nodes_from_node_ids(
+            arango_client, EntityType.COMPOUND.value, list(compound_ids)
+        )
+        disease_names = _get_nodes_from_node_ids(
+            arango_client, EntityType.DISEASE.value, list(disease_ids)
+        )
+        gene_names = _get_nodes_from_node_ids(
+            arango_client, EntityType.GENE.value, list(gene_ids)
         )
-        gene_names = neo4j.get_nodes_from_node_ids(
-            EntityType.GENE.value, list(gene_ids)
+        protein_names = _get_nodes_from_node_ids(
+            arango_client, EntityType.PROTEIN.value, list(protein_ids)
         )
-        protein_names = neo4j.get_nodes_from_node_ids(
-            EntityType.PROTEIN.value, list(protein_ids)
+        organism_names = _get_nodes_from_node_ids(
+            arango_client, EntityType.SPECIES.value, list(organism_ids)
         )
-        organism_names = neo4j.get_nodes_from_node_ids(
-            EntityType.SPECIES.value, list(organism_ids)
-        )  # noqa
-        mesh_names = neo4j.get_mesh_from_mesh_ids(list(mesh_ids))
+        mesh_names = _get_mesh_from_mesh_ids(arango_client, list(mesh_ids))
     except Exception:
         raise
 

diff --git a/appserver/neo4japp/blueprints/annotations.py b/appserver/neo4japp/blueprints/annotations.py
@@ -73,7 +73,6 @@
 from ..services.annotations.initializer import (
     get_annotation_service,
     get_annotation_db_service,
-    get_annotation_graph_service,
     get_annotation_tokenizer,
     get_bioc_document_service,
     get_enrichment_annotation_service,
@@ -91,6 +90,7 @@
     get_global_inclusions_query,
     get_global_inclusions_count_query,
 )
+from ..services.arangodb import convert_datetime, execute_arango_query, get_db
 from ..services.enrichment.data_transfer_objects import EnrichmentCellTextMapping
 from ..services.filesystem import Filesystem
 from ..utils.globals import get_current_user
@@ -656,7 +656,6 @@ def _annotate(
         pipeline = Pipeline(
             {
                 'adbs': get_annotation_db_service,
-                'ags': get_annotation_graph_service,
                 'aers': get_recognition_service,
                 'tkner': get_annotation_tokenizer,
                 'as': get_annotation_service,
@@ -700,7 +699,6 @@ def _annotate_enrichment_table(
         pipeline = Pipeline(
             {
                 'adbs': get_annotation_db_service,
-                'ags': get_annotation_graph_service,
                 'aers': get_recognition_service,
                 'tkner': get_annotation_tokenizer,
                 'as': get_enrichment_annotation_service,
@@ -881,8 +879,11 @@ class GlobalAnnotationExportInclusions(MethodView):
     def get(self):
         yield g.current_user
 
-        graph = get_annotation_graph_service()
-        inclusions = graph.exec_read_query(get_global_inclusions_query())
+        arango_client = get_or_create_arango_client()
+        inclusions = execute_arango_query(
+            db=get_db(arango_client),
+            query=get_global_inclusions_query(),
+        )
 
         file_uuids = {inclusion['file_reference'] for inclusion in inclusions}
         file_data_query = db.session.query(
@@ -891,7 +892,7 @@ def get(self):
 
         file_uuids_map = {d.file_uuid: d.file_deleted_by for d in file_data_query}
 
-        def get_inclusion_for_review(inclusion, file_uuids_map, graph):
+        def get_inclusion_for_review(inclusion, file_uuids_map):
             user = AppUser.query.filter_by(
                 id=file_uuids_map[inclusion['file_reference']]
             ).one_or_none()
@@ -905,9 +906,7 @@ def get_inclusion_for_review(inclusion, file_uuids_map, graph):
                 'file_uuid': inclusion['file_reference'],
                 'file_deleted': deleter,
                 'type': ManualAnnotationType.INCLUSION.value,
-                'creation_date': str(
-                    graph.convert_datetime(inclusion['creation_date'])
-                ),
+                'creation_date': convert_datetime(inclusion['creation_date']),
                 'text': inclusion['synonym'],
                 'case_insensitive': True,
                 'entity_type': inclusion['entity_type'],
@@ -917,7 +916,7 @@ def get_inclusion_for_review(inclusion, file_uuids_map, graph):
             }
 
         data = [
-            get_inclusion_for_review(inclusion, file_uuids_map, graph)
+            get_inclusion_for_review(inclusion, file_uuids_map)
             for inclusion in inclusions
             if inclusion['file_reference'] in file_uuids_map
         ]
@@ -1036,10 +1035,12 @@ def get(self, params, global_type):
             ]
             query_total = exclusions.total
         else:
-            graph = get_annotation_graph_service()
-            global_inclusions = graph.exec_read_query_with_params(
-                get_global_inclusions_paginated_query(),
-                {'skip': 0 if page == 1 else (page - 1) * limit, 'limit': limit},
+            arango_client = get_or_create_arango_client()
+            global_inclusions = execute_arango_query(
+                db=get_db(arango_client),
+                query=get_global_inclusions_paginated_query(),
+                skip=0 if page == 1 else (page - 1) * limit,
+                limit=limit,
             )
 
             file_uuids = {
@@ -1065,7 +1066,7 @@ def get(self, params, global_type):
                     if file_uuids_map.get(i['file_reference'], True)
                     else False,
                     'type': ManualAnnotationType.INCLUSION.value,
-                    'creation_date': graph.convert_datetime(i['creation_date']),
+                    'creation_date': convert_datetime(i['creation_date']),
                     'text': i['synonym'],
                     'case_insensitive': True,
                     'entity_type': i['entity_type'],
@@ -1075,9 +1076,11 @@ def get(self, params, global_type):
                 }
                 for i in global_inclusions
             ]
-            query_total = graph.exec_read_query(get_global_inclusions_count_query())[0][
-                'total'
-            ]
+
+            query_total = execute_arango_query(
+                db=get_db(arango_client),
+                query=get_global_inclusions_count_query(),
+            )[0]['total']
 
         results = {'total': query_total, 'results': data}
         yield jsonify(GlobalAnnotationListSchema().dump(results))

diff --git a/appserver/neo4japp/blueprints/enrichment_table.py b/appserver/neo4japp/blueprints/enrichment_table.py
@@ -1,8 +1,12 @@
 from http import HTTPStatus
 
 from flask import Blueprint, request, jsonify
+import numpy as np
+from pandas import DataFrame
 
-from neo4japp.database import get_enrichment_table_service
+from neo4japp.constants import KGDomain
+from neo4japp.database import get_or_create_arango_client
+from neo4japp.services.enrichment.enrichment_table import get_genes, match_ncbi_genes
 
 
 bp = Blueprint('enrichment-table-api', __name__, url_prefix='/enrichment-table')
@@ -17,10 +21,37 @@ def match_ncbi_nodes():
     nodes = []
 
     if organism is not None and gene_names is not None:
-        enrichment_table = get_enrichment_table_service()
+        arango_client = get_or_create_arango_client()
         # list(dict...) is to drop duplicates, but want to keep order
-        nodes = enrichment_table.match_ncbi_genes(
-            list(dict.fromkeys(gene_names)), organism
+        nodes = match_ncbi_genes(
+            arango_client, list(dict.fromkeys(gene_names)), organism
         )
 
+    return jsonify({'result': nodes}), 200
+
+
+@bp.route('/get-ncbi-nodes/enrichment-domains', methods=['POST'])
+def get_ncbi_enrichment_domains():
+    """Find all domains matched to given node id, then return dictionary with all domains as
+    result. All domains should have matching indices e.g. regulon[1] should be data from
+    matching same node as uniprot[1].
+    """
+    # TODO: Validate incoming data using webargs + Marshmallow
+    data = request.get_json()
+    doc_ids = data.get('docIds')
+    tax_id = data.get('taxID')
+    domains = data.get('domains')
+
+    if doc_ids is not None and tax_id is not None:
+        arango_client = get_or_create_arango_client()
+        domain_nodes = {
+            domain.lower(): get_genes(arango_client, KGDomain(domain), doc_ids, tax_id)
+            for domain in domains
+        }
+        df = DataFrame(domain_nodes).replace({np.nan: None}).transpose()
+        # Redundant but just following old implementation
+        nodes = df.append(df.columns.to_series(name='doc_id')).to_dict()
+    else:
+        nodes = {}
+
     return jsonify({'result': nodes}), HTTPStatus.OK
diff --git a/appserver/neo4japp/blueprints/entity_resources.py b/appserver/neo4japp/blueprints/entity_resources.py
@@ -1,6 +1,7 @@
 from flask import Blueprint, request
 
-from neo4japp.models import AnnotationStyle, DomainURLsMap
+from neo4japp.constants import DOMAIN_URLS_MAP
+from neo4japp.models import AnnotationStyle
 
 bp = Blueprint('entity-resources', __name__, url_prefix='/entity-resources')
 
@@ -31,8 +32,8 @@ def get_uri():
     """
     payload = request.json
 
-    uri = DomainURLsMap.query.filter_by(domain=payload['domain'])[0]
-    return {'uri': uri.base_URL.format(payload['term'])}
+    uri = DOMAIN_URLS_MAP[payload['domain']]
+    return {'uri': uri.format(payload['term'])}
 
 
 @bp.route('/uri/batch', methods=['POST'])
@@ -60,7 +61,7 @@ def get_uri_batch():
     uris = []
     payload = request.json
     for entry in payload['batch']:
-        uri = DomainURLsMap.query.filter_by(domain=entry['domain'])[0]
-        uris.append({'uri': uri.base_URL.format(entry['term'])})
+        uri = DOMAIN_URLS_MAP[entry['domain']]
+        uris.append({'uri': uri.format(entry['term'])})
 
     return {'batch': uris}