AnVIL indexer doesn't follow downstream links from files to files #4761

Closed
14 tasks
nadove-ucsc opened this issue Nov 23, 2022 · 1 comment
Labels
bug [type] A defect preventing use of the system as specified
demo [process] To be demonstrated at the end of the sprint
demoed [process] Successfully demonstrated to team
indexer [subject] The indexer part of Azul
orange [process] Done by the Azul team

Comments

@nadove-ucsc
Contributor

self._downstream_from_files(source, entities['files'])

The call uses the plural 'files' where the proper key is the singular 'file'. Since the container is a defaultdict, an empty set is returned instead of raising a KeyError. This causes us to skip the step where we follow downstream links from files to other files. As a result, other bugs in this step go undetected.
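
To illustrate the failure mode described above, here is a minimal standalone sketch. It is not the Azul code: `consolidate_by_type` and the `(entity_type, key)` tuples are simplified, hypothetical stand-ins for `_consolidate_by_type` and the plugin's key references.

```python
from collections import defaultdict


def consolidate_by_type(entities):
    # Hypothetical stand-in: group keys by entity type using a defaultdict,
    # as the code did before the fix.
    result = defaultdict(set)
    for entity_type, key in entities:
        result[entity_type].add(key)
    return result


entities = consolidate_by_type([
    ('biosample', 'b1'),
    ('file', 'f1'),
    ('file', 'f2'),
])

# The misspelled key does not raise. The defaultdict quietly creates and
# returns an empty set, so any step fed with it has nothing to process.
files_wrong = entities['files']
files_right = entities['file']

print(files_wrong)
print(sorted(files_right))
```

Looking up the misspelled key also inserts it into the mapping as a side effect, which makes the mistake even harder to spot afterwards.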

The impact appears to be low. When I reindexed my personal deployment (with the same source config as anvilbox) with a fix in place, the number of indexed files remained constant at 174. The fix I used was:

diff --git a/src/azul/plugins/repository/tdr_anvil/__init__.py b/src/azul/plugins/repository/tdr_anvil/__init__.py
index 9ee680e7..fc5c5910 100644
--- a/src/azul/plugins/repository/tdr_anvil/__init__.py
+++ b/src/azul/plugins/repository/tdr_anvil/__init__.py
@@ -229,7 +229,7 @@ class Plugin(TDRPlugin):
         return bundle_entity
 
     def _consolidate_by_type(self, entities: Keys) -> MutableKeysByType:
-        result = defaultdict(set)
+        result = {entity_type: set() for entity_type in self.indexed_columns_by_entity_type}
         for e in entities:
             result[e.entity_type].add(e.key)
         return result
@@ -259,7 +259,7 @@ class Plugin(TDRPlugin):
                            ) -> Links:
         return set.union(
             self._downstream_from_biosamples(source, entities['biosample']),
-            self._downstream_from_files(source, entities['files'])
+            self._downstream_from_files(source, entities['file'])
         )
 
     def _upstream_from_biosamples(self,
@@ -407,8 +407,8 @@ class Plugin(TDRPlugin):
         rows = self._run_sql(f'''
             WITH activities AS (
                 SELECT
-                    ala.alignmentactivity_id,
-                    'alignmentactivity',
+                    ala.alignmentactivity_id AS activity_id,
+                    'alignmentactivity' AS activity_table,
                     ala.used_file_id,
                     ala.generated_file_id
                 FROM {backtick(self._full_table_name(source, 'alignmentactivity'))} AS ala
@@ -433,7 +433,7 @@ class Plugin(TDRPlugin):
                             KeyReference(key=file_id, entity_type='file')
                             for file_id in row['generated_file_id']
                         ],
-                        activity=KeyReference(key=row['actvity_id'], entity_type=row['activity_table']))
+                        activity=KeyReference(key=row['activity_id'], entity_type=row['activity_table']))
             for row in rows
         }
 
@@ -442,6 +442,8 @@ class Plugin(TDRPlugin):
                            entity_type: EntityType,
                            keys: AbstractSet[Key],
                            ) -> MutableJSONs:
+        if not keys:
+            return []
         table_name = self._full_table_name(source, entity_type)
         columns = set.union(
             self.common_indexed_columns,
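
The first hunk of the diff above replaces the defaultdict with a plain dict that is pre-populated with the known entity types. A minimal sketch of that idea follows, with the assumption that `KNOWN_ENTITY_TYPES` is a hypothetical stand-in for the keys of `self.indexed_columns_by_entity_type` and that entities are simplified to `(entity_type, key)` tuples.

```python
# Hypothetical stand-in for the keys of indexed_columns_by_entity_type
KNOWN_ENTITY_TYPES = ('biosample', 'file')


def consolidate_by_type_strict(entities):
    # Pre-populate one empty set per known type. Unknown or misspelled
    # types now raise KeyError instead of silently yielding an empty set.
    result = {entity_type: set() for entity_type in KNOWN_ENTITY_TYPES}
    for entity_type, key in entities:
        result[entity_type].add(key)
    return result


strict = consolidate_by_type_strict([('file', 'f1')])

try:
    strict['files']
    typo_detected = False
except KeyError:
    typo_detected = True

# A known type without any entities still maps to an empty set.
no_biosamples = strict['biosample']
```

In this sketch a known type with no entities maps to an empty set, so code that consumes these sets should tolerate empty input.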
  • Security design review completed; the Resolution of this issue does not
    • … affect authentication; for example:
      • OAuth 2.0 with the application (API or Swagger UI)
      • Authentication of developers with Google Cloud APIs
      • Authentication of developers with AWS APIs
      • Authentication with a GitLab instance in the system
      • Password and 2FA authentication with GitHub
      • API access token authentication with GitHub
      • Authentication with
    • … affect the permissions of internal users like access to
      • Cloud resources on AWS and GCP
      • GitLab repositories, projects and groups, administration
      • an EC2 instance via SSH
      • GitHub issues, pull requests, commits, commit statuses, wikis, repositories, organizations
    • … affect the permissions of external users like access to
      • TDR snapshots
    • … affect permissions of service or bot accounts
      • Cloud resources on AWS and GCP
    • … affect audit logging in the system, like
      • adding, removing or changing a log message that represents an auditable event
      • changing the routing of log messages through the system
    • … affect monitoring of the system
    • … introduce a new software dependency like
      • Python packages on PYPI
      • Command-line utilities
      • Docker images
      • Terraform providers
    • … add an interface that exposes sensitive or confidential data at the security boundary
    • … affect the encryption of data at rest
    • … require persistence of sensitive or confidential data that might require encryption at rest
    • … require unencrypted transmission of data within the security boundary
    • … affect the network security layer; for example by
      • modifying, adding or removing firewall rules
      • modifying, adding or removing security groups
      • changing or adding a port a service, proxy or load balancer listens on
  • Documentation on any unchecked boxes is provided in comments below
@nadove-ucsc nadove-ucsc added the orange [process] Done by the Azul team label Nov 23, 2022
@dsotirho-ucsc dsotirho-ucsc added bug [type] A defect preventing use of the system as specified indexer [subject] The indexer part of Azul labels Nov 23, 2022
@hannes-ucsc
Member

For demo, find a file that is derived from another file (in a TDR snapshot) and show that it shows up in the service. IIRC, the v4 snapshots have some of those. This means the demo will have to wait until PR #4741 for #4617 lands.

@hannes-ucsc hannes-ucsc added demo blocked [process] Demo is blocked by ongoing work demo [process] To be demonstrated at the end of the sprint labels Nov 30, 2022
@nadove-ucsc nadove-ucsc removed the demo blocked [process] Demo is blocked by ongoing work label Dec 13, 2022
@hannes-ucsc hannes-ucsc added the demoed [process] Successfully demonstrated to team label Dec 13, 2022

3 participants