
HCA replicas lack DRS URI and descriptor #6299

Open
nadove-ucsc opened this issue May 28, 2024 · 18 comments
Assignees
Labels
- Medium [priority]
- bug [type]: A defect preventing use of the system as specified
- debt [type]: A defect incurring continued engineering cost
- manifests [subject]: Generation and contents of manifests
- orange [process]: Done by the Azul team
- spike:8 [process]: Spike estimate of eight points

Comments

@nadove-ucsc
Contributor

nadove-ucsc commented May 28, 2024

#6299 (comment)

@nadove-ucsc nadove-ucsc added the orange [process] Done by the Azul team label May 28, 2024
@nadove-ucsc
Contributor Author

Assignee to add reproduction using JSONL manifest, and BigQuery output from files table for reference.

@nadove-ucsc nadove-ucsc self-assigned this May 28, 2024
@nadove-ucsc
Contributor Author

nadove-ucsc commented May 29, 2024

Here is an example of file metadata containing a DRS URI and descriptor, for this file:

select * from `datarepo-f4d7c97e.hca_prod_2ad191cdbd7a409b9bd1e72b5e4cce81__20220111_dcp2_20220114_dcp12.sequence_file`
limit 1

With serialized JSON strings expanded in the content and descriptor columns for legibility:

{
    "datarepo_row_id": "3a3f3042-672c-48d6-917e-ad7fede70096",
    "content": {
        "describedBy": "https://schema.humancellatlas.org/type/file/9.3.1/sequence_file",
        "schema_type": "file",
        "file_core": {
            "file_name": "SRR13509367_83942_MissingLibrary_1_HNJJVBBXX_R1.fastq.gz",
            "format": "fastq.gz",
            "content_description": [
                {
                    "text": "DNA sequence",
                    "ontology": "data:3494",
                    "ontology_label": "DNA sequence"
                }
            ]
        },
        "read_index": "read1",
        "lane_index": 9,
        "insdc_run_accessions": [
            "SRR13509367"
        ],
        "provenance": {
            "document_id": "74df138b-a13b-412c-a9c5-f80336e42c89",
            "submission_date": "2021-11-26T15:22:17.318Z",
            "update_date": "2021-11-29T10:27:18.731Z",
            "schema_major_version": 9,
            "schema_minor_version": 3
        }
    },
    "file_id": "drs://data.terra.bio/v1_aeb2bd6b-e22b-404e-b18a-ea478a6c3419_ee74e15b-107b-4065-adef-a54684cb0883",
    "version": "2021-11-26 15:22:17.318000 UTC",
    "sequence_file_id": "74df138b-a13b-412c-a9c5-f80336e42c89",
    "descriptor": {
        "file_id": "4e05daaa-13f4-4aa4-8f85-a8b716cdd3f8",
        "file_version": "2021-11-26T15:22:17.318000Z",
        "file_name": "SRR13509367_83942_MissingLibrary_1_HNJJVBBXX_R1.fastq.gz",
        "content_type": "binary/octet-stream; dcp-type=data",
        "size": 8981108947,
        "sha1": "7744af1664b2bc5145bf9ed8c5d7e3050185f8f9",
        "sha256": "561732126211c60d914bdeee6daadbd3f894663fc94f1cc695e26bcff2388d09",
        "crc32c": "9f52041d",
        "s3_etag": "4e8c0927af738182d64bd03b0f388896-134",
        "schema_type": "file_descriptor",
        "describedBy": "https://schema.humancellatlas.org/system/2.0.0/file_descriptor",
        "schema_version": "2.0.0"
    }
}

@nadove-ucsc
Contributor Author

The JSONL manifest from filtering for only that file:

curl -X 'PUT' \
  'https://service.azul.data.humancellatlas.org/fetch/manifest/files?catalog=dcp37&filters=%7B%0A%20%20%22fileName%22%3A%20%7B%0A%20%20%20%20%22is%22%3A%20%5B%0A%20%20%20%20%20%20%22SRR13509367_83942_MissingLibrary_1_HNJJVBBXX_R1.fastq.gz%22%0A%20%20%20%20%5D%0A%20%20%7D%0A%7D&format=verbatim.jsonl' \
  -H 'accept: application/json'
$ cat Desktop/hca-manifest-c88d5f47-dc61-5397-9172-54660398122d.87b5592e-c7c2-5da8-98fc-5c39d257c885.jsonl | jq 'select(.type == "file")'
{
  "contents": {
    "describedBy": "https://schema.humancellatlas.org/type/file/9.3.1/sequence_file",
    "schema_type": "file",
    "file_core": {
      "file_name": "SRR13509367_83942_MissingLibrary_1_HNJJVBBXX_R1.fastq.gz",
      "format": "fastq.gz",
      "content_description": [
        {
          "text": "DNA sequence",
          "ontology": "data:3494",
          "ontology_label": "DNA sequence"
        }
      ]
    },
    "read_index": "read1",
    "lane_index": 9,
    "insdc_run_accessions": [
      "SRR13509367"
    ],
    "provenance": {
      "document_id": "74df138b-a13b-412c-a9c5-f80336e42c89",
      "submission_date": "2021-11-26T15:22:17.318Z",
      "update_date": "2021-11-29T10:27:18.731Z",
      "schema_major_version": 9,
      "schema_minor_version": 3
    }
  },
  "type": "file"
}

@nadove-ucsc nadove-ucsc removed their assignment May 29, 2024
@dsotirho-ucsc
Contributor

@hannes-ucsc: "Depending on the outcome of the on-going Slack discussion with the Terra import team, we could either produce just the content, descriptor and file_id columns (for file tables, just content for other tables), or completely flatten the metadata into a dictionary where the keys are dotted paths and the values are arrays. We could also offer two different flavors of the verbatim PFB manifest."

@dsotirho-ucsc
Contributor

Assignee to consider next steps.

@hannes-ucsc
Member

hannes-ucsc commented Jul 12, 2024

The issue is that we currently only include the value of the contents column from the snapshot's BQ table. To include the file_id and descriptor columns we would have to invent column names for them that don't conflict with any top-level property names from the HCA schema. That seems arbitrary and brittle.

The completely flattened approach ("file_core.content_description.text": "DNA sequence") would be awkward to use for values that are arrays. The column name, essentially a JSON path, would have to include the [] notation, leading to column names not known in advance, or the column values would have to be arrays, requiring users to correlate array indices across columns (the n-th value in column a.b.c corresponds to the n-th value in column a.b.d). This would be further complicated by the question of how to handle arrays whose values are dictionaries containing further dictionaries or arrays.

A partially flattened approach seems like the best compromise. Knowing the HCA schema, it appears that a flattening depth (number of dots in the column name) of 2 would be most useful. To avoid the array complexities described above, the flattening would have to be shallower when it encounters an array value. The resulting replica entity in the PFB would look as follows:

{
    "datarepo_row_id": "3a3f3042-672c-48d6-917e-ad7fede70096",
    "file_id": "drs://data.terra.bio/v1_aeb2bd6b-e22b-404e-b18a-ea478a6c3419_ee74e15b-107b-4065-adef-a54684cb0883",
    "content.describedBy": "https://schema.humancellatlas.org/type/file/9.3.1/sequence_file",
    "content.schema_type": "file",
    "content.file_core.file_name": "SRR13509367_83942_MissingLibrary_1_HNJJVBBXX_R1.fastq.gz",
    "content.file_core.format": "fastq.gz",
    "content.file_core.content_description": [
        {
            "text": "DNA sequence",
            "ontology": "data:3494",
            "ontology_label": "DNA sequence"
        }
    ],
    "content.read_index": "read1",
    "content.lane_index": 9,
    "content.insdc_run_accessions": [
        "SRR13509367"
    ],
    "content.provenance.document_id": "74df138b-a13b-412c-a9c5-f80336e42c89",
    "content.provenance.submission_date": "2021-11-26T15:22:17.318Z",
    "content.provenance.update_date": "2021-11-29T10:27:18.731Z",
    "content.provenance.schema_major_version": 9,
    "content.provenance.schema_minor_version": 3,
    "descriptor.file_id": "4e05daaa-13f4-4aa4-8f85-a8b716cdd3f8",
    "descriptor.file_version": "2021-11-26T15:22:17.318000Z",
    "descriptor.file_name": "SRR13509367_83942_MissingLibrary_1_HNJJVBBXX_R1.fastq.gz",
    "descriptor.content_type": "binary/octet-stream; dcp-type=data",
    "descriptor.size": 8981108947,
    "descriptor.sha1": "7744af1664b2bc5145bf9ed8c5d7e3050185f8f9",
    "descriptor.sha256": "561732126211c60d914bdeee6daadbd3f894663fc94f1cc695e26bcff2388d09",
    "descriptor.crc32c": "9f52041d",
    "descriptor.s3_etag": "4e8c0927af738182d64bd03b0f388896-134",
    "descriptor.schema_type": "file_descriptor",
    "descriptor.describedBy": "https://schema.humancellatlas.org/system/2.0.0/file_descriptor",
    "descriptor.schema_version": "2.0.0"
}

This approach is bijective (aka lossless aka reversible), doesn't lead to an excessive number of columns, solves this issue, and doesn't require inventing column names to avoid naming conflicts.
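The partial flattening described above can be sketched as follows. This is a hypothetical illustration, not code from the Azul codebase: nested dictionaries are flattened until the column name would exceed the maximum number of dots, and array values are always kept verbatim rather than recursed into.

```python
def partially_flatten(doc, max_dots=2, sep='.'):
    """
    Flatten nested dicts in `doc` into dotted column names with at most
    `max_dots` separators. Arrays (and any dicts deeper than the limit)
    are stored verbatim under the column name accumulated so far.
    """
    flat = {}

    def visit(value, path):
        # Recurse into non-empty dicts as long as adding another path
        # component keeps the column name within the dot limit.
        if isinstance(value, dict) and value and len(path) <= max_dots:
            for key, child in value.items():
                visit(child, path + [key])
        else:
            flat[sep.join(path)] = value

    for key, value in doc.items():
        visit(value, [key])
    return flat
```

Applied to the replica above, this yields exactly the column names shown: `content.file_core.file_name` (two dots), while `content.file_core.content_description` remains an array of dictionaries.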

@hannes-ucsc
Member

hannes-ucsc commented Jul 12, 2024

Spike to review above comment and determine if dots are allowed in AvroPFB property names, and if those are retained on import, resulting in dots in workspace table column names.
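For reference, the Avro specification restricts names (including record field names) to the pattern `[A-Za-z_][A-Za-z0-9_]*`, so a dot can only ever appear as a namespace separator, never inside a field name. A quick hedged check against that grammar:

```python
import re

# Name grammar from the Avro specification; note that it also excludes
# '-', though how strictly importers enforce this varies by implementation.
AVRO_NAME = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')


def is_valid_avro_name(name: str) -> bool:
    """True if `name` is a valid Avro record/field name per the spec."""
    return AVRO_NAME.fullmatch(name) is not None
```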

@hannes-ucsc hannes-ucsc added the spike:2 [process] Spike estimate of two points label Jul 12, 2024
@hannes-ucsc hannes-ucsc removed their assignment Jul 12, 2024
@achave11-ucsc achave11-ucsc added spike:3 [process] Spike estimate of three points and removed spike:2 [process] Spike estimate of two points labels Jul 12, 2024
@nadove-ucsc
Contributor Author

Terra does NOT allow dots in AvroPFB property names:

(screenshot of the failed Terra import omitted)

Workspace link

I tested this using the following patch. I used an AnVIL manifest instead of HCA because it was easier to change the PFB schema.

ee02daba2 (HEAD -> issues/nadove-ucsc/6299-hca-replicas-lack-drs-uri-descriptor) drop! Test support for . in PFB handover columns
diff --git a/src/azul/plugins/metadata/anvil/__init__.py b/src/azul/plugins/metadata/anvil/__init__.py
index 77fea54c3..c7e5f451f 100644
--- a/src/azul/plugins/metadata/anvil/__init__.py
+++ b/src/azul/plugins/metadata/anvil/__init__.py
@@ -304,6 +304,13 @@ class Plugin(MetadataPlugin[AnvilBundle]):
     def verbatim_pfb_schema(self,
                             replicas: Iterable[JSON]
                             ) -> tuple[Iterable[JSON], Sequence[str], JSON]:
+
+        def insert_dot(r):
+            r = r.copy()
+            r['contents'] = {'foo.bar.' + k: v for k, v in r['contents'].items()}
+            return r
+
+        return super().verbatim_pfb_schema(map(insert_dot, replicas))
         entity_schemas = []
         entity_types = []
         for table_schema in sorted(anvil_schema['tables'], key=itemgetter('name')):

Here is a preview of the manifest that failed to import (it contains only public entities):

$ pfb show -i Desktop/anvil-manifest-d25978db-c07b-59c7-942b-398ac4130346.ad14b32b-dea0-555e-b745-8ebe5bb0958d.avro  | jq . | head -50
{
  "id": "636f608f-459e-46ec-bad2-13ad1867686a",
  "name": "anvil_file",
  "object": {
    "foo.bar.crc32": "",
    "foo.bar.data_modality": [],
    "foo.bar.datarepo_row_id": "636f608f-459e-46ec-bad2-13ad1867686a",
    "foo.bar.drs_uri": "drs://jade.datarepo-dev.broadinstitute.org/v1_790795c4-49b1-4ac8-a060-207b92ea08c5_3c387d5a-69e3-4bd4-abd5-283a09a8ad09",
    "foo.bar.file_format": ".crai",
    "foo.bar.file_id": "3c387d5a-69e3-4bd4-abd5-283a09a8ad09",
    "foo.bar.file_md5sum": "To/CtLVpMUPpulH7wLrNOA==",
    "foo.bar.file_name": "NA19239.final.cram.crai",
    "foo.bar.file_ref": "drs://jade.datarepo-dev.broadinstitute.org/v1_790795c4-49b1-4ac8-a060-207b92ea08c5_3c387d5a-69e3-4bd4-abd5-283a09a8ad09",
    "foo.bar.file_size": 1324028,
    "foo.bar.is_supplementary": false,
    "foo.bar.reference_assembly": [],
    "foo.bar.sha256": "",
    "foo.bar.source_datarepo_row_ids": [
      "file_inventory:34d54283-56b1-4d99-8994-20da842dd8cc"
    ],
    "foo.bar.version": "2022-06-01T00:00:00.000000Z"
  },
  "relations": []
}
{
  "id": "4156f936-563a-42c5-b2a2-eae115f0d067",
  "name": "anvil_activity",
...

@nadove-ucsc
Contributor Author

I have one other observation. The design of the verbatim handover says:

The .content property [of a replica] contains the verbatim metadata entity as supplied by the repository plugin.

It is not trivial to incorporate the descriptor into this design, regardless of how it is represented in the manifest. The metadata entities supplied by the repository plugin are passed to the metadata plugin via the metadata_files attribute of an HCABundle instance. As we populate this bundle, we dissolve each entity's descriptor into the bundle's synthetic manifest* entry, thus erasing the original column structure.

*The old meaning of manifest, not related to the handover

The metadata plugin could reconstruct most of the descriptor from the manifest fields, but that may result in the omission of descriptor fields that don't have an analog in the manifest entry. A bigger problem is external DRS URIs: the manifest entry's drs_uri field may come either directly from the file_id column or, if the file is "external", from a field in the descriptor column. Currently, there's no way to recover whether the DRS URI was external (and thus which replica field it belongs in) after the entity is added to the bundle.

I suggest the spike be extended to come up with a design for how to address this problem.

@nadove-ucsc nadove-ucsc removed their assignment Jul 23, 2024
@dsotirho-ucsc
Contributor

Assignee to consider next steps.

@hannes-ucsc
Member

I suggest the spike be extended to come up with a design for how to address this problem.

Agreed. I think the solution to retaining the descriptor may be to pass it along in the bundle, as an additional entry so that no reconstruction is necessary.

I still like the partial unnesting I proposed above. Could you please try a dash - as the path separator?

@dsotirho-ucsc dsotirho-ucsc added spike:8 [process] Spike estimate of eight points and removed spike:3 [process] Spike estimate of three points labels Jul 29, 2024
@nadove-ucsc
Contributor Author

Postponing this part of the spike:

I think the solution to retaining the descriptor may be to pass it along in the bundle, as an additional entry so that no reconstruction is necessary.

@nadove-ucsc
Contributor Author

Could you please try a dash - as the path separator?

Manifest using - as a path separator imported successfully.

(screenshot of the successful Terra import omitted)

@nadove-ucsc
Contributor Author

As of PR #6485, the following work remains outstanding for this issue:

Eliminate the notion of a "manifest" from HCA bundles. Replace the manifest entries with verbatim copies of file descriptors, and add a level of nesting to the "metadata" attribute so that the existing values are moved to a property called "content". For some entities, this will be the only property, but files will have two additional properties: "descriptor" and "file_id". The intent is for the properties of each value to correspond to BigQuery columns.

Once this restructuring is complete, we can include these additional properties in the replicas. The verbatim manifest generator will then partially flatten the replicas as explained #6299 (comment), using __ (two underscores) as a path separator.
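A hypothetical sketch of the inverse operation, illustrating why the `__`-separated flattening is lossless: splitting each column name on the separator reconstructs the original nesting, assuming no metadata key itself contains `__`.

```python
def unflatten(flat, sep='__'):
    """
    Rebuild a nested replica from its flattened form. Lossless as long as
    no original property name contains the separator.
    """
    nested = {}
    for path, value in flat.items():
        *parents, leaf = path.split(sep)
        node = nested
        for key in parents:
            # Create intermediate dicts as needed while walking the path.
            node = node.setdefault(key, {})
        node[leaf] = value
    return nested
```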

Addendum: the stitched attribute added in this PR could also be folded into the properties of the metadata values, acting like a pseudo-column. If this is done, then the new, refactored structure of HCA bundles will very closely resemble the current structure of AnVIL bundles. It may then be relatively low-hanging fruit to eliminate the remaining inconsistencies between the HCABundle and AnvilBundle interfaces, which could help streamline a lot of our testing code.

(copied from the PR (1) (2))

@hannes-ucsc hannes-ucsc added no demo [process] Not to be demonstrated at the end of the sprint and removed no demo [process] Not to be demonstrated at the end of the sprint labels Aug 30, 2024