
HCA replicas lack DRS URI and descriptor #6299

Open
nadove-ucsc opened this issue May 28, 2024 · 18 comments
Assignees
Labels
- Medium [priority]
- bug [type]: A defect preventing use of the system as specified
- debt [type]: A defect incurring continued engineering cost
- manifests [subject]: Generation and contents of manifests
- orange [process]: Done by the Azul team
- spike:8 [process]: Spike estimate of eight points

Comments

@nadove-ucsc
Contributor

nadove-ucsc commented May 28, 2024

#6299 (comment)

@nadove-ucsc nadove-ucsc added the orange [process] Done by the Azul team label May 28, 2024
@nadove-ucsc
Contributor Author

Assignee to add reproduction using JSONL manifest, and BigQuery output from files table for reference.

@nadove-ucsc nadove-ucsc self-assigned this May 28, 2024
@nadove-ucsc
Contributor Author

nadove-ucsc commented May 29, 2024

Here is an example of file metadata containing a DRS URI and descriptor, for this file:

select * from `datarepo-f4d7c97e.hca_prod_2ad191cdbd7a409b9bd1e72b5e4cce81__20220111_dcp2_20220114_dcp12.sequence_file`
limit 1

With serialized JSON strings expanded in the content and descriptor columns for legibility:

{
    "datarepo_row_id": "3a3f3042-672c-48d6-917e-ad7fede70096",
    "content": {
        "describedBy": "https://schema.humancellatlas.org/type/file/9.3.1/sequence_file",
        "schema_type": "file",
        "file_core": {
            "file_name": "SRR13509367_83942_MissingLibrary_1_HNJJVBBXX_R1.fastq.gz",
            "format": "fastq.gz",
            "content_description": [
                {
                    "text": "DNA sequence",
                    "ontology": "data:3494",
                    "ontology_label": "DNA sequence"
                }
            ]
        },
        "read_index": "read1",
        "lane_index": 9,
        "insdc_run_accessions": [
            "SRR13509367"
        ],
        "provenance": {
            "document_id": "74df138b-a13b-412c-a9c5-f80336e42c89",
            "submission_date": "2021-11-26T15:22:17.318Z",
            "update_date": "2021-11-29T10:27:18.731Z",
            "schema_major_version": 9,
            "schema_minor_version": 3
        }
    },
    "file_id": "drs://data.terra.bio/v1_aeb2bd6b-e22b-404e-b18a-ea478a6c3419_ee74e15b-107b-4065-adef-a54684cb0883",
    "version": "2021-11-26 15:22:17.318000 UTC",
    "sequence_file_id": "74df138b-a13b-412c-a9c5-f80336e42c89",
    "descriptor": {
        "file_id": "4e05daaa-13f4-4aa4-8f85-a8b716cdd3f8",
        "file_version": "2021-11-26T15:22:17.318000Z",
        "file_name": "SRR13509367_83942_MissingLibrary_1_HNJJVBBXX_R1.fastq.gz",
        "content_type": "binary/octet-stream; dcp-type=data",
        "size": 8981108947,
        "sha1": "7744af1664b2bc5145bf9ed8c5d7e3050185f8f9",
        "sha256": "561732126211c60d914bdeee6daadbd3f894663fc94f1cc695e26bcff2388d09",
        "crc32c": "9f52041d",
        "s3_etag": "4e8c0927af738182d64bd03b0f388896-134",
        "schema_type": "file_descriptor",
        "describedBy": "https://schema.humancellatlas.org/system/2.0.0/file_descriptor",
        "schema_version": "2.0.0"
    }
}

@nadove-ucsc
Contributor Author

The JSONL manifest from filtering for only that file:

curl -X 'PUT' \
  'https://service.azul.data.humancellatlas.org/fetch/manifest/files?catalog=dcp37&filters=%7B%0A%20%20%22fileName%22%3A%20%7B%0A%20%20%20%20%22is%22%3A%20%5B%0A%20%20%20%20%20%20%22SRR13509367_83942_MissingLibrary_1_HNJJVBBXX_R1.fastq.gz%22%0A%20%20%20%20%5D%0A%20%20%7D%0A%7D&format=verbatim.jsonl' \
  -H 'accept: application/json'
$ cat Desktop/hca-manifest-c88d5f47-dc61-5397-9172-54660398122d.87b5592e-c7c2-5da8-98fc-5c39d257c885.jsonl | jq 'select(.type == "file")'
{
  "contents": {
    "describedBy": "https://schema.humancellatlas.org/type/file/9.3.1/sequence_file",
    "schema_type": "file",
    "file_core": {
      "file_name": "SRR13509367_83942_MissingLibrary_1_HNJJVBBXX_R1.fastq.gz",
      "format": "fastq.gz",
      "content_description": [
        {
          "text": "DNA sequence",
          "ontology": "data:3494",
          "ontology_label": "DNA sequence"
        }
      ]
    },
    "read_index": "read1",
    "lane_index": 9,
    "insdc_run_accessions": [
      "SRR13509367"
    ],
    "provenance": {
      "document_id": "74df138b-a13b-412c-a9c5-f80336e42c89",
      "submission_date": "2021-11-26T15:22:17.318Z",
      "update_date": "2021-11-29T10:27:18.731Z",
      "schema_major_version": 9,
      "schema_minor_version": 3
    }
  },
  "type": "file"
}

@nadove-ucsc nadove-ucsc removed their assignment May 29, 2024
@dsotirho-ucsc
Contributor

@hannes-ucsc: "Depending on the outcome of the on-going Slack discussion with the Terra import team, we could either produce just the content, descriptor and file_id columns (for file tables, just content for other tables), or completely flatten the metadata into a dictionary where the keys are dotted paths and the values are arrays. We could also offer two different flavors of the verbatim PFB manifest."

@dsotirho-ucsc
Contributor

Assignee to consider next steps.

@hannes-ucsc
Member

hannes-ucsc commented Jul 12, 2024

The issue is that we currently only include the value of the contents column from the snapshot's BQ table. To include the file_id and descriptor columns we would have to invent column names for them that don't conflict with any top-level property names from the HCA schema. That seems arbitrary and brittle.

The completely flattened approach ("file_core.content_description.text": "DNA sequence") would be awkward to use for values that are arrays. The column name, essentially a JSON path, would have to include the [] notation, leading to column names not known in advance, or the column values would have to be arrays, requiring users to correlate array indices across columns (the n-th value in column a.b.c corresponds to the n-th value in column a.b.d). This would be further complicated by the question of how to handle arrays whose values are dictionaries containing further dictionaries or arrays.

A partially flattened approach seems like the best compromise. Knowing the HCA schema, it appears that a flattening depth (number of dots in the column name) of 2 would be most useful. To avoid the array complexities described above, the flattening would have to be shallower when it encounters an array value. The resulting replica entity in the PFB would look as follows:

{
    "datarepo_row_id": "3a3f3042-672c-48d6-917e-ad7fede70096",
    "file_id": "drs://data.terra.bio/v1_aeb2bd6b-e22b-404e-b18a-ea478a6c3419_ee74e15b-107b-4065-adef-a54684cb0883",
    "content.describedBy": "https://schema.humancellatlas.org/type/file/9.3.1/sequence_file",
    "content.schema_type": "file",
    "content.file_core.file_name": "SRR13509367_83942_MissingLibrary_1_HNJJVBBXX_R1.fastq.gz",
    "content.file_core.format": "fastq.gz",
    "content.file_core.content_description": [
        {
            "text": "DNA sequence",
            "ontology": "data:3494",
            "ontology_label": "DNA sequence"
        }
    ],
    "content.read_index": "read1",
    "content.lane_index": 9,
    "content.insdc_run_accessions": [
        "SRR13509367"
    ],
    "content.provenance.document_id": "74df138b-a13b-412c-a9c5-f80336e42c89",
    "content.provenance.submission_date": "2021-11-26T15:22:17.318Z",
    "content.provenance.update_date": "2021-11-29T10:27:18.731Z",
    "content.provenance.schema_major_version": 9,
    "content.provenance.schema_minor_version": 3,
    "descriptor.file_id": "4e05daaa-13f4-4aa4-8f85-a8b716cdd3f8",
    "descriptor.file_version": "2021-11-26T15:22:17.318000Z",
    "descriptor.file_name": "SRR13509367_83942_MissingLibrary_1_HNJJVBBXX_R1.fastq.gz",
    "descriptor.content_type": "binary/octet-stream; dcp-type=data",
    "descriptor.size": 8981108947,
    "descriptor.sha1": "7744af1664b2bc5145bf9ed8c5d7e3050185f8f9",
    "descriptor.sha256": "561732126211c60d914bdeee6daadbd3f894663fc94f1cc695e26bcff2388d09",
    "descriptor.crc32c": "9f52041d",
    "descriptor.s3_etag": "4e8c0927af738182d64bd03b0f388896-134",
    "descriptor.schema_type": "file_descriptor",
    "descriptor.describedBy": "https://schema.humancellatlas.org/system/2.0.0/file_descriptor",
    "descriptor.schema_version": "2.0.0"
}

This approach is bijective (aka lossless aka reversible), doesn't lead to an excessive number of columns, solves this issue, and doesn't require inventing column names to avoid naming conflicts.
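The partial flattening described above can be sketched as follows. This is a hypothetical illustration, not code from the Azul codebase: nested dictionaries are flattened until the column name would exceed the maximum number of dots, and array values are always kept verbatim rather than recursed into.

```python
def partially_flatten(doc, max_dots=2, sep='.'):
    """
    Flatten nested dicts in `doc` into dotted column names with at most
    `max_dots` separators. Arrays (and any dicts deeper than the limit)
    are stored verbatim under the column name accumulated so far.
    """
    flat = {}

    def visit(value, path):
        # Recurse into non-empty dicts as long as adding another path
        # component keeps the column name within the dot limit.
        if isinstance(value, dict) and value and len(path) <= max_dots:
            for key, child in value.items():
                visit(child, path + [key])
        else:
            flat[sep.join(path)] = value

    for key, value in doc.items():
        visit(value, [key])
    return flat
```

Applied to the replica above, this yields exactly the column names shown: `content.file_core.file_name` (two dots), while `content.file_core.content_description` remains an array of dictionaries.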

@hannes-ucsc
Member

hannes-ucsc commented Jul 12, 2024

Spike to review above comment and determine if dots are allowed in AvroPFB property names, and if those are retained on import, resulting in dots in workspace table column names.
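For reference, the Avro specification restricts names (including record field names) to the pattern `[A-Za-z_][A-Za-z0-9_]*`, so a dot can only ever appear as a namespace separator, never inside a field name. A quick hedged check against that grammar:

```python
import re

# Name grammar from the Avro specification; note that it also excludes
# '-', though how strictly importers enforce this varies by implementation.
AVRO_NAME = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')


def is_valid_avro_name(name: str) -> bool:
    """True if `name` is a valid Avro record/field name per the spec."""
    return AVRO_NAME.fullmatch(name) is not None
```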

@hannes-ucsc hannes-ucsc added the spike:2 [process] Spike estimate of two points label Jul 12, 2024
@hannes-ucsc hannes-ucsc removed their assignment Jul 12, 2024
@achave11-ucsc achave11-ucsc added spike:3 [process] Spike estimate of three points and removed spike:2 [process] Spike estimate of two points labels Jul 12, 2024
@nadove-ucsc
Contributor Author

Terra does NOT allow dots in AvroPFB property names:

(screenshot of the failed Terra import omitted)

Workspace link

I tested this using the following patch. I used an AnVIL manifest instead of HCA because it was easier to change the PFB schema.

ee02daba2 (HEAD -> issues/nadove-ucsc/6299-hca-replicas-lack-drs-uri-descriptor) drop! Test support for . in PFB handover columns
diff --git a/src/azul/plugins/metadata/anvil/__init__.py b/src/azul/plugins/metadata/anvil/__init__.py
index 77fea54c3..c7e5f451f 100644
--- a/src/azul/plugins/metadata/anvil/__init__.py
+++ b/src/azul/plugins/metadata/anvil/__init__.py
@@ -304,6 +304,13 @@ class Plugin(MetadataPlugin[AnvilBundle]):
     def verbatim_pfb_schema(self,
                             replicas: Iterable[JSON]
                             ) -> tuple[Iterable[JSON], Sequence[str], JSON]:
+
+        def insert_dot(r):
+            r = r.copy()
+            r['contents'] = {'foo.bar.' + k: v for k, v in r['contents'].items()}
+            return r
+
+        return super().verbatim_pfb_schema(map(insert_dot, replicas))
         entity_schemas = []
         entity_types = []
         for table_schema in sorted(anvil_schema['tables'], key=itemgetter('name')):

Here is a preview of the manifest that failed to import (it contains only public entities):

$ pfb show -i Desktop/anvil-manifest-d25978db-c07b-59c7-942b-398ac4130346.ad14b32b-dea0-555e-b745-8ebe5bb0958d.avro  | jq . | head -50
{
  "id": "636f608f-459e-46ec-bad2-13ad1867686a",
  "name": "anvil_file",
  "object": {
    "foo.bar.crc32": "",
    "foo.bar.data_modality": [],
    "foo.bar.datarepo_row_id": "636f608f-459e-46ec-bad2-13ad1867686a",
    "foo.bar.drs_uri": "drs://jade.datarepo-dev.broadinstitute.org/v1_790795c4-49b1-4ac8-a060-207b92ea08c5_3c387d5a-69e3-4bd4-abd5-283a09a8ad09",
    "foo.bar.file_format": ".crai",
    "foo.bar.file_id": "3c387d5a-69e3-4bd4-abd5-283a09a8ad09",
    "foo.bar.file_md5sum": "To/CtLVpMUPpulH7wLrNOA==",
    "foo.bar.file_name": "NA19239.final.cram.crai",
    "foo.bar.file_ref": "drs://jade.datarepo-dev.broadinstitute.org/v1_790795c4-49b1-4ac8-a060-207b92ea08c5_3c387d5a-69e3-4bd4-abd5-283a09a8ad09",
    "foo.bar.file_size": 1324028,
    "foo.bar.is_supplementary": false,
    "foo.bar.reference_assembly": [],
    "foo.bar.sha256": "",
    "foo.bar.source_datarepo_row_ids": [
      "file_inventory:34d54283-56b1-4d99-8994-20da842dd8cc"
    ],
    "foo.bar.version": "2022-06-01T00:00:00.000000Z"
  },
  "relations": []
}
{
  "id": "4156f936-563a-42c5-b2a2-eae115f0d067",
  "name": "anvil_activity",
...

@nadove-ucsc
Contributor Author

I have one other observation. The design of the verbatim handover says:

The .content property [of a replica] contains the verbatim metadata entity as supplied by the repository plugin.

It is not trivial to incorporate the descriptor into this design, regardless of how it is represented in the manifest. The metadata entities supplied by the repository plugin are passed to the metadata plugin via the metadata_files attribute of an HCABundle instance. As we populate this bundle, we dissolve each entity's descriptor into the bundle's synthetic manifest* entry, thus erasing the original column structure.

*The old meaning of manifest, not related to the handover

The metadata plugin could reconstruct most of the descriptor from the manifest fields, but that may result in the omission of descriptor fields that don't have an analog in the manifest entry. A bigger problem is external DRS URIs: the manifest entry's drs_uri field may come either directly from the file_id column or, if the file is "external", from a field in the descriptor column. Currently, there's no way to recover whether the DRS URI was external (and thus which replica field it belongs in) after the entity is added to the bundle.

I suggest the spike be extended to come up with a design for how to address this problem.

@nadove-ucsc nadove-ucsc removed their assignment Jul 23, 2024
@dsotirho-ucsc
Contributor

Assignee to consider next steps.

@hannes-ucsc
Member

I suggest the spike be extended to come up with a design for how to address this problem.

Agreed. I think the solution to retaining the descriptor may be to pass it along in the bundle, as an additional entry so that no reconstruction is necessary.

I still like the partial unnesting I proposed above. Could you please try a dash - as the path separator?

@dsotirho-ucsc dsotirho-ucsc added spike:8 [process] Spike estimate of eight points and removed spike:3 [process] Spike estimate of three points labels Jul 29, 2024
@nadove-ucsc
Contributor Author

Postponing this part of the spike:

I think the solution to retaining the descriptor may be to pass it along in the bundle, as an additional entry so that no reconstruction is necessary.

@nadove-ucsc
Contributor Author

Could you please try a dash - as the path separator?

Manifest using - as a path separator imported successfully.

(screenshot of the successful Terra import omitted)

@nadove-ucsc
Contributor Author

As of PR #6485, the following work remains outstanding for this issue:

Eliminate the notion of a "manifest" from HCA bundles. Replace the manifest entries with verbatim copies of file descriptors, and add a level of nesting to the "metadata" attribute so that the existing values are moved to a property called "content". For some entities, this will be the only property, but files will have two additional properties: "descriptor" and "file_id". The intent is for the properties of each value to correspond to BigQuery columns.

Once this restructuring is complete, we can include these additional properties in the replicas. The verbatim manifest generator will then partially flatten the replicas as explained #6299 (comment), using __ (two underscores) as a path separator.
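A hypothetical sketch of the inverse operation, illustrating why the `__`-separated flattening is lossless: splitting each column name on the separator reconstructs the original nesting, assuming no metadata key itself contains `__`.

```python
def unflatten(flat, sep='__'):
    """
    Rebuild a nested replica from its flattened form. Lossless as long as
    no original property name contains the separator.
    """
    nested = {}
    for path, value in flat.items():
        *parents, leaf = path.split(sep)
        node = nested
        for key in parents:
            # Create intermediate dicts as needed while walking the path.
            node = node.setdefault(key, {})
        node[leaf] = value
    return nested
```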

Addendum: the stitched attribute added in this PR could also be folded into the properties of the metadata values, acting like a pseudo-column. If this is done, then the new, refactored structure of HCA bundles will very closely resemble the current structure of AnVIL bundles. It may then be relatively low-hanging fruit to eliminate the remaining inconsistencies between the HCABundle and AnvilBundle interfaces, which could help streamline a lot of our testing code.

(copied from the PR (1) (2))

@hannes-ucsc hannes-ucsc added no demo [process] Not to be demonstrated at the end of the sprint and removed no demo [process] Not to be demonstrated at the end of the sprint labels Aug 30, 2024