HCA replicas lack DRS URI and descriptor #6299
Assignee to add reproduction using JSONL manifest, and BigQuery output from files table for reference.
Here is an example of file metadata containing a DRS URI and descriptor, for this file:
With serialized JSON strings expanded in the
The JSONL manifest from filtering for only that file:
@hannes-ucsc: "Depending on the outcome of the on-going Slack discussion with the Terra import team, we could either produce just the
Assignee to consider next steps.
The issue is that we currently only include the value of the
The completely flattened approach (
A partially flattened approach seems like the best compromise. Knowing the HCA schema, it appears that a flattening depth (number of dots in the column name) of 2 would be most useful. To avoid the array complexities described above, the flattening would have to be shallower when it encounters an array value. The resulting replica entity in the PFB would look as follows:
This approach is bijective (aka lossless aka reversible), doesn't lead to an excessive number of columns, solves this issue, and doesn't require inventing column names to avoid naming conflicts.
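The partial flattening described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the Azul implementation: the function names `partially_flatten` and `unflatten`, the property names in the sample replica, and the DRS URI are all hypothetical. The sketch descends into nested objects until a column name would exceed two dots, and never descends into arrays.

```python
from typing import Any

JSON = dict[str, Any]


def partially_flatten(entity: JSON, max_dots: int = 2, sep: str = '.') -> JSON:
    """
    Flatten nested objects into dotted column names that contain at most
    `max_dots` separators. Arrays, scalars, empty objects, and anything
    nested deeper than the limit are kept as values.
    """
    result: JSON = {}

    def visit(column: str, value: Any, dots: int) -> None:
        if isinstance(value, dict) and value and dots < max_dots:
            for key, child in value.items():
                if sep in key:
                    raise ValueError(f'Separator not allowed in key {key!r}')
                visit(column + sep + key, child, dots + 1)
        else:
            # Arrays end up here, so flattening stops early at an array
            result[column] = value

    for key, value in entity.items():
        if sep in key:
            raise ValueError(f'Separator not allowed in key {key!r}')
        visit(key, value, 0)
    return result


def unflatten(flat: JSON, sep: str = '.') -> JSON:
    """
    Invert partially_flatten() by splitting each column name on the separator.
    """
    result: JSON = {}
    for column, value in flat.items():
        *parents, leaf = column.split(sep)
        node = result
        for parent in parents:
            node = node.setdefault(parent, {})
        node[leaf] = value
    return result


# Hypothetical replica; the property names and the URI are made up
replica = {
    'content': {
        'file_core': {'file_name': 'a.fastq.gz', 'format': {'text': 'fastq.gz'}},
        'insdc_run_accessions': ['SRR0000001'],
    },
    'descriptor': {'drs_uri': 'drs://example.org/v1_abc', 'size': 123},
}
flat = partially_flatten(replica)
```

The round trip through `unflatten` is only lossless as long as no original property name contains the separator, which is why the sketch rejects such names instead of escaping them.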
Spike to review above comment and determine if dots are allowed in AvroPFB property names, and if those are retained on import, resulting in dots in workspace table column names.
Terra does NOT allow dots in AvroPFB property names. I tested this using the following patch. I used an AnVIL manifest instead of HCA because it was easier to change the PFB schema.
ee02daba2 (HEAD -> issues/nadove-ucsc/6299-hca-replicas-lack-drs-uri-descriptor) drop! Test support for . in PFB handover columns
diff --git a/src/azul/plugins/metadata/anvil/__init__.py b/src/azul/plugins/metadata/anvil/__init__.py
index 77fea54c3..c7e5f451f 100644
--- a/src/azul/plugins/metadata/anvil/__init__.py
+++ b/src/azul/plugins/metadata/anvil/__init__.py
@@ -304,6 +304,13 @@ class Plugin(MetadataPlugin[AnvilBundle]):
     def verbatim_pfb_schema(self,
                             replicas: Iterable[JSON]
                             ) -> tuple[Iterable[JSON], Sequence[str], JSON]:
+
+        def insert_dot(r):
+            r = r.copy()
+            r['contents'] = {'foo.bar.' + k: v for k, v in r['contents'].items()}
+            return r
+
+        return super().verbatim_pfb_schema(map(insert_dot, replicas))
         entity_schemas = []
         entity_types = []
         for table_schema in sorted(anvil_schema['tables'], key=itemgetter('name')):
Here is a preview of the manifest that failed to import (it contains only public entities):
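For reference, the renaming applied by the patch can be reproduced in isolation and checked against the name rule from the Avro specification, which only permits letters, digits, and underscores. This is a sketch, not the failed manifest: the sample replica is hypothetical, `invalid_names` is a helper introduced here, and it is an assumption that the Terra import enforces exactly this rule.

```python
import re

# Name rule from the Avro specification: a letter or underscore, followed
# only by letters, digits, or underscores
AVRO_NAME = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')


def insert_dot(replica: dict) -> dict:
    # Same renaming as in the patch: prefix every property of the replica's
    # contents with a dotted path
    replica = replica.copy()
    replica['contents'] = {'foo.bar.' + k: v for k, v in replica['contents'].items()}
    return replica


def invalid_names(replica: dict) -> list[str]:
    # Hypothetical helper: property names that violate the Avro name rule
    return sorted(k for k in replica['contents'] if AVRO_NAME.fullmatch(k) is None)


# Hypothetical replica, not taken from the manifest that failed to import
original = {'replica_type': 'anvil_file', 'contents': {'file_id': 'x', 'file_size': 1}}
dotted = insert_dot(original)
```

With the prefix in place, every property name in `dotted` contains a dot and therefore fails the check, while the names in `original` pass.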
I have one other observation. The design of the verbatim handover says:
It is not trivial to incorporate the descriptor into this design, regardless of how it is represented in the manifest. The metadata entities supplied by the repository plugin are passed to the metadata plugin via the
*The old meaning of manifest, not related to the handover
The metadata plugin could reconstruct most of the descriptor from the manifest fields, but that may result in the omission of descriptor fields that don't have an analog in the manifest entry. A bigger problem is external DRS URIs: the manifest entry's
I suggest the spike be extended to come up with a design for how to address this problem.
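The omission risk mentioned above can be made concrete with a small sketch. Everything in it is an assumption for illustration: the mapping `MANIFEST_TO_DESCRIPTOR`, the helper names, and the field names and values are hypothetical and not taken from Azul or the HCA schema.

```python
# Hypothetical mapping from manifest entry fields to descriptor fields
MANIFEST_TO_DESCRIPTOR = {
    'name': 'file_name',
    'uuid': 'file_id',
    'version': 'file_version',
    'content-type': 'content_type',
    'size': 'size',
    'sha256': 'sha256',
}


def to_manifest_entry(descriptor: dict) -> dict:
    # Keep only the descriptor fields that have a manifest analog
    return {
        manifest_field: descriptor[descriptor_field]
        for manifest_field, descriptor_field in MANIFEST_TO_DESCRIPTOR.items()
        if descriptor_field in descriptor
    }


def reconstruct_descriptor(manifest_entry: dict) -> dict:
    # Rebuild whatever part of the descriptor the manifest entry still carries
    return {
        MANIFEST_TO_DESCRIPTOR[field]: value
        for field, value in manifest_entry.items()
        if field in MANIFEST_TO_DESCRIPTOR
    }


def lost_fields(descriptor: dict) -> set[str]:
    # Descriptor fields that do not survive the round trip
    reconstructed = reconstruct_descriptor(to_manifest_entry(descriptor))
    return set(descriptor) - set(reconstructed)


# Hypothetical descriptor; all values are placeholders
descriptor = {
    'describedBy': 'https://example.org/file_descriptor.json',
    'schema_type': 'file_descriptor',
    'file_name': 'a.fastq.gz',
    'file_id': '00000000-0000-0000-0000-000000000000',
    'file_version': '2020-01-01T00:00:00.000000Z',
    'content_type': 'application/gzip',
    'size': 123,
    'sha256': 'abc',
}
```

In this sketch `lost_fields(descriptor)` yields the two fields that have no manifest analog, which is the kind of loss a verbatim copy of the descriptor would avoid.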
Assignee to consider next steps.
Agreed. I think the solution to retaining the descriptor may be to pass it along in the bundle, as an additional entry so that no reconstruction is necessary. I still like the partial unnesting I proposed above. Could you please try a dash
Postponing this part of the spike:
As of PR #6485, the following work remains outstanding for this issue:
Eliminate the notion of a "manifest" from HCA bundles. Replace the manifest entries with verbatim copies of file descriptors, and add a level of nesting to the "metadata" attribute so that the existing values are moved to a property called "content". For some entities, this will be the only property, but files will have two additional properties: "descriptor" and "file_id". The intent is for the properties of each value to correspond to BigQuery columns.
Once this restructuring is complete, we can include these additional properties in the replicas. The verbatim manifest generator will then partially flatten the replicas as explained in #6299 (comment), using
Addendum: the
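The bundle restructuring outlined in the remaining work could look roughly like the following. This is a sketch under assumptions: the function `restructure_metadata`, its parameters, the entity keys, and all sample values are hypothetical. Only the property names "content", "descriptor", and "file_id" come from the description above.

```python
from typing import Any

JSON = dict[str, Any]


def restructure_metadata(metadata: dict[str, JSON],
                         descriptors: dict[str, JSON],
                         file_ids: dict[str, str]
                         ) -> dict[str, JSON]:
    """
    Move each entity's existing value under "content". For file entities,
    add the verbatim descriptor and the file ID as sibling properties.
    """
    result = {}
    for key, content in metadata.items():
        entity: JSON = {'content': content}
        if key in descriptors:
            # The descriptor is passed through unchanged, not reconstructed
            entity['descriptor'] = descriptors[key]
            entity['file_id'] = file_ids[key]
        result[key] = entity
    return result


# Hypothetical bundle contents; keys and values are placeholders
metadata = {
    'project_0.json': {'project_core': {'project_title': 'Example'}},
    'sequence_file_0.json': {'file_core': {'file_name': 'a.fastq.gz'}},
}
descriptors = {
    'sequence_file_0.json': {'file_name': 'a.fastq.gz', 'size': 123},
}
file_ids = {
    'sequence_file_0.json': 'drs://example.org/v1_abc',
}
restructured = restructure_metadata(metadata, descriptors, file_ids)
```

Non-file entities end up with "content" as their only property, while file entities carry all three, so each top-level property of an entity can line up with one column.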