Improve usability of the compact manifest for HCA #3537
Comments
@theathorn add details from spreadsheet. |
@hannes-ucsc to review description. |
Don't understand the syntax |
I spoke with Parisa Nejad about this issue.
The wranglers would need to do more work to define a comprehensive list of the individual required fields. Why do some column titles in the manifest (e.g. Need to document the meaning of all internally generated fields (e.g. |
@theathorn Should also add |
The "Phase 1" plan is to submit the Ingest metadata spreadsheet to TDR with some associated metadata so that it can be presented for download in the Data Browser alongside the project (compact) manifest. "Phase 2" is to work out the full solution (e.g. JSON files or ???). As part of the Phase 1 work let's look at which simple set of high priority changes we can make to improve the usability of the compact manifest. |
@hannes-ucsc to reply to the individual points with respect to what is feasible for Phase 1 and what needs to be postponed to Phase 2. |
This is incidental. It correlates with Ingest's decision to define a subgraph for each sequencing process. This can change and we shouldn't depend on this incidental correspondence.
I don't think this is universally true. To be verified.
This is not true. The document ID is allocated by Ingest. The biomaterial_id is filled in by the wrangler. They are usually not the same. To be verified.
Same. To be verified.
Same. To be verified.
Same. To be verified.
This field is the subgraph ID aka links_id. It is crucially important.
This field is the subgraph version aka links version. It is crucially important.
Is the version of the data file. It is crucially important for determining if a file changed.
We don't index these. Would be difficult to add. Implement as part of phase II.
We don't index these. Seems important enough to add for phase I.
Won't fix. Since these are at the project level it would be highly duplicative. Many rows, or, if the manifest is for a single project, all rows, would contain the same values. The values tend to be large, aggravating the issue. If someone wants these values, they can use the /index/project endpoint.
We don't index this. Seems important enough to add for phase I.
We index this. Include in manifest.
Not sure if we index this. Add for phase I.
We index this. Add for phase I.
This field is badly modeled. A cell line shouldn't have a taxon ID. Only whole organisms (donors) are of a species and the cell line should be linked to a donor organism. HumanCellAtlas/metadata-schema#1415
See cell_line.biomaterial_core.ncbi_taxon_id
We index this. Add for phase I.
See cell_line.biomaterial_core.ncbi_taxon_id
See cell_line.biomaterial_core.ncbi_taxon_id
We index these and include them in the manifest, just generically as file_name, content_description and file_format. To be verified.
Not sure if we index these. We should. Add for phase I.
See other imaging fields below.
See other project fields above.
There are no projects with imaging data. It would be pointless to add these fields.
This is self-explanatory. Every entity has .provenance.document_id. It should be clear that this field is its ID. When looking at this value, it is obvious that it is a UUID.
I believe we already index this and put it in the manifest, just as a generically named column, file_read_index. To be verified. |
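To illustrate why the file version matters for determining whether a file changed, here is a minimal sketch that compares two manifest snapshots. The column names `file_uuid` and `file_version` are assumptions made for this example and may not match the actual manifest headers.

```python
def changed_files(old_rows, new_rows):
    """Return the UUIDs of files whose version differs between two snapshots.

    Each row is a dict with the (assumed) keys "file_uuid" and "file_version".
    Files present in only one snapshot are ignored here.
    """
    old_versions = {row["file_uuid"]: row["file_version"] for row in old_rows}
    changed = []
    for row in new_rows:
        uuid = row["file_uuid"]
        if uuid in old_versions and old_versions[uuid] != row["file_version"]:
            changed.append(uuid)
    return sorted(changed)


# Made-up example data, not taken from a real manifest.
old_snapshot = [
    {"file_uuid": "aaa", "file_version": "2021-01-01T00:00:00.000000Z"},
    {"file_uuid": "bbb", "file_version": "2021-01-01T00:00:00.000000Z"},
]
new_snapshot = [
    {"file_uuid": "aaa", "file_version": "2021-01-01T00:00:00.000000Z"},
    {"file_uuid": "bbb", "file_version": "2021-06-01T00:00:00.000000Z"},
]
changed = changed_files(old_snapshot, new_snapshot)
```

Without the version column, the two snapshots above would be indistinguishable for file `bbb`.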
Spike again to verify my claims (look for "To be verified"). |
Legend:
✅
✅
✅
✅ Caveat with "sample": Since "sample" is a virtual type assigned to downstream biomaterials (specimen, organoid, or cell line) during index time, the
✅
✅ Note: Field is currently only indexed for
✅ Field is currently indexed.
✅ Note: With ontologies we index
❌ We do not currently index the
✅
❌ Field is not currently in the compact manifest. |
@hannes-ucsc : "I will review Daniel's findings and come up with a breakdown of tickets, and make this an epic of those tickets." |
De-prioritizing given the changed priorities WRT AnVIL. |
Unlike AnVIL, HCA has an order of magnitude more metadata properties. Due to the sheer number of properties that are being requested for inclusion, this can't be a use-case for the compact manifest. The verbatim JSONL manifest (part of epic #6121) we're working on will ultimately include all properties of all entities. We should take another look at determining a curated list of important properties. |
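As a rough sketch of how a curated list of properties could be applied to a JSONL manifest, the snippet below projects each record onto a set of dotted property paths. The record layout (one nested JSON object per line, keyed by entity type) is purely an assumption for illustration; the actual verbatim manifest format may differ.

```python
import json


def get_path(record, dotted_path):
    """Follow a dotted path through nested dicts; return None if any part is missing."""
    value = record
    for part in dotted_path.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def project(jsonl_text, curated_paths):
    """Reduce each JSONL record to the curated property paths only."""
    rows = []
    for line in jsonl_text.splitlines():
        if line.strip():
            record = json.loads(line)
            rows.append({path: get_path(record, path) for path in curated_paths})
    return rows


# Hypothetical curated list and hypothetical record shape.
curated = [
    "sequence_file.lane_index",
    "library_preparation_protocol.umi_barcode.barcode_length",
]
example = '{"sequence_file": {"lane_index": 1, "read_index": "read1"}}\n'
projected = project(example, curated)
```

Properties missing from a record come back as `None`, so the output keeps the same columns for every row.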
I haven't heard this discussed recently within the HCA team. As such, I'm going to assign it a low priority. |
From Parisa Nejad's current metadata spreadsheet:
Do these columns contain true duplicated values?
Non-useful fields?
Missing critical fields:
Missing required fields:
Document the columns included in the manifest:
Consolidating from #2781:
User Story: As an end-user who wants to use the HCA data, I want to understand how fastq files are grouped together and how the barcodes are distributed in the files, so that I can run my analysis pipelines automatically by extracting this information from the metadata:
library_preparation_protocol.cell_barcode.barcode_read
library_preparation_protocol.cell_barcode.barcode_offset
library_preparation_protocol.cell_barcode.barcode_length
library_preparation_protocol.cell_barcode.white_list_file
library_preparation_protocol.umi_barcode.barcode_read
library_preparation_protocol.umi_barcode.barcode_offset
library_preparation_protocol.umi_barcode.barcode_length
library_preparation_protocol.umi_barcode.white_list_file
sequence_file.lane_index
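The following is a minimal sketch of what this user story might look like in practice, assuming the fields above were available as manifest columns. The column names `links_id`, `lane_index`, `file_read_index` and `file_name`, the tab-separated layout, and the barcode coordinates are assumptions for illustration. The barcode offset is also assumed to be 0-based.

```python
import csv
import io
from collections import defaultdict


def group_fastqs(manifest_tsv):
    """Group FASTQ file names by (links_id, lane_index), keyed by read index."""
    groups = defaultdict(dict)
    reader = csv.DictReader(io.StringIO(manifest_tsv), delimiter="\t")
    for row in reader:
        key = (row["links_id"], row["lane_index"])
        groups[key][row["file_read_index"]] = row["file_name"]
    return dict(groups)


def extract_barcode(read_sequence, barcode_offset, barcode_length):
    """Slice a barcode out of a read, assuming a 0-based offset."""
    return read_sequence[barcode_offset:barcode_offset + barcode_length]


# Made-up manifest rows, not taken from a real project.
sample = (
    "links_id\tlane_index\tfile_read_index\tfile_name\n"
    "s1\t1\tread1\ta_L001_R1.fastq.gz\n"
    "s1\t1\tread2\ta_L001_R2.fastq.gz\n"
    "s1\t2\tread1\ta_L002_R1.fastq.gz\n"
)
grouped = group_fastqs(sample)

# Illustrative coordinates: a 16-base cell barcode followed by a 10-base UMI.
read = "ACGTACGTACGTACGT" + "GGGGGGGGGG"
cell_barcode = extract_barcode(read, 0, 16)
umi = extract_barcode(read, 16, 10)
```

Grouping on the subgraph ID together with the lane index keeps the reads of one sequencing run together, which is what a pipeline would need to pair the files automatically.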