Improve usability of the compact manifest for HCA #3537

theathorn · 2021-10-14T18:26:54Z

From Parisa Nejad's current metadata spreadsheet:

Possible duplicated values are color-coded on the metadata tab
Remove non-useful fields (e.g. "bundle_uuid")
Add useful fields listed in the "critical" and "missing required fields" tabs
Don't include empty columns in the manifest ("example" tab does contain empty columns, e.g. "specimen_from_organism.diseases")
Document the columns included in the manifest
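
A minimal sketch of the "don't include empty columns" point, assuming the manifest is a tab-separated file with a header row. The function name `drop_empty_columns` and the sample rows are hypothetical and not taken from Azul:

```python
import csv
import io


def drop_empty_columns(tsv_text: str) -> str:
    """Return the TSV without columns whose data cells are all empty."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if len(rows) < 2:
        # Header only (or nothing at all): there is no data to judge by.
        return tsv_text
    header, body = rows[0], rows[1:]
    # Keep a column if at least one data row has a non-blank cell in it.
    keep = [
        i for i in range(len(header))
        if any(i < len(row) and row[i].strip() for row in body)
    ]
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow([row[i] if i < len(row) else "" for i in keep])
    return out.getvalue()


# Hypothetical sample: the middle column is empty in every row.
sample = (
    "file_name\tspecimen_from_organism.diseases\tfile_format\n"
    "a.fastq.gz\t\tfastq.gz\n"
    "b.fastq.gz\t\tfastq.gz\n"
)
print(drop_empty_columns(sample))
```

Dropping columns per download means two manifests can have different headers, so consumers would need to look columns up by name rather than by position.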

Do these columns contain true duplicated values?

bundle_uuid, sequencing_process.provenance.document_id
file_document_id, file_uuid
cell_suspension.provenance.document_id, cell_suspension.biomaterial_core.biomaterial_id
donor_organism.biomaterial_core.biomaterial_id, donor_organism.provenance.document_id
specimen_from_organism.provenance.document_id, sample.provenance.document_id, sample.biomaterial_core.biomaterial_id
sequencing_input.provenance.document_id, sequencing_input.biomaterial_core.biomaterial_id
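
One way to answer this question empirically is to compare the two columns row by row in a downloaded manifest. The sketch below assumes a tab-separated manifest with a header row; `find_mismatches` is a hypothetical helper and the sample values are illustrative only:

```python
import csv
import io


def find_mismatches(tsv_text, col_a, col_b):
    """Return (line_number, value_a, value_b) for rows where the columns differ."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Line 1 is the header, so data rows start at line 2.
    return [
        (line_no, row[col_a], row[col_b])
        for line_no, row in enumerate(reader, start=2)
        if row[col_a] != row[col_b]
    ]


DOC_ID = "cell_suspension.provenance.document_id"
BIO_ID = "cell_suspension.biomaterial_core.biomaterial_id"

# Illustrative rows: the first pair differs, the second happens to match.
sample = (
    f"{DOC_ID}\t{BIO_ID}\n"
    "36a391f6-d118-4fd7-be51-9196b0f3184f\tHip9_well_4609\n"
    "abc\tabc\n"
)
print(find_mismatches(sample, DOC_ID, BIO_ID))
```

An empty result for one manifest only shows that the columns coincide in that download, not that they are duplicates by definition.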

Non-useful fields?

bundle_uuid
bundle_version
file_version

Missing critical fields:

cell_suspension.plate_based_sequencing.plate_label
cell_suspension.plate_based_sequencing.well_label
cell_suspension.plate_based_sequencing.well_quality
cell_suspension.timecourse.value
cell_suspension.timecourse.unit.text
donor_organism.gestational_age
donor_organism.is_living
donor_organism.human_specific.ethnicity.text
donor_organism.mouse_specific.strain.text
project.array_express_accessions
project.biostudies_accessions
project.dbgap_accessions
project.ega_accessions
project.geo_series_accessions
project.insdc_project_accessions
project.insdc_study_accessions
project.funders.grant_id
project.funders.organization
project_core.project_description
project_core.project_title
protocol.type.text
protocol.protocol_core.protocol_id

Missing required fields:

cell_line.type
cell_line.model_organ.text
cell_line.biomaterial_core.ncbi_taxon_id
cell_line.cell_cycle.text
cell_suspension.selected_cell_types.cell_type_ontology.text
cell_suspension.biomaterial_core.ncbi_taxon_id
donor_organism.biomaterial_core.ncbi_taxon_id
organoid.biomaterial_core.ncbi_taxon_id
specimen_from_organism.biomaterial_core.ncbi_taxon_id
analysis_file.file_core.file_name
analysis_file.file_core.format
analysis_file.file_core.content_description.text
image_file.file_core.file_name
image_file.file_core.format
image_file.file_core.content_description.text
reference_file.assembly_type
reference_file.ncbi_taxon_id
reference_file.reference_type
reference_file.reference_version
reference_file.genus_species.text
reference_file.file_core.file_name
reference_file.file_core.format
reference_file.file_core.content_description.text
sequence_file.file_core.file_name
sequence_file.file_core.format
sequence_file.file_core.content_description.text
supplementary_file.file_core.file_name
supplementary_file.file_core.format
supplementary_file.file_core.content_description.text
library_preparation_protocol.end_bias
library_preparation_protocol.strand
library_preparation_protocol.input_nucleic_acid_molecule.text
sequencing_protocol.method.text
donor_organism.death.cause_of_death
library_preparation_protocol.input_nucleic_acid_molecule.text
imaging_protocol.probe.subcellular_structure.cellular_component_ontology.text
project.contributors.project_role.text
project.contributors.name
project.publications.publication.authors
project.publications.publication.official_hca_publication
project.publications.publication.title
imaging_protocol.channel.channel_id
imaging_protocol.channel.excitation_wavelength
imaging_protocol.channel.exposure_time
imaging_protocol.channel.filter_range
imaging_protocol.channel.multiplexed
imaging_protocol.channel.target_fluorophore
imaging_protocol.probe.channel_label
imaging_protocol.probe.fluorophore
imaging_protocol.probe.probe_label
imaging_protocol.probe.probe_sequence
imaging_protocol.probe.target_codebook_label
imaging_protocol.probe.target_label
imaging_protocol.probe.target_name
imaging_protocol.probe.assay_type.text
imaging_protocol.probe.probe_reagents.catalog_number
imaging_protocol.probe.probe_reagents.expiry_date
imaging_protocol.probe.probe_reagents.kit_titer
imaging_protocol.probe.probe_reagents.lot_number
imaging_protocol.probe.probe_reagents.manufacturer
imaging_protocol.probe.probe_reagents.retail_name
imaging_protocol.probe.subcellular_structure.text

Document the columns included in the manifest:

e.g. "project.provenance.document_id" is the project UUID

Consolidating from #2781:
User Story: As an end-user that wants to use the HCA data I want to understand how fastq files are grouped together and understand the distribution of the barcodes in the files so that I can run my analysis pipelines automatically, extracting the information from the metadata:

library_preparation_protocol.cell_barcode.barcode_read
library_preparation_protocol.cell_barcode.barcode_offset
library_preparation_protocol.cell_barcode.barcode_length
library_preparation_protocol.cell_barcode.white_list_file
library_preparation_protocol.umi_barcode.barcode_read
library_preparation_protocol.umi_barcode.barcode_offset
library_preparation_protocol.umi_barcode.barcode_length
library_preparation_protocol.umi_barcode.white_list_file
sequence_file.lane_index
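
To illustrate the user story, here is a rough sketch of how a pipeline might group FASTQ files once lane and read information is available per file. The keys `sample_id`, `lane_index`, `read_index` and `file_name`, as well as the file names, are assumptions for illustration and not the actual manifest column names:

```python
from collections import defaultdict


def group_fastqs(rows):
    """Group file names by (sample, lane), keyed by read index within each group."""
    groups = defaultdict(dict)
    for row in rows:
        key = (row["sample_id"], row["lane_index"])
        groups[key][row["read_index"]] = row["file_name"]
    return dict(groups)


# Hypothetical manifest rows, already parsed into dictionaries.
rows = [
    {"sample_id": "s1", "lane_index": "1", "read_index": "read1",
     "file_name": "s1_L001_R1.fastq.gz"},
    {"sample_id": "s1", "lane_index": "1", "read_index": "read2",
     "file_name": "s1_L001_R2.fastq.gz"},
    {"sample_id": "s1", "lane_index": "2", "read_index": "read1",
     "file_name": "s1_L002_R1.fastq.gz"},
]

for key, reads in group_fastqs(rows).items():
    print(key, reads)
```

With the barcode fields listed above (read, offset, length) in the same row, the pipeline could then tell which file in each group carries the cell barcode and UMI.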


melainalegaspi · 2021-10-14T19:01:24Z

@theathorn add details from spreadsheet.

melainalegaspi · 2021-10-18T19:43:20Z

@hannes-ucsc to review description.

hannes-ucsc · 2021-10-19T00:51:15Z

Don't understand the syntax Donor organism>Mouse-specific that is used in some of the bullet points in the above description. Clarify in PL.

theathorn · 2021-10-19T20:57:44Z

I spoke with Parisa Nejad about this issue.

Donor organism>Mouse-specific refers to an entity in the metadata dictionary.

The wranglers would need to do more work to define a comprehensive list of the individual required fields.
However, Parisa thought that making the submitted metadata spreadsheet available for download (alongside the compact manifest) would allow for minimal changes to the manifest.

Why do some column titles in the manifest (e.g. file_sha256) not correspond to actual metadata schema definitions (is it the same as file_descriptor.sha256)?

Need to document the meaning of all internally generated fields (e.g. file_drs_uri) in the manifest.

dsotirho-ucsc · 2021-10-19T21:08:46Z

@theathorn Should also add submissionDate, updateDate, aggregateSubmissionDate, aggregateUpdateDate.

theathorn · 2021-10-27T19:25:50Z

The "Phase 1" plan is to submit the Ingest metadata spreadsheet to TDR with some associated metadata so that it can be presented for download in the Data Browser alongside the project (compact) manifest.

"Phase 2" is to work out the full solution (e.g. JSON files or ???).

As part of the Phase 1 work let's look at which simple set of high priority changes we can make to improve the usability of the compact manifest.

melainalegaspi · 2021-11-01T19:20:00Z

@hannes-ucsc to reply to the individual points with respect to what is feasible for Phase 1 and what needs to be postponed to Phase 2.

hannes-ucsc · 2021-11-05T04:32:32Z

Do these columns contain true duplicated values

bundle_uuid, sequencing_process.provenance.document_id

This is incidental. It correlates with Ingest's decision to define a subgraph for each sequencing process. This can change and we shouldn't depend on this incidental correspondence.

file_document_id, file_uuid

I don't think this is universally true. To be verified.

cell_suspension.provenance.document_id, cell_suspension.biomaterial_core.biomaterial_id

This is not true. The document ID is allocated by Ingest. The biomaterial_id is filled in by the wrangler. They are usually not the same. To be verified.

donor_organism.biomaterial_core.biomaterial_id, donor_organism.provenance.document_id

Same. To be verified.

specimen_from_organism.provenance.document_id, sample.provenance.document_id, sample.biomaterial_core.biomaterial_id

Same. To be verified.

sequencing_input.provenance.document_id, sequencing_input.biomaterial_core.biomaterial_id

Same. To be verified.

Non-useful fields

bundle_uuid

This field is the subgraph ID aka links_id. It is crucially important.

bundle_version

This field is the subgraph version aka links version. It is crucially important.

file_version

This is the version of the data file. It is crucially important for determining if a file changed.
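
As a sketch of why file_version matters, the following compares two manifest downloads and reports the files whose version changed. It assumes the rows have already been parsed into dictionaries carrying the file_uuid and file_version columns; the helper name `changed_files` and the sample values are made up:

```python
def changed_files(old_rows, new_rows):
    """Return the UUIDs of files present in both manifests with differing versions."""
    old_versions = {row["file_uuid"]: row["file_version"] for row in old_rows}
    return sorted(
        row["file_uuid"]
        for row in new_rows
        if row["file_uuid"] in old_versions
        and old_versions[row["file_uuid"]] != row["file_version"]
    )


# Made-up values, for illustration only.
old = [
    {"file_uuid": "uuid-a", "file_version": "2021-01-01T00:00:00Z"},
    {"file_uuid": "uuid-b", "file_version": "2021-01-01T00:00:00Z"},
]
new = [
    {"file_uuid": "uuid-a", "file_version": "2021-01-01T00:00:00Z"},
    {"file_uuid": "uuid-b", "file_version": "2021-06-01T00:00:00Z"},
    {"file_uuid": "uuid-c", "file_version": "2021-06-01T00:00:00Z"},
]
print(changed_files(old, new))
```

Without the version column, the only way to detect the change to the second file would be to compare checksums or download the file again.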

Missing critical fields

cell_suspension.plate_based_sequencing.plate_label

cell_suspension.plate_based_sequencing.well_label

cell_suspension.plate_based_sequencing.well_quality

cell_suspension.timecourse.value

cell_suspension.timecourse.unit.text

We don't index these. Would be difficult to add. Implement as part of phase II.

donor_organism.gestational_age

donor_organism.is_living

donor_organism.human_specific.ethnicity.text

donor_organism.mouse_specific.strain.text

We don't index these. Seems important enough to add for phase I.

project.array_express_accessions

project.biostudies_accessions

project.dbgap_accessions

project.ega_accessions

project.geo_series_accessions

project.insdc_project_accessions

project.insdc_study_accessions

project.funders.grant_id

project.funders.organization

project_core.project_description

project_core.project_title

Won't fix. Since these are at the project level it would be highly duplicative. Many rows, or, if the manifest is for a single project, all rows, would contain the same values. The values tend to be large, aggravating the issue. If someone wants these values, they can use the /index/project endpoint.

protocol.type.text

We don't index this. Seems important enough to add for phase I.

protocol.protocol_core.protocol_id

We index this. Include in manifest.

Missing required fields:

cell_line.type

Not sure if we index this. Add for phase I.

cell_line.model_organ.text

We index this. Add for phase I.

cell_line.biomaterial_core.ncbi_taxon_id

This field is badly modeled. A cell line shouldn't have a taxon ID. Only whole organisms (donors) are of a species and the cell line should be linked to a donor organism.

HumanCellAtlas/metadata-schema#1415

cell_line.cell_cycle.text

cell_suspension.selected_cell_types.cell_type_ontology.text

cell_suspension.biomaterial_core.ncbi_taxon_id

See cell_line.biomaterial_core.ncbi_taxon_id

donor_organism.biomaterial_core.ncbi_taxon_id

We index this. Add for phase I.

organoid.biomaterial_core.ncbi_taxon_id

See cell_line.biomaterial_core.ncbi_taxon_id

specimen_from_organism.biomaterial_core.ncbi_taxon_id

See cell_line.biomaterial_core.ncbi_taxon_id

analysis_file.file_core.file_name

analysis_file.file_core.format

analysis_file.file_core.content_description.text

image_file.file_core.file_name

image_file.file_core.format

image_file.file_core.content_description.text

reference_file.assembly_type

reference_file.ncbi_taxon_id

reference_file.reference_type

reference_file.reference_version

reference_file.genus_species.text

reference_file.file_core.file_name

reference_file.file_core.format

reference_file.file_core.content_description.text

sequence_file.file_core.file_name

sequence_file.file_core.format

sequence_file.file_core.content_description.text

supplementary_file.file_core.file_name

supplementary_file.file_core.format

supplementary_file.file_core.content_description.text

We index these and include them in the manifest, just generically as file_name, content_description and file_format. To be verified.

library_preparation_protocol.end_bias

library_preparation_protocol.strand

library_preparation_protocol.input_nucleic_acid_molecule.text

Not sure if we index these. We should. Add for phase I.

sequencing_protocol.method.text

donor_organism.death.cause_of_death

library_preparation_protocol.input_nucleic_acid_molecule.text

imaging_protocol.probe.subcellular_structure.cellular_component_ontology.text

See other imaging fields below.

project.contributors.project_role.text

project.contributors.name

project.publications.publication.authors

project.publications.publication.official_hca_publication

project.publications.publication.title

See other project fields above.

imaging_protocol.channel.channel_id

imaging_protocol.channel.excitation_wavelength

imaging_protocol.channel.exposure_time

imaging_protocol.channel.filter_range

imaging_protocol.channel.multiplexed

imaging_protocol.channel.target_fluorophore

imaging_protocol.probe.channel_label

imaging_protocol.probe.fluorophore

imaging_protocol.probe.probe_label

imaging_protocol.probe.probe_sequence

imaging_protocol.probe.target_codebook_label

imaging_protocol.probe.target_label

imaging_protocol.probe.target_name

imaging_protocol.probe.assay_type.text

imaging_protocol.probe.probe_reagents.catalog_number

imaging_protocol.probe.probe_reagents.expiry_date

imaging_protocol.probe.probe_reagents.kit_titer

imaging_protocol.probe.probe_reagents.lot_number

imaging_protocol.probe.probe_reagents.manufacturer

imaging_protocol.probe.probe_reagents.retail_name

imaging_protocol.probe.subcellular_structure.text

There are no projects with imaging data. It would be pointless to add these fields.

Document the columns included in the manifest:

e.g. "project.provenance.document_id" is the project UUID

This is self-explanatory. Every entity has .provenance.document_id. It should be clear that this field is its ID. When looking at this value, it is obvious that it is a UUID.

Consolidating from #2781:
User Story: As an end-user that wants to use the HCA data I want to understand how fastq files are grouped together and understand the distribution of the barcodes in the files so that I can run my analysis pipelines automatically, extracting the information from the metadata:

library_preparation_protocol.cell_barcode.barcode_read

library_preparation_protocol.cell_barcode.barcode_offset

library_preparation_protocol.cell_barcode.barcode_length

library_preparation_protocol.cell_barcode.white_list_file

library_preparation_protocol.umi_barcode.barcode_read

library_preparation_protocol.umi_barcode.barcode_offset

library_preparation_protocol.umi_barcode.barcode_length

library_preparation_protocol.umi_barcode.white_list_file

sequence_file.lane_index

I believe we already index this and put it in the manifest. Just as a generically named column called file_read_index. To be verified.

hannes-ucsc · 2021-11-05T04:33:04Z

Spike again to verify my claims (look for "To be verified").

dsotirho-ucsc · 2021-11-22T19:24:02Z

Legend:

✅ = Comment was verified
❌ = Comment needs correction
⚠️ = Action item

Do these columns contain true duplicated values

file_document_id, file_uuid

I don't think this is universally true. To be verified.

✅ file_document_id is .provenance.document_id from the metadata, file_uuid is .file_id from the descriptor.

cell_suspension.provenance.document_id, cell_suspension.biomaterial_core.biomaterial_id

This is not true. The document ID is allocated by Ingest. The biomaterial_id is filled in by the wrangler. They are usually not the same. To be verified.

✅ .provenance.document_id is an identifier for the document (e.g. 36a391f6-d118-4fd7-be51-9196b0f3184f); .biomaterial_core.biomaterial_id is a unique ID for the biomaterial (e.g. Hip9_well_4609).

donor_organism.biomaterial_core.biomaterial_id, donor_organism.provenance.document_id

Same. To be verified.

✅

specimen_from_organism.provenance.document_id, sample.provenance.document_id, sample.biomaterial_core.biomaterial_id

Same. To be verified.

✅ Caveat with "sample": Since "sample" is a virtual type assigned to downstream biomaterials (specimen, organoid, or cell line) at index time, the document_id of the sample may or may not be the same as the document_id of the specimen (or organoid, or cell line).

sequencing_input.provenance.document_id, sequencing_input.biomaterial_core.biomaterial_id

Same. To be verified.

✅

Missing critical fields

cell_suspension.plate_based_sequencing.plate_label

cell_suspension.plate_based_sequencing.well_label

cell_suspension.plate_based_sequencing.well_quality

cell_suspension.timecourse.value

cell_suspension.timecourse.unit.text

We don't index these. Would be difficult to add. Implement as part of phase II.

⚠️ Index and add fields to compact manifest.

#3846
#3847

donor_organism.gestational_age

donor_organism.is_living

donor_organism.human_specific.ethnicity.text

donor_organism.mouse_specific.strain.text

We don't index these. Seems important enough to add for phase I.

⚠️ Index and add fields to compact manifest.

#3849

protocol.type.text

We don't index this. Seems important enough to add for phase I.

⚠️ Index and add field to compact manifest.

#3850

protocol.protocol_core.protocol_id

We index this. Include in manifest.

✅ Note: Field is currently only indexed for analysis_protocol documents.
⚠️ Potentially index the field for all protocol types and add them to the compact manifest.

#3851

Missing required fields:

cell_line.type

Not sure if we index this. Add for phase I.

✅ Field is currently indexed.
⚠️ Add field to compact manifest.

#3852

cell_line.model_organ.text

We index this. Add for phase I.

✅ Note: With ontologies we index ontology_label value.
⚠️ Add field to compact manifest.

#3853

donor_organism.biomaterial_core.ncbi_taxon_id

We index this. Add for phase I.

❌ We do not currently index the ncbi_taxon_id from any biomaterial.
⚠️ Index and add field to compact manifest.

#3854

analysis_file.file_core.file_name

analysis_file.file_core.format

image_file.file_core.file_name

image_file.file_core.format

reference_file.file_core.file_name

reference_file.file_core.format

sequence_file.file_core.file_name

sequence_file.file_core.format

supplementary_file.file_core.file_name

supplementary_file.file_core.format

We index these and include them in the manifest, just generically as file_name, content_description and file_format. To be verified.

✅
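
For readers mapping schema paths to manifest columns, the correspondence verified above could be expressed roughly as follows. This is an illustrative sketch only: `generic_column` is a hypothetical helper, not Azul code, and it covers just the file_name and format properties:

```python
import re

# Matches e.g. "sequence_file.file_core.file_name" for any "*_file" entity type.
_FILE_CORE = re.compile(r"^(\w+_file)\.file_core\.(file_name|format)$")


def generic_column(schema_path):
    """Return the generic manifest column for a file_core path, or None."""
    match = _FILE_CORE.match(schema_path)
    if match is None:
        return None
    prop = match.group(2)
    return "file_name" if prop == "file_name" else "file_format"


for path in (
    "sequence_file.file_core.file_name",
    "analysis_file.file_core.format",
    "donor_organism.is_living",
):
    print(path, "->", generic_column(path))
```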

analysis_file.file_core.content_description.text

image_file.file_core.content_description.text

reference_file.file_core.content_description.text

sequence_file.file_core.content_description.text

supplementary_file.file_core.content_description.text

⚠️ Add fields to compact manifest.

#3855

reference_file.assembly_type

reference_file.ncbi_taxon_id

reference_file.reference_type

reference_file.reference_version

reference_file.genus_species.text

⚠️ Index and add fields to compact manifest.

#3856

library_preparation_protocol.end_bias

library_preparation_protocol.strand

library_preparation_protocol.input_nucleic_acid_molecule.text

Not sure if we index these. We should. Add for phase I.

⚠️ Index and add fields to compact manifest.

#3857

sequence_file.lane_index

I believe we already index this and put it in the manifest. Just as a generically named column called file_read_index. To be verified.

❌ Field is not currently in the compact manifest.
⚠️ Add field to compact manifest.

#3858

dsotirho-ucsc · 2021-11-24T19:38:56Z

@hannes-ucsc : "I will review Daniel's findings and come up with a breakdown of tickets, and make this an epic of those tickets."

hannes-ucsc · 2022-04-20T17:59:11Z

De-prioritizing given the changed priorities WRT AnVIL.

hannes-ucsc · 2024-04-04T17:51:33Z

HCA has an order of magnitude more metadata properties than AnVIL. Due to the sheer number of properties that are being requested for inclusion, this can't be a use-case for the compact manifest. The verbatim JSONL manifest (part of epic #6121) we're working on will ultimately include all properties of all entities.

We should take another look at determining a curated list of important properties.
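
As a rough sketch of how a curated property list could be applied on the consumer side, the following pulls a few dotted paths out of JSON Lines input. The layout of the real verbatim manifest is not described here, so the record structure, the helper names `get_path` and `curate`, and the sample values are all assumptions:

```python
import json


def get_path(doc, dotted):
    """Follow a dotted path through nested dicts; return None if any part is missing."""
    current = doc
    for part in dotted.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def curate(jsonl_text, paths):
    """Return one dict per non-blank JSONL line, restricted to the given paths."""
    return [
        {path: get_path(json.loads(line), path) for path in paths}
        for line in jsonl_text.splitlines()
        if line.strip()
    ]


# Assumed record layout and made-up values, for illustration only.
sample = "\n".join([
    json.dumps({"donor_organism": {"is_living": "yes",
                                   "biomaterial_core": {"ncbi_taxon_id": [9606]}}}),
    json.dumps({"donor_organism": {"is_living": "no"}}),
])
curated = curate(sample, [
    "donor_organism.is_living",
    "donor_organism.biomaterial_core.ncbi_taxon_id",
])
print(curated)
```

Missing properties come back as None, which keeps every output record the same shape regardless of which properties a given entity carries.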

bvizzier-ucsc · 2024-05-01T19:15:51Z

I haven't heard this discussed recently within the HCA team. As such, I'm going to assign it a low priority.

github-actions bot added the orange [process] Done by the Azul team label Oct 14, 2021

theathorn added code [subject] Production code enh [type] New feature or request labels Oct 14, 2021

melainalegaspi assigned theathorn Oct 14, 2021

melainalegaspi added the spike:1 [process] Spike estimate of one point label Oct 14, 2021

melainalegaspi assigned hannes-ucsc and unassigned theathorn Oct 18, 2021

hannes-ucsc removed their assignment Oct 19, 2021

melainalegaspi assigned theathorn Oct 19, 2021

theathorn mentioned this issue Oct 25, 2021

Add lane index and barcode fields to all manifests except curl #2781

Closed

theathorn removed their assignment Oct 27, 2021

melainalegaspi assigned theathorn Oct 28, 2021

melainalegaspi assigned hannes-ucsc and unassigned theathorn Nov 1, 2021

hannes-ucsc removed their assignment Nov 5, 2021

melainalegaspi added spike:3 [process] Spike estimate of three points and removed spike:1 [process] Spike estimate of one point labels Nov 5, 2021

melainalegaspi assigned dsotirho-ucsc Nov 5, 2021

dsotirho-ucsc assigned hannes-ucsc and unassigned dsotirho-ucsc Nov 24, 2021

hannes-ucsc added the epic [type] Issue consists of multiple smaller issues label Feb 11, 2022

hannes-ucsc removed their assignment Feb 11, 2022

dsotirho-ucsc mentioned this issue Mar 22, 2022

Remove sequencing processes from index #3972

Open

hannes-ucsc changed the title ~~Improve usability of the compact manifest~~ Improve usability of the compact manifest for HCA Apr 4, 2024

bvizzier-ucsc added the -- [priority] Low label May 1, 2024
