Improve usability of the compact manifest for HCA #3537

Open
theathorn opened this issue Oct 14, 2021 · 14 comments
Labels
-- [priority] Low; code [subject] Production code; enh [type] New feature or request; epic [type] Issue consists of multiple smaller issues; orange [process] Done by the Azul team; spike:3 [process] Spike estimate of three points

Comments

@theathorn

theathorn commented Oct 14, 2021

From Parisa Nejad's current metadata spreadsheet:

  • Possible duplicated values are color-coded on the metadata tab
  • Remove non-useful fields (e.g. "bundle_uuid")
  • Add useful fields listed in the "critical" and "missing required fields" tabs
  • Don't include empty columns in the manifest ("example" tab does contain empty columns, e.g. "specimen_from_organism.diseases")
  • Document the columns included in the manifest
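
The "don't include empty columns" request can be illustrated with a minimal Python sketch. It assumes the manifest is tab-separated text with a header row; the function name `drop_empty_columns` and the sample rows are hypothetical and are not taken from the Azul code base.

```python
import csv
import io


def drop_empty_columns(tsv_text: str) -> str:
    """Return the TSV with every column removed whose data cells are all blank.

    Assumption: the first row is the header. With no data rows, every
    column counts as empty and is dropped.
    """
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return tsv_text
    header, body = rows[0], rows[1:]
    # Keep a column if at least one data row has a non-blank value in it.
    keep = [
        i for i in range(len(header))
        if any(i < len(row) and row[i].strip() for row in body)
    ]
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow([row[i] if i < len(row) else "" for i in keep])
    return out.getvalue()


# Made-up sample: the middle column is empty in every data row.
sample = (
    "file_name\tspecimen_from_organism.diseases\tfile_format\n"
    "a.fastq.gz\t\tfastq.gz\n"
    "b.fastq.gz\t\tfastq.gz\n"
)
print(drop_empty_columns(sample))
```

A drawback of this approach is that the set of columns then varies from one manifest to the next, which may matter to consumers that expect a fixed header.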

Do these columns contain true duplicated values?

  • bundle_uuid, sequencing_process.provenance.document_id
  • file_document_id, file_uuid
  • cell_suspension.provenance.document_id, cell_suspension.biomaterial_core.biomaterial_id
  • donor_organism.biomaterial_core.biomaterial_id, donor_organism.provenance.document_id
  • specimen_from_organism.provenance.document_id, sample.provenance.document_id, sample.biomaterial_core.biomaterial_id
  • sequencing_input.provenance.document_id, sequencing_input.biomaterial_core.biomaterial_id
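
One way to test such a pair against a downloaded manifest is sketched below. It assumes a tab-separated manifest with a header row; `columns_always_equal` is a hypothetical helper and the sample values are made up.

```python
import csv
import io


def columns_always_equal(tsv_text: str, first: str, second: str) -> bool:
    """True if both named columns hold the same value in every data row.

    Raises KeyError if either column is missing from the header.
    A manifest without data rows yields True (nothing contradicts it).
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return all(row[first] == row[second] for row in reader)


# Made-up values, only to show the call.
sample = (
    "bundle_uuid\tsequencing_process.provenance.document_id\tfile_uuid\n"
    "aaa\taaa\tx1\n"
    "bbb\tbbb\tx2\n"
)
print(columns_always_equal(
    sample, "bundle_uuid", "sequencing_process.provenance.document_id"))
print(columns_always_equal(sample, "bundle_uuid", "file_uuid"))
```

A match in one manifest only shows that the columns agree in that file; it does not show that they are equivalent by definition.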

Non-useful fields?

  • bundle_uuid
  • bundle_version
  • file_version

Missing critical fields:

  • cell_suspension.plate_based_sequencing.plate_label
  • cell_suspension.plate_based_sequencing.well_label
  • cell_suspension.plate_based_sequencing.well_quality
  • cell_suspension.timecourse.value
  • cell_suspension.timecourse.unit.text
  • donor_organism.gestational_age
  • donor_organism.is_living
  • donor_organism.human_specific.ethnicity.text
  • donor_organism.mouse_specific.strain.text
  • project.array_express_accessions
  • project.biostudies_accessions
  • project.dbgap_accessions
  • project.ega_accessions
  • project.geo_series_accessions
  • project.insdc_project_accessions
  • project.insdc_study_accessions
  • project.funders.grant_id
  • project.funders.organization
  • project_core.project_description
  • project_core.project_title
  • protocol.type.text
  • protocol.protocol_core.protocol_id

Missing required fields:

  • cell_line.type
  • cell_line.model_organ.text
  • cell_line.biomaterial_core.ncbi_taxon_id
  • cell_line.cell_cycle.text
  • cell_suspension.selected_cell_types.cell_type_ontology.text
  • cell_suspension.biomaterial_core.ncbi_taxon_id
  • donor_organism.biomaterial_core.ncbi_taxon_id
  • organoid.biomaterial_core.ncbi_taxon_id
  • specimen_from_organism.biomaterial_core.ncbi_taxon_id
  • analysis_file.file_core.file_name
  • analysis_file.file_core.format
  • analysis_file.file_core.content_description.text
  • image_file.file_core.file_name
  • image_file.file_core.format
  • image_file.file_core.content_description.text
  • reference_file.assembly_type
  • reference_file.ncbi_taxon_id
  • reference_file.reference_type
  • reference_file.reference_version
  • reference_file.genus_species.text
  • reference_file.file_core.file_name
  • reference_file.file_core.format
  • reference_file.file_core.content_description.text
  • sequence_file.file_core.file_name
  • sequence_file.file_core.format
  • sequence_file.file_core.content_description.text
  • supplementary_file.file_core.file_name
  • supplementary_file.file_core.format
  • supplementary_file.file_core.content_description.text
  • library_preparation_protocol.end_bias
  • library_preparation_protocol.strand
  • library_preparation_protocol.input_nucleic_acid_molecule.text
  • sequencing_protocol.method.text
  • donor_organism.death.cause_of_death
  • library_preparation_protocol.input_nucleic_acid_molecule.text
  • imaging_protocol.probe.subcellular_structure.cellular_component_ontology.text
  • project.contributors.project_role.text
  • project.contributors.name
  • project.publications.publication.authors
  • project.publications.publication.official_hca_publication
  • project.publications.publication.title
  • imaging_protocol.channel.channel_id
  • imaging_protocol.channel.excitation_wavelength
  • imaging_protocol.channel.exposure_time
  • imaging_protocol.channel.filter_range
  • imaging_protocol.channel.multiplexed
  • imaging_protocol.channel.target_fluorophore
  • imaging_protocol.probe.channel_label
  • imaging_protocol.probe.fluorophore
  • imaging_protocol.probe.probe_label
  • imaging_protocol.probe.probe_sequence
  • imaging_protocol.probe.target_codebook_label
  • imaging_protocol.probe.target_label
  • imaging_protocol.probe.target_name
  • imaging_protocol.probe.assay_type.text
  • imaging_protocol.probe.probe_reagents.catalog_number
  • imaging_protocol.probe.probe_reagents.expiry_date
  • imaging_protocol.probe.probe_reagents.kit_titer
  • imaging_protocol.probe.probe_reagents.lot_number
  • imaging_protocol.probe.probe_reagents.manufacturer
  • imaging_protocol.probe.probe_reagents.retail_name
  • imaging_protocol.probe.subcellular_structure.text

Document the columns included in the manifest:

  • e.g. "project.provenance.document_id" is the project UUID

Consolidating from #2781:
User Story: As an end-user that wants to use the HCA data I want to understand how fastq files are grouped together and understand the distribution of the barcodes in the files so that I can run my analysis pipelines automatically, extracting the information from the metadata:

  • library_preparation_protocol.cell_barcode.barcode_read
  • library_preparation_protocol.cell_barcode.barcode_offset
  • library_preparation_protocol.cell_barcode.barcode_length
  • library_preparation_protocol.cell_barcode.white_list_file
  • library_preparation_protocol.umi_barcode.barcode_read
  • library_preparation_protocol.umi_barcode.barcode_offset
  • library_preparation_protocol.umi_barcode.barcode_length
  • library_preparation_protocol.umi_barcode.white_list_file
  • sequence_file.lane_index
@github-actions github-actions bot added the orange [process] Done by the Azul team label Oct 14, 2021
@theathorn theathorn added code [subject] Production code enh [type] New feature or request labels Oct 14, 2021
@melainalegaspi melainalegaspi added the spike:1 [process] Spike estimate of one point label Oct 14, 2021
@melainalegaspi

@theathorn add details from spreadsheet.

@melainalegaspi

@hannes-ucsc to review description.

@hannes-ucsc
Member

Don't understand the syntax Donor organism>Mouse-specific that is used in some of the bullet points in the above description. Clarify in PL.

@hannes-ucsc hannes-ucsc removed their assignment Oct 19, 2021
@theathorn
Author

theathorn commented Oct 19, 2021

I spoke with Parisa Nejad about this issue.

Donor organism>Mouse-specific refers to an entity in the metadata dictionary.

The wranglers would need to do more work to define a comprehensive list of the individual required fields.
However, Parisa thought that making the submitted metadata spreadsheet available for download (alongside the compact manifest) would allow for minimal changes to the manifest.

Why do some column titles in the manifest (e.g. file_sha256) not correspond to actual metadata schema definitions? Is file_sha256 the same as file_descriptor.sha256?

Need to document the meaning of all internally generated fields (e.g. file_drs_uri) in the manifest.

@dsotirho-ucsc
Contributor

@theathorn Should also add submissionDate, updateDate, aggregateSubmissionDate, aggregateUpdateDate.

@theathorn
Author

theathorn commented Oct 27, 2021

The "Phase 1" plan is to submit the Ingest metadata spreadsheet to TDR with some associated metadata so that it can be presented for download in the Data Browser alongside the project (compact) manifest.

"Phase 2" is to work out the full solution (e.g. JSON files or ???).

As part of the Phase 1 work let's look at which simple set of high priority changes we can make to improve the usability of the compact manifest.

@theathorn theathorn removed their assignment Oct 27, 2021
@melainalegaspi

@hannes-ucsc to reply to the individual points with respect to what is feasible for Phase 1 and what needs to be postponed to Phase 2.

@hannes-ucsc
Member

Do these columns contain true duplicated values

  • bundle_uuid, sequencing_process.provenance.document_id

This is incidental. It correlates with Ingest's decision to define a subgraph for each sequencing process. This can change and we shouldn't depend on this incidental correspondence.

  • file_document_id, file_uuid

I don't think this is universally true. To be verified.

  • cell_suspension.provenance.document_id, cell_suspension.biomaterial_core.biomaterial_id

This is not true. The document ID is allocated by Ingest. The biomaterial_id is filled in by the wrangler. They are usually not the same. To be verified.

  • donor_organism.biomaterial_core.biomaterial_id, donor_organism.provenance.document_id

Same. To be verified.

  • specimen_from_organism.provenance.document_id, sample.provenance.document_id, sample.biomaterial_core.biomaterial_id

Same. To be verified.

  • sequencing_input.provenance.document_id, sequencing_input.biomaterial_core.biomaterial_id

Same. To be verified.

Non-useful fields

  • bundle_uuid

This field is the subgraph ID aka links_id. It is crucially important.

  • bundle_version

This field is the subgraph version aka links version. It is crucially important.

  • file_version

This field is the version of the data file. It is crucially important for determining if a file changed.

Missing critical fields

  • cell_suspension.plate_based_sequencing.plate_label
  • cell_suspension.plate_based_sequencing.well_label
  • cell_suspension.plate_based_sequencing.well_quality
  • cell_suspension.timecourse.value
  • cell_suspension.timecourse.unit.text

We don't index these. Would be difficult to add. Implement as part of phase II.

  • donor_organism.gestational_age
  • donor_organism.is_living
  • donor_organism.human_specific.ethnicity.text
  • donor_organism.mouse_specific.strain.text

We don't index these. Seems important enough to add for phase I.

  • project.array_express_accessions
  • project.biostudies_accessions
  • project.dbgap_accessions
  • project.ega_accessions
  • project.geo_series_accessions
  • project.insdc_project_accessions
  • project.insdc_study_accessions
  • project.funders.grant_id
  • project.funders.organization
  • project_core.project_description
  • project_core.project_title

Won't fix. Since these are at the project level it would be highly duplicative. Many rows, or, if the manifest is for a single project, all rows, would contain the same values. The values tend to be large, aggravating the issue. If someone wants these values, they can use the /index/project endpoint.

  • protocol.type.text

We don't index this. Seems important enough to add for phase I.

  • protocol.protocol_core.protocol_id

We index this. Include in manifest.

Missing required fields:

  • cell_line.type

Not sure if we index this. Add for phase I.

  • cell_line.model_organ.text

We index this. Add for phase I.

  • cell_line.biomaterial_core.ncbi_taxon_id

This field is badly modeled. A cell line shouldn't have a taxon ID. Only whole organisms (donors) are of a species and the cell line should be linked to a donor organism.

HumanCellAtlas/metadata-schema#1415

  • cell_line.cell_cycle.text
  • cell_suspension.selected_cell_types.cell_type_ontology.text
  • cell_suspension.biomaterial_core.ncbi_taxon_id

See cell_line.biomaterial_core.ncbi_taxon_id

  • donor_organism.biomaterial_core.ncbi_taxon_id

We index this. Add for phase I.

  • organoid.biomaterial_core.ncbi_taxon_id

See cell_line.biomaterial_core.ncbi_taxon_id

  • specimen_from_organism.biomaterial_core.ncbi_taxon_id

See cell_line.biomaterial_core.ncbi_taxon_id

  • analysis_file.file_core.file_name
  • analysis_file.file_core.format
  • analysis_file.file_core.content_description.text
  • image_file.file_core.file_name
  • image_file.file_core.format
  • image_file.file_core.content_description.text
  • reference_file.assembly_type
  • reference_file.ncbi_taxon_id
  • reference_file.reference_type
  • reference_file.reference_version
  • reference_file.genus_species.text
  • reference_file.file_core.file_name
  • reference_file.file_core.format
  • reference_file.file_core.content_description.text
  • sequence_file.file_core.file_name
  • sequence_file.file_core.format
  • sequence_file.file_core.content_description.text
  • supplementary_file.file_core.file_name
  • supplementary_file.file_core.format
  • supplementary_file.file_core.content_description.text

We index these and include them in the manifest, just generically as file_name, content_description and file_format. To be verified.

  • library_preparation_protocol.end_bias
  • library_preparation_protocol.strand
  • library_preparation_protocol.input_nucleic_acid_molecule.text

Not sure if we index these. We should. Add for phase I.

  • sequencing_protocol.method.text
  • donor_organism.death.cause_of_death
  • library_preparation_protocol.input_nucleic_acid_molecule.text
  • imaging_protocol.probe.subcellular_structure.cellular_component_ontology.text

See other imaging fields below.

  • project.contributors.project_role.text
  • project.contributors.name
  • project.publications.publication.authors
  • project.publications.publication.official_hca_publication
  • project.publications.publication.title

See other project fields above.

  • imaging_protocol.channel.channel_id
  • imaging_protocol.channel.excitation_wavelength
  • imaging_protocol.channel.exposure_time
  • imaging_protocol.channel.filter_range
  • imaging_protocol.channel.multiplexed
  • imaging_protocol.channel.target_fluorophore
  • imaging_protocol.probe.channel_label
  • imaging_protocol.probe.fluorophore
  • imaging_protocol.probe.probe_label
  • imaging_protocol.probe.probe_sequence
  • imaging_protocol.probe.target_codebook_label
  • imaging_protocol.probe.target_label
  • imaging_protocol.probe.target_name
  • imaging_protocol.probe.assay_type.text
  • imaging_protocol.probe.probe_reagents.catalog_number
  • imaging_protocol.probe.probe_reagents.expiry_date
  • imaging_protocol.probe.probe_reagents.kit_titer
  • imaging_protocol.probe.probe_reagents.lot_number
  • imaging_protocol.probe.probe_reagents.manufacturer
  • imaging_protocol.probe.probe_reagents.retail_name
  • imaging_protocol.probe.subcellular_structure.text

There are no projects with imaging data. It would be pointless to add these fields.

Document the columns included in the manifest:

  • e.g. "project.provenance.document_id" is the project UUID

This is self-explanatory. Every entity has .provenance.document_id. It should be clear that this field is its ID. When looking at this value, it is obvious that it is a UUID.

Consolidating from #2781:
User Story: As an end-user that wants to use the HCA data I want to understand how fastq files are grouped together and understand the distribution of the barcodes in the files so that I can run my analysis pipelines automatically, extracting the information from the metadata:

  • library_preparation_protocol.cell_barcode.barcode_read
  • library_preparation_protocol.cell_barcode.barcode_offset
  • library_preparation_protocol.cell_barcode.barcode_length
  • library_preparation_protocol.cell_barcode.white_list_file
  • library_preparation_protocol.umi_barcode.barcode_read
  • library_preparation_protocol.umi_barcode.barcode_offset
  • library_preparation_protocol.umi_barcode.barcode_length
  • library_preparation_protocol.umi_barcode.white_list_file
  • sequence_file.lane_index

I believe we already index this and put it in the manifest. Just as a generically named column called file_read_index. To be verified.

@hannes-ucsc
Member

hannes-ucsc commented Nov 5, 2021

Spike again to verify my claims (look for "To be verified").

@hannes-ucsc hannes-ucsc removed their assignment Nov 5, 2021
@melainalegaspi melainalegaspi added spike:3 [process] Spike estimate of three points and removed spike:1 [process] Spike estimate of one point labels Nov 5, 2021
@dsotirho-ucsc
Contributor

dsotirho-ucsc commented Nov 22, 2021

Legend:

  • ✅ = Comment was verified
  • ❌ = Comment needs correction
  • ⚠️ = Action item

Do these columns contain true duplicated values

  • file_document_id, file_uuid

I don't think this is universally true. To be verified.

file_document_id is .provenance.document_id from the metadata, file_uuid is .file_id from the descriptor.

  • cell_suspension.provenance.document_id, cell_suspension.biomaterial_core.biomaterial_id

This is not true. The document ID is allocated by Ingest. The biomaterial_id is filled in by the wrangler. They are usually not the same. To be verified.

.provenance.document_id is an identifier for the document (e.g. 36a391f6-d118-4fd7-be51-9196b0f3184f). .biomaterial_core.biomaterial_id is a unique ID for the biomaterial (e.g. Hip9_well_4609).
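
The difference between the two example values can be checked mechanically. This is a minimal sketch using Python's standard `uuid` module; the helper name `looks_like_uuid` is hypothetical.

```python
import uuid


def looks_like_uuid(value: str) -> bool:
    """True if the value is a UUID in canonical hyphenated form."""
    try:
        # uuid.UUID also accepts braces or a missing hyphenation, so compare
        # against the canonical string to require the usual 8-4-4-4-12 form.
        return str(uuid.UUID(value)) == value.lower()
    except ValueError:
        return False


# The two example values quoted in the comment above.
print(looks_like_uuid("36a391f6-d118-4fd7-be51-9196b0f3184f"))
print(looks_like_uuid("Hip9_well_4609"))
```

The first call prints True and the second prints False, which matches the observation that the document ID is allocated by Ingest while the biomaterial ID is free text filled in by the wrangler.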

  • donor_organism.biomaterial_core.biomaterial_id, donor_organism.provenance.document_id

Same. To be verified.

  • specimen_from_organism.provenance.document_id, sample.provenance.document_id, sample.biomaterial_core.biomaterial_id

Same. To be verified.

✅ Caveat with "sample": Since "sample" is a virtual type assigned to downstream biomaterials (specimen, organoid, or cell line) at index time, the document_id of the sample may or may not be the same as the document_id of the specimen (or organoid, or cell line).

  • sequencing_input.provenance.document_id, sequencing_input.biomaterial_core.biomaterial_id

Same. To be verified.

Missing critical fields

  • cell_suspension.plate_based_sequencing.plate_label
  • cell_suspension.plate_based_sequencing.well_label
  • cell_suspension.plate_based_sequencing.well_quality
  • cell_suspension.timecourse.value
  • cell_suspension.timecourse.unit.text

We don't index these. Would be difficult to add. Implement as part of phase II.

⚠️ Index and add fields to compact manifest.

#3846
#3847

  • donor_organism.gestational_age
  • donor_organism.is_living
  • donor_organism.human_specific.ethnicity.text
  • donor_organism.mouse_specific.strain.text

We don't index these. Seems important enough to add for phase I.

⚠️ Index and add fields to compact manifest.

#3849

  • protocol.type.text

We don't index this. Seems important enough to add for phase I.

⚠️ Index and add field to compact manifest.

#3850

  • protocol.protocol_core.protocol_id

We index this. Include in manifest.

✅ Note: Field is currently only indexed for analysis_protocol documents.
⚠️ Potentially index the field for all protocol types and add it to the compact manifest.

#3851

Missing required fields:

  • cell_line.type

Not sure if we index this. Add for phase I.

✅ Field is currently indexed.
⚠️ Add field to compact manifest.

#3852

  • cell_line.model_organ.text

We index this. Add for phase I.

✅ Note: With ontologies we index ontology_label value.
⚠️ Add field to compact manifest.

#3853

  • donor_organism.biomaterial_core.ncbi_taxon_id

We index this. Add for phase I.

❌ We do not currently index the ncbi_taxon_id from any biomaterial.
⚠️ Index and add field to compact manifest.

#3854

  • analysis_file.file_core.file_name
  • analysis_file.file_core.format
  • image_file.file_core.file_name
  • image_file.file_core.format
  • reference_file.file_core.file_name
  • reference_file.file_core.format
  • sequence_file.file_core.file_name
  • sequence_file.file_core.format
  • supplementary_file.file_core.file_name
  • supplementary_file.file_core.format

We index these and include them in the manifest, just generically as file_name, content_description and file_format. To be verified.

  • analysis_file.file_core.content_description.text
  • image_file.file_core.content_description.text
  • reference_file.file_core.content_description.text
  • sequence_file.file_core.content_description.text
  • supplementary_file.file_core.content_description.text

⚠️ Add fields to compact manifest.

#3855

  • reference_file.assembly_type
  • reference_file.ncbi_taxon_id
  • reference_file.reference_type
  • reference_file.reference_version
  • reference_file.genus_species.text

⚠️ Index and add fields to compact manifest.

#3856

  • library_preparation_protocol.end_bias
  • library_preparation_protocol.strand
  • library_preparation_protocol.input_nucleic_acid_molecule.text

Not sure if we index these. We should. Add for phase I.

⚠️ Index and add fields to compact manifest.

#3857

  • sequence_file.lane_index

I believe we already index this and put it in the manifest. Just as a generically named column called file_read_index. To be verified.

❌ Field is not currently in the compact manifest.
⚠️ Add field to compact manifest.

#3858

@dsotirho-ucsc
Contributor

@hannes-ucsc : "I will review Daniel's findings and come up with a breakdown of tickets, and make this an epic of those tickets."

@hannes-ucsc hannes-ucsc added the epic [type] Issue consists of multiple smaller issues label Feb 11, 2022
@hannes-ucsc hannes-ucsc removed their assignment Feb 11, 2022
@hannes-ucsc
Member

De-prioritizing given the changed priorities WRT AnVIL.

@hannes-ucsc
Member

Compared to AnVIL, HCA has an order of magnitude more metadata properties. Due to the sheer number of properties that are being requested for inclusion, this can't be a use-case for the compact manifest. The verbatim JSONL manifest (part of epic #6121) we're working on will ultimately include all properties of all entities.

We should take another look at determining a curated list of important properties.
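
To illustrate what a curated list could look like in practice, here is a rough Python sketch that projects JSON Lines input onto a short list of dotted property paths. Everything in it is an assumption made for illustration: the layout of the verbatim JSONL manifest is not described in this thread, the nested sample document is made up, and `CURATED`, `get_path` and `project` are hypothetical names.

```python
import json

# Hypothetical curated selection, using property names from this thread.
CURATED = [
    "donor_organism.is_living",
    "donor_organism.biomaterial_core.ncbi_taxon_id",
]


def get_path(doc, dotted):
    """Follow a dotted path through nested dicts; None if any key is absent."""
    node = doc
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def project(jsonl_text, paths):
    """Return one flat dict per non-blank input line, limited to `paths`."""
    rows = []
    for line in jsonl_text.splitlines():
        if line.strip():
            doc = json.loads(line)
            rows.append({path: get_path(doc, path) for path in paths})
    return rows


# Made-up document; the real manifest layout may differ.
sample = json.dumps({
    "donor_organism": {
        "is_living": "yes",
        "biomaterial_core": {"ncbi_taxon_id": [9606]},
    }
})
print(project(sample, CURATED))
```

Keeping the curated list as data rather than code would make it easier for wranglers to review and revise it.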

@hannes-ucsc hannes-ucsc changed the title from "Improve usability of the compact manifest" to "Improve usability of the compact manifest for HCA" Apr 4, 2024
@bvizzier-ucsc

I haven't heard this discussed recently within the HCA team. As such, I'm going to assign it a low priority.

@bvizzier-ucsc bvizzier-ucsc added the -- [priority] Low label May 1, 2024

5 participants