Field mapping docs #9
base: master
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,4 +1,102 @@ | ||
| """ENCODE metadata TSV client and C2M2 transformation service.""" | ||
| """ENCODE metadata TSV client and CFDB transformation service. | ||
|
|
||
| Fetches the released-experiment metadata TSV from ENCODE and transforms | ||
| each row into a CFDB file document. | ||
|
|
||
| Metadata URL | ||
| ------------ | ||
| https://www.encodeproject.org/metadata/?type=Experiment&status=released | ||
|
|
||
| Field Mapping (ENCODE TSV → CFDB) | ||
| ---------------------------------- | ||
|
|
||
| File | ||
| ~~~~ | ||
| File accession → local_id | ||
| File download URL → access_url, filename (derived) | ||
| File format → file_format (EDAM-mapped) | ||
| Output type → data_type (EDAM-mapped), output_type | ||
| Assay → assay_type (OBI-mapped) | ||
| Size → size_in_bytes | ||
| md5sum → md5 | ||
| File Status → status | ||
| Experiment date released → creation_time | ||
|
Member
Note: check if "release" date in 4DN also maps to this key
||
| File accession → persistent_id (derived URL) | ||
| File assembly → genome_assembly | ||
| Genome annotation → genome_annotation | ||
| File format type → output_type_detail | ||
| Biological replicate(s) → biological_replicates | ||
| Technical replicate(s) → technical_replicates | ||
|
|
||
| Enriched File | ||
| ~~~~~~~~~~~~~ | ||
| Read length → extra.encode.read_length | ||
| Mapped read length → extra.encode.mapped_read_length | ||
| Run type → extra.encode.run_type | ||
| Paired end → extra.encode.paired_end | ||
| Paired with → extra.encode.paired_with | ||
| Index of → extra.encode.index_of | ||
| Derived from → extra.encode.derived_from | ||
| Controlled by → extra.encode.controlled_by | ||
| s3_uri → extra.encode.s3_uri | ||
| Azure URL → extra.encode.azure_url | ||
| File analysis title → extra.encode.file_analysis_title | ||
| File analysis status → extra.encode.file_analysis_status | ||
| Audit WARNING → extra.encode.audit_warning | ||
| Audit NOT_COMPLIANT → extra.encode.audit_not_compliant | ||
| Audit ERROR → extra.encode.audit_error | ||
|
|
||
| Collection | ||
| ~~~~~~~~~~ | ||
| Experiment accession → collections[].local_id, name, persistent_id | ||
| Lab → collections[].lab | ||
| Assay → collections[].experiment_type | ||
| Experiment target → collections[].experiment_target | ||
| Library made from → collections[].analyte_class | ||
|
|
||
| Enriched Collection | ||
| ~~~~~~~~~~~~~~~~~~~ | ||
| Project → collections[].extra.encode.project | ||
| Platform → collections[].extra.encode.platform | ||
| dbxrefs → collections[].extra.encode.dbxrefs | ||
| RBNS protein concentration → collections[].extra.encode.rbns_protein_concentration | ||
|
|
||
| Biosample | ||
| ~~~~~~~~~ | ||
| Biosample term name → collections[].biosamples[].local_id | ||
| Biosample term id / term name → collections[].biosamples[].anatomy | ||
|
|
||
| Enriched Biosample | ||
| ~~~~~~~~~~~~~~~~~~ | ||
| Biosample type → …biosamples[].extra.encode.biosample_type | ||
| Biosample treatments → …biosamples[].extra.encode.biosample_treatments | ||
| Biosample treatments amount → …biosamples[].extra.encode.biosample_treatments_amount | ||
| Biosample treatments duration → …biosamples[].extra.encode.biosample_treatments_duration | ||
| Biosample genetic mods (*) → …biosamples[].extra.encode.biosample_genetic_modifications | ||
| Library made from → …biosamples[].extra.encode.library_made_from | ||
| Library depleted in → …biosamples[].extra.encode.library_depleted_in | ||
| Library extraction method → …biosamples[].extra.encode.library_extraction_method | ||
| Library lysis method → …biosamples[].extra.encode.library_lysis_method | ||
| Library crosslinking method → …biosamples[].extra.encode.library_crosslinking_method | ||
| Library strand specific → …biosamples[].extra.encode.library_strand_specific | ||
| Library fragmentation method → …biosamples[].extra.encode.library_fragmentation_method | ||
| Library size range → …biosamples[].extra.encode.library_size_range | ||
|
|
||
| (*) Full TSV column: "Biosample genetic modifications methods/categories/ | ||
| targets/gene targets/site coordinates/zygosity" | ||
|
|
||
| Subject | ||
| ~~~~~~~ | ||
| Donor(s) → collections[].biosamples[].subjects[].local_id | ||
| Biosample organism → collections[].biosamples[].subjects[].taxonomy | ||
|
|
||
| DCC | ||
| ~~~ | ||
| Static / config-derived: dcc.id, dcc.dcc_name, dcc.dcc_abbreviation, | ||
| dcc.dcc_description, dcc.contact_email, | ||
| dcc.contact_name, dcc.dcc_url, | ||
| dcc.project_id_namespace, dcc.project_local_id | ||
| """ | ||
|
|
||
| import logging | ||
| import re | ||
|
|
||
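As a rough illustration of the ENCODE mapping documented above, the following sketch turns one TSV row into a nested document. It covers only a subset of the listed columns and skips the EDAM/OBI lookups and the derived `persistent_id`. The function name `row_to_file_doc`, the helper `_clean`, the nested layout inferred from the dotted paths, and the sample row are all assumptions for illustration, not the code in this PR.

```python
import csv
import io
from posixpath import basename
from urllib.parse import urlparse

# Subset of the documented mapping: TSV column -> top-level CFDB key.
FILE_FIELDS = {
    "File accession": "local_id",
    "File download URL": "access_url",
    "md5sum": "md5",
    "File Status": "status",
    "Experiment date released": "creation_time",
    "File assembly": "genome_assembly",
}
# Subset of the "Enriched File" mapping: TSV column -> key under extra.encode.
ENRICHED_FIELDS = {
    "Read length": "read_length",
    "Run type": "run_type",
    "s3_uri": "s3_uri",
}


def _clean(row, column):
    """Return the stripped cell value, or '' if the column is missing or empty."""
    return (row.get(column) or "").strip()


def row_to_file_doc(row):
    """Hypothetical transform of one TSV row (a dict) into a file document."""
    doc = {key: _clean(row, col) for col, key in FILE_FIELDS.items() if _clean(row, col)}
    if "access_url" in doc:
        # filename is derived from the last path segment of the download URL.
        doc["filename"] = basename(urlparse(doc["access_url"]).path)
    size = _clean(row, "Size")
    if size.isdigit():
        doc["size_in_bytes"] = int(size)
    extra = {key: _clean(row, col) for col, key in ENRICHED_FIELDS.items() if _clean(row, col)}
    if extra:
        doc["extra"] = {"encode": extra}
    experiment = _clean(row, "Experiment accession")
    if experiment:
        collection = {"local_id": experiment, "name": experiment}
        if _clean(row, "Lab"):
            collection["lab"] = _clean(row, "Lab")
        doc["collections"] = [collection]
    return doc


# Made-up sample row; accessions and values are placeholders.
_HEADER = ["File accession", "File download URL", "Size", "md5sum",
           "File Status", "Experiment accession", "Lab", "Read length"]
_VALUES = ["ENCFF000AAA",
           "https://www.encodeproject.org/files/ENCFF000AAA/@@download/ENCFF000AAA.bam",
           "1024", "d41d8cd98f00b204e9800998ecf8427e",
           "released", "ENCSR000AAA", "Example Lab", "36"]
SAMPLE_TSV = "\t".join(_HEADER) + "\n" + "\t".join(_VALUES) + "\n"

docs = [row_to_file_doc(r) for r in csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t")]
```

Empty cells are dropped rather than stored as empty strings, which is one possible choice; the actual service may keep them or normalize them differently.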
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,4 +1,68 @@ | ||
| """HuBMAP Search API integration for access level metadata and enrichment.""" | ||
| """HuBMAP Search API integration for access level metadata and enrichment. | ||
|
|
||
| Fetches dataset, donor, and file metadata from the HuBMAP Search API | ||
| (Elasticsearch-backed) to enrich C2M2-materialized documents. Three | ||
| enrichment targets run during sync: collections and subjects | ||
| (pre-materialization) and files (post-materialization). | ||
|
|
||
| API URLs | ||
| -------- | ||
| Bulk dataset search (search_after pagination): | ||
| https://search.api.hubmapconsortium.org/v3/portal/search | ||
| Entity lookup (single UUID): | ||
| https://search.api.hubmapconsortium.org/v3/entities/{uuid} | ||
|
|
||
| Entity Matching | ||
| --------------- | ||
| Collection persistent_id matches doi_url | ||
| Subject local_id contains donor uuid | ||
| File matched via collection doi_url → dataset, then filename | ||
|
|
||
| Field Mapping (HuBMAP Search API → CFDB) | ||
| ------------------------------------------ | ||
|
|
||
| File (post-materialization) | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
| data_access_level → data_access_level | ||
| ingest_metadata.workflow_description → genome_assembly (regex-derived) | ||
|
|
||
| Enriched File | ||
| ~~~~~~~~~~~~~ | ||
| files[].rel_path → extra.hubmap.rel_path | ||
| files[].is_data_product → extra.hubmap.is_data_product | ||
|
|
||
| Collection (pre-materialization) | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
| dataset_type → collections[].experiment_type | ||
| analyte_class → collections[].analyte_class | ||
|
|
||
| Enriched Collection | ||
| ~~~~~~~~~~~~~~~~~~~ | ||
| pipeline → collections[].extra.hubmap.pipeline | ||
| processing → collections[].extra.hubmap.processing | ||
| group_name → collections[].extra.hubmap.group_name | ||
| visualization → collections[].extra.hubmap.visualization | ||
| vitessce-hints → collections[].extra.hubmap.vitessce_hints | ||
| metadata → collections[].extra.hubmap.metadata | ||
|
Collaborator
Author
We don't populate the promoted
||
|
|
||
| Enriched Subject (pre-materialization, from donor.mapped_metadata) | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
| sex → subjects[].extra.hubmap.sex | ||
| race → subjects[].extra.hubmap.race | ||
| age_value → subjects[].extra.hubmap.age_value | ||
| age_unit → subjects[].extra.hubmap.age_unit | ||
| height_value → subjects[].extra.hubmap.height_value | ||
| height_unit → subjects[].extra.hubmap.height_unit | ||
| weight_value → subjects[].extra.hubmap.weight_value | ||
| weight_unit → subjects[].extra.hubmap.weight_unit | ||
| body_mass_index_value → subjects[].extra.hubmap.body_mass_index_value | ||
| body_mass_index_unit → subjects[].extra.hubmap.body_mass_index_unit | ||
| cause_of_death → subjects[].extra.hubmap.cause_of_death | ||
| death_event → subjects[].extra.hubmap.death_event | ||
| mechanism_of_injury → subjects[].extra.hubmap.mechanism_of_injury | ||
| medical_history → subjects[].extra.hubmap.medical_history | ||
| social_history → subjects[].extra.hubmap.social_history | ||
| """ | ||
|
|
||
| import asyncio | ||
| import logging | ||
|
|
||
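The docstring above mentions `search_after` pagination for the bulk dataset search. A minimal sketch of that loop follows, assuming a standard Elasticsearch response shape (`hits.hits[]` with `_source` and `sort`). The query filter, the sort key, the function name `iter_datasets`, and the injected `search` callable are assumptions made so the example runs without network access; they are not taken from this PR.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

# A callable that sends a search body and returns the parsed JSON response.
SearchFn = Callable[[Dict[str, Any]], Dict[str, Any]]


def iter_datasets(search: SearchFn, page_size: int = 500) -> Iterator[Dict[str, Any]]:
    """Yield every hit's _source, paging with search_after until a page is empty."""
    search_after: Optional[List[Any]] = None
    while True:
        body: Dict[str, Any] = {
            "size": page_size,
            "query": {"term": {"entity_type.keyword": "Dataset"}},  # assumed filter
            "sort": [{"uuid.keyword": "asc"}],  # assumed unique, stable sort key
        }
        if search_after is not None:
            body["search_after"] = search_after
        hits = search(body).get("hits", {}).get("hits", [])
        if not hits:
            return
        for hit in hits:
            yield hit["_source"]
        # The sort values of the last hit become the cursor for the next page.
        search_after = hits[-1]["sort"]


def fake_search(body: Dict[str, Any]) -> Dict[str, Any]:
    """In-memory stand-in for the Search API, used only to exercise the loop."""
    records = [{"uuid": f"uuid-{i}"} for i in range(5)]
    after = body.get("search_after")
    start = 0
    if after is not None:
        start = next(i for i, r in enumerate(records) if r["uuid"] == after[0]) + 1
    page = records[start:start + body["size"]]
    return {"hits": {"hits": [{"_source": r, "sort": [r["uuid"]]} for r in page]}}


uuids = [d["uuid"] for d in iter_datasets(fake_search, page_size=2)]
```

In a real client, `search` would wrap an HTTP POST to the bulk search URL listed in the docstring; passing it in as a parameter keeps the paging logic testable on its own.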
This is fine. Note that there is also a `s3_uri` key that could be a substitute or fallback. It might be more performant too.

Just noticed that they provide Azure URLs too now. And you include both in enriched. Good!
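
The fallback idea raised in these comments could look roughly like the sketch below. It assumes the comment refers to the `access_url` field, and that the preference order is download URL, then `s3_uri`, then Azure URL; both the order and the function name `pick_access_url` are hypothetical.

```python
from typing import Mapping, Optional, Sequence

# Assumed preference order over the TSV columns named in the mapping.
DEFAULT_ORDER = ("File download URL", "s3_uri", "Azure URL")


def pick_access_url(row: Mapping[str, Optional[str]],
                    order: Sequence[str] = DEFAULT_ORDER) -> Optional[str]:
    """Return the first non-empty URL among the candidate columns, else None."""
    for column in order:
        value = (row.get(column) or "").strip()
        if value:
            return value
    return None
```

Reordering `order` would make `s3_uri` the primary source instead of a fallback, if it does turn out to be more performant.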