Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change hashing strategy for StudyLocusId generation in StudyLocus object #3448

Closed
3 of 6 tasks
project-defiant opened this issue Sep 9, 2024 · 5 comments · Fixed by opentargets/gentropy#783
Closed
3 of 6 tasks
Assignees
Labels
Data Relates to Open Targets data team

Comments

@project-defiant
Copy link

project-defiant commented Sep 9, 2024

As a developer I want change the hashing strategy to md5 when calculating StudyLocusId because current implementation requires casting big integer to a string to be ingested by the platform.

Background

This task is required due to the fact we do not create proper hashes within StudyLocusId field of StudyLocus object.

Tasks

  • Fix column names that are required to generate the hash (studyId, variantId and finemappingMethod).
  • Change schema of the StudyLocus object to have StudyLocusId as a StringType.
  • Change hashing strategy.
  • Recreate hashes of all datasource credible_set datasets.

Acceptance tests

  • When all of staging bucket credilble_sets have md5 hashes recalculated
  • StudyLocusId field should be stringType
@project-defiant project-defiant self-assigned this Sep 9, 2024
@project-defiant
Copy link
Author

@DSuveges FYI

@d0choa d0choa added Data Relates to Open Targets data team Gentropy labels Sep 16, 2024
@d0choa d0choa assigned vivienho and unassigned project-defiant Sep 23, 2024
@d0choa
Copy link
Contributor

d0choa commented Sep 23, 2024

Reassigning to @vivienho based on discussions with @DSuveges

@DSuveges the ticket might not have the right context, so @vivienho might need some extra guidelines

The change in the hashes will break all the data compatibility, so @project-defiant, you might want to keep an eye on it

Related to #3535

@project-defiant
Copy link
Author

@d0choa we should not merge this unless we are sure we can reparse all indices for credible sets, or recalculate entire study_locus

@DSuveges
Copy link

I think we'll re-calculate the entire dataset a few time till the release.

@d0choa
Copy link
Contributor

d0choa commented Sep 23, 2024

It's never a good time to make this change. But it will get a lot worse soon

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Data Relates to Open Targets data team
Projects
None yet
4 participants