feat: change StudyLocusId hashing method to md5 (and change StudyLocusId to string type) #783

Merged (19 commits) on Sep 30, 2024

@vivienho vivienho commented Sep 24, 2024

✨ Context

StudyLocusId is used as an identifier and does not need to be numerical. Changing it to string will make it easier on the backend side. Hashing strategy is changed to md5, which returns strings.

🛠 What does this PR implement

StudyLocusId is changed to string type in the schema and at relevant locations (mostly in tests).

The hashing strategy for generating the StudyLocusId is changed to md5.
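
As a rough illustration of why the new identifier is a string, here is a plain-Python analogue that uses `hashlib` instead of Spark. The function name `md5_identifier` and the sample values are hypothetical; the code in this PR operates on Spark columns rather than Python strings.

```python
import hashlib


def md5_identifier(*values: str) -> str:
    """Concatenate the values and return their MD5 digest as a hex string."""
    joined = "".join(values)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# Hypothetical study and variant identifiers, for illustration only.
study_locus_id = md5_identifier("STUDY_1", "1_12345_A_T")

# The identifier is a fixed-length hexadecimal string, not a number.
print(type(study_locus_id).__name__, len(study_locus_id))
```

The same inputs always give the same identifier, so it can be recomputed wherever the defining values are available.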

A test (test_assign_study_locus_id__null_variant_id) was removed as validation steps elsewhere should have dropped null variant id cases before the assign_study_locus_id function.

🙈 Missing

🚦 Before submitting

  • Do these changes cover one single feature (one change at a time)?
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes?
  • Did you make sure there is no commented out code in this PR?
  • Did you follow conventional commits standards in PR title and commit messages?
  • Did you make sure the branch is up-to-date with the dev branch?
  • Did you write any new necessary tests?
  • Did you make sure the changes pass local tests (make test)?
  • Did you make sure the changes pass pre-commit rules (e.g. poetry run pre-commit run --all-files)?

@vivienho vivienho marked this pull request as ready for review September 24, 2024 22:11
@vivienho vivienho requested a review from DSuveges September 25, 2024 08:11
DSuveges commented Sep 25, 2024

Given that you were done so fast, I would recommend making the logic more abstract so that the identifier generation can be generalised (we will need this in other datasets, e.g. l2g). If you look closely, you don't really need the arguments here:

    def assign_study_locus_id(
        study_id_col: Column,
        variant_id_col: Column,
        finemapping_col: Column = None,
    ) -> Column:

Because you have this list:

    columns = [study_id_col, variant_id_col, finemapping_col]

This can be a simple array of column names, which I believe should be a class attribute of the StudyLocus dataset. The class itself would then define which columns need to be hashed for the identifier, and in which order. Also, I think the actual hashing logic:

    hashable_columns = [
        f.when(column.cast("string").isNull(), f.lit("None"))
        .otherwise(column.cast("string"))
        for column in columns
    ]
    return f.md5(f.concat(*hashable_columns))

should be in the Dataset class. And the assign_study_locus_id method should wrap that function:

    @classmethod
    def assign_study_locus_id(cls) -> Column:
        # Delegate to the generic hashing helper defined on the Dataset class.
        return cls._generate_identifier(cls.uniqueness_defining_columns).alias("studyLocusId")

Where:

  • _generate_identifier is the hashing function in Dataset class, can be called from any dataset.
  • uniqueness_defining_columns is a class attribute defined for the given dataset.
  • This method returns the alias of the column, which is also dataset specific.
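
The proposal above could be sketched roughly as follows. This is a plain-Python stand-in that hashes dictionaries instead of Spark columns, so it runs without a Spark session. The `identifier_name` attribute, the `assign_identifier` method and the sample row are assumptions made for illustration, and `finemappingMethod` is only a guess at the column behind `finemapping_col`.

```python
import hashlib
from typing import ClassVar, Mapping, Optional


class Dataset:
    """Base class that owns the generic identifier-hashing logic."""

    # Subclasses declare which columns define uniqueness, and in which order.
    uniqueness_defining_columns: ClassVar[tuple[str, ...]] = ()
    # Hypothetical attribute: name of the identifier column for the dataset.
    identifier_name: ClassVar[str] = "id"

    @classmethod
    def _generate_identifier(cls, row: Mapping[str, Optional[object]]) -> str:
        # Missing values become the literal "None", as in the snippet above.
        parts = [
            "None" if row.get(column) is None else str(row[column])
            for column in cls.uniqueness_defining_columns
        ]
        return hashlib.md5("".join(parts).encode("utf-8")).hexdigest()

    @classmethod
    def assign_identifier(cls, row: Mapping[str, Optional[object]]) -> dict:
        # Return a copy of the row with the dataset-specific identifier added.
        return {**row, cls.identifier_name: cls._generate_identifier(row)}


class StudyLocus(Dataset):
    uniqueness_defining_columns = ("studyId", "variantId", "finemappingMethod")
    identifier_name = "studyLocusId"


# Hypothetical row, for illustration only.
row = {"studyId": "STUDY_1", "variantId": "1_12345_A_T", "finemappingMethod": None}
print(StudyLocus.assign_identifier(row)["studyLocusId"])
```

With this layout, another dataset would only need to declare its own `uniqueness_defining_columns` and `identifier_name` to get an identifier built the same way.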

I have a tendency to over abstract everything, so it would be great to have a second opinion on this from @d0choa .

@DSuveges DSuveges mentioned this pull request Sep 27, 2024
@@ -1109,7 +1109,7 @@ def from_source(
         """
         return StudyLocusGWASCatalog(
             _df=gwas_associations.withColumn(
-                "studyLocusId", f.monotonically_increasing_id().cast(LongType())
+                "studyLocusId", f.monotonically_increasing_id().cast(StringType())
This is not a dealbreaker and has no impact whatsoever, but this column is not a "real" studyLocusId: it is generated temporarily to identify the original rows of the GWAS Catalog association dataset before explosion. But it is fine.

@DSuveges DSuveges left a comment

A lot of changes, and they all look great. Let's hope nothing breaks. 🤞🏻

@DSuveges DSuveges merged commit 5c58e58 into dev Sep 30, 2024
5 checks passed
@DSuveges DSuveges deleted the vh-3448 branch September 30, 2024 14:47
Development

Successfully merging this pull request may close these issues.

Convert StudyLocusId to String
Change hashing strategy for StudyLocusId generation in StudyLocus object
2 participants