
Model AD: Disease Correlation Data Transformation #179

Merged: 29 commits merged on Jun 10, 2025


@beatrizsaldana beatrizsaldana commented May 19, 2025

Problem

The disease correlation transformation needed a robust implementation that converts disease correlation results, model information, and allele information into a standardized format. It also had to handle edge cases and keep the data consistent across multiple mouse models and their associated genetic information. Jira ticket: MG-44

Solution

Implemented a data transformation pipeline that:

  1. Processes disease correlation results from multiple mouse models
  2. Integrates model and allele information for each model
  3. Handles data validation and error cases
  4. Transforms the data into the expected format

This PR also:

  1. Adds a utility function to validate inputs, and updates model_details.py and immunohisto_transform.py to use it
  2. Adds other utility functions that seemed useful beyond this specific transform

New Utility Functions

Added several utility functions to support data transformation and validation:

  1. check_required_datasets_and_columns

    • Validates the presence of required datasets and their columns
    • Raises descriptive ValueError if any required datasets or columns are missing
    • Used to ensure data completeness before processing
  2. flatten_list

    • Recursively flattens nested lists into a single-level list
    • Handles arbitrary depth of nesting
    • Preserves all non-list elements in their original order
    • Useful for processing nested data structures
  3. remove_duplicates_keep_order

    • Removes duplicate elements from a list while preserving original order
    • Handles mixed data types
    • Used for deduplication of gene lists
  4. create_lookup

    • Creates a nested dictionary lookup from a pandas DataFrame
    • Groups data by a specified column
    • Handles multiple values for the same key by creating lists
    • Useful for efficient data lookups and transformations
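
As a rough illustration of the behavior described above, here is a minimal sketch of three of these helpers. It is not the code from this PR: the signatures are assumptions, and `create_lookup` is written against plain row dictionaries (for a pandas DataFrame, `df.to_dict("records")` produces that shape) so the example runs without pandas.

```python
from collections import defaultdict
from typing import Hashable, Iterable


def flatten_list(items: list) -> list:
    """Recursively flatten nested lists, keeping non-list elements in order."""
    flat = []
    for item in items:
        if isinstance(item, list):
            flat.extend(flatten_list(item))  # descend into any depth of nesting
        else:
            flat.append(item)
    return flat


def remove_duplicates_keep_order(items: Iterable[Hashable]) -> list:
    """Drop repeated elements, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique


def create_lookup(records: Iterable[dict], key: str) -> dict:
    """Group row dictionaries by the value stored under `key` (assumed signature)."""
    lookup = defaultdict(list)
    for row in records:
        lookup[row[key]].append(row)  # several rows per key end up in one list
    return dict(lookup)


# Hypothetical sample values, used only to exercise the helpers.
genes = remove_duplicates_keep_order(flatten_list([["gene_a", "gene_b"], ["gene_a", ["gene_c"]]]))
print(genes)  # ['gene_a', 'gene_b', 'gene_c']

rows = [
    {"model": "model_a", "gene": "gene_a"},
    {"model": "model_a", "gene": "gene_b"},
    {"model": "model_b", "gene": "gene_c"},
]
by_model = create_lookup(rows, "model")
print(len(by_model["model_a"]))  # 2
```

Chaining `flatten_list` and `remove_duplicates_keep_order` is one plausible way to build a deduplicated gene list from nested allele data; how the PR actually combines them is not shown here.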

Test

Disease Correlation Tests

  1. Basic Valid Input

    • Tests the transformation of standard input data with multiple models
    • Verifies correct handling of cluster information, modules, and correlation values
    • Validates proper type conversion of correlation and p-values to float
    • Ensures correct grouping of results by model, age, and sex
  2. Duplicate Results Handling

    • Tests the system's ability to handle duplicate entries in disease correlation results
    • Verifies that duplicate results are properly processed without data loss
  3. Duplicate Allele Information

    • Tests handling of duplicate gene entries in allele information
    • Ensures proper deduplication of gene information
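
The grouping and type conversion these tests check can be pictured with a small sketch. The field names and values below are hypothetical and do not reflect the pipeline's real schema; the point is only that string-valued correlations and p-values become floats, results are bucketed by (model, age, sex), and duplicate rows are kept rather than dropped.

```python
from collections import defaultdict

# Hypothetical input rows: field names and values are illustrative only.
raw_results = [
    {"model": "model_a", "age": "4", "sex": "female", "cluster": "cluster_1",
     "correlation": "0.25", "p_value": "0.01"},
    {"model": "model_a", "age": "4", "sex": "female", "cluster": "cluster_1",
     "correlation": "0.25", "p_value": "0.01"},  # exact duplicate, kept on purpose
    {"model": "model_b", "age": "12", "sex": "male", "cluster": "cluster_2",
     "correlation": "-0.5", "p_value": "0.2"},
]


def group_results(rows):
    """Group rows by (model, age, sex) and convert numeric fields to float."""
    grouped = defaultdict(list)
    for row in rows:
        key = (row["model"], row["age"], row["sex"])
        grouped[key].append({
            "cluster": row["cluster"],
            "correlation": float(row["correlation"]),
            "p_value": float(row["p_value"]),
        })
    return dict(grouped)


grouped = group_results(raw_results)
print(len(grouped[("model_a", "4", "female")]))  # 2
```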

Error Test Cases

  1. Dataset Validation

    • Tests for missing required datasets (e.g., model_info)
    • Verifies detection of inconsistent model information
    • Ensures proper error messages for missing datasets
  2. Column Validation

    • Tests for missing required columns in input datasets
    • Validates proper error handling for incomplete data structures
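
For context, a validation helper of the kind these error tests exercise might look roughly like the sketch below. The function name comes from this PR, but the signature, argument shapes, and error wording are assumptions. Datasets are represented by any object with a `columns` attribute (a pandas DataFrame qualifies); `SimpleNamespace` is a stand-in so the example runs without pandas.

```python
from types import SimpleNamespace


def check_required_datasets_and_columns(datasets, required):
    """Raise ValueError if a required dataset or column is missing.

    Assumed shapes: `datasets` maps dataset name to an object with `.columns`;
    `required` maps dataset name to the list of column names it must have.
    """
    missing_datasets = [name for name in required if name not in datasets]
    if missing_datasets:
        raise ValueError(f"Missing required datasets: {missing_datasets}")

    for name, columns in required.items():
        missing_columns = [c for c in columns if c not in datasets[name].columns]
        if missing_columns:
            raise ValueError(
                f"Dataset '{name}' is missing required columns: {missing_columns}"
            )


# Hypothetical dataset with one required column absent.
model_info = SimpleNamespace(columns=["model"])
try:
    check_required_datasets_and_columns(
        {"model_info": model_info},
        {"model_info": ["model", "gene"]},
    )
except ValueError as err:
    print(err)
```

Failing before any processing starts keeps the error close to its cause, which fits the stated goal of ensuring data completeness up front.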

Utility Function Tests

  1. check_required_datasets_and_columns Tests

    • Tests successful validation with all required datasets and columns
    • Tests error handling for missing datasets
    • Tests error handling for missing columns
  2. flatten_list Tests

    • Tests empty list handling
    • Tests non-nested list handling
    • Tests single-level nested list handling
    • Tests multiple-level nested list handling
    • Tests mixed data type handling
  3. remove_duplicates_keep_order Tests

    • Tests empty list handling
    • Tests list with no duplicates
    • Tests list with duplicates
    • Tests mixed data type handling
    • Tests special case of True/1 equality
    • Tests order preservation
  4. create_lookup Tests

    • Tests creation of lookup dictionary from DataFrame
    • Tests handling of multiple values for the same key
    • Tests proper grouping of data
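
The True/1 case deserves a note. In Python, `True == 1` and both hash to the same value, so any hash-based deduplication treats them as the same element. The sketch below shows this with the standard `dict.fromkeys` idiom, followed by a hypothetical type-aware variant. Which behavior the PR's tests pin down is not stated in this description.

```python
mixed = [1, True, "a", 1.0, "a", None, 0, False]

# Hash-based deduplication: True and 1.0 collapse into the earlier 1,
# and False collapses into the earlier 0.
deduplicated = list(dict.fromkeys(mixed))
print(deduplicated)  # [1, 'a', None, 0]


def dedupe_by_type_and_value(items):
    """Hypothetical variant that treats 1, True, and 1.0 as distinct."""
    seen = set()
    unique = []
    for item in items:
        key = (type(item), item)  # include the type so bool and int differ
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


print(dedupe_by_type_and_value(mixed))  # [1, True, 'a', 1.0, None, 0, False]
```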

@beatrizsaldana beatrizsaldana self-assigned this May 19, 2025
@beatrizsaldana beatrizsaldana requested a review from a team as a code owner May 19, 2025 22:27
@beatrizsaldana beatrizsaldana requested a review from BWMac May 23, 2025 04:11

@BWMac BWMac left a comment

Just a few more comments but I'll pre-approve. Great work!

@beatrizsaldana

I'll wait to merge after @jaclynbeck-sage approves :)

sonarqubecloud bot commented Jun 3, 2025


@jaclynbeck-sage jaclynbeck-sage left a comment

Looks good! Sorry about the delay in reviewing :)

@beatrizsaldana beatrizsaldana merged commit 06bfc5c into dev Jun 10, 2025
11 checks passed