Model AD: Disease Correlation Data Transformation #179

beatrizsaldana · 2025-05-19T22:27:32Z

Problem

The disease correlation transform needed a robust implementation that converts disease correlation results, model information, and allele information into a standardized format. It has to handle edge cases and keep the data consistent while processing multiple mouse models and their associated genetic information. Jira ticket: MG-44

Solution

Implemented a data transformation pipeline that:

- Processes disease correlation results from multiple mouse models
- Integrates model and allele information for each model
- Handles data validation and error cases
- Transforms the data into the expected format

Created a utility function to validate inputs, and edited model_details.py and immunohisto_transform.py to use it.
Added other utility functions that seemed useful beyond this specific transform.

New Utility Functions

Added several utility functions to support data transformation and validation:

check_required_datasets_and_columns
- Validates the presence of required datasets and their columns
- Raises descriptive ValueError if any required datasets or columns are missing
- Used to ensure data completeness before processing
flatten_list
- Recursively flattens nested lists into a single-level list
- Handles arbitrary depth of nesting
- Preserves all non-list elements in their original order
- Useful for processing nested data structures
remove_duplicates_keep_order
- Removes duplicate elements from a list while preserving original order
- Handles mixed data types
- Used for deduplication of gene lists
create_lookup
- Creates a nested dictionary lookup from a pandas DataFrame
- Groups data by a specified column
- Handles multiple values for the same key by creating lists
- Useful for efficient data lookups and transformations
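
To make the list-handling behavior described above concrete, here is a minimal sketch of how flatten_list and remove_duplicates_keep_order could work. It is written from the descriptions in this PR, not copied from utils.py, so the signatures and the gene names in the example are assumptions. The deduplication helper here also assumes hashable items.

```python
from typing import Any, Hashable, Iterable, List


def flatten_list(nested: List[Any]) -> List[Any]:
    """Recursively flatten nested lists, keeping non-list items in order."""
    flat: List[Any] = []
    for item in nested:
        if isinstance(item, list):
            # Recurse so that any depth of nesting is handled.
            flat.extend(flatten_list(item))
        else:
            flat.append(item)
    return flat


def remove_duplicates_keep_order(items: Iterable[Hashable]) -> List[Hashable]:
    """Drop repeated items, keeping the first occurrence of each."""
    seen = set()
    unique: List[Hashable] = []
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique


# Illustrative gene list with nesting and repeats (hypothetical data).
genes = flatten_list(["App", ["Psen1", ["App", "Trem2"]], "Psen1"])
print(genes)  # ['App', 'Psen1', 'App', 'Trem2', 'Psen1']
print(remove_duplicates_keep_order(genes))  # ['App', 'Psen1', 'Trem2']
```

Flattening first and deduplicating second keeps the two concerns separate, which matches how the helpers are listed here as independent utilities.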

Test

Disease Correlation Tests

Basic Valid Input
- Tests the transformation of standard input data with multiple models
- Verifies correct handling of cluster information, modules, and correlation values
- Validates proper type conversion of correlation and p-values to float
- Ensures correct grouping of results by model, age, and sex
Duplicate Results Handling
- Tests the system's ability to handle duplicate entries in disease correlation results
- Verifies that duplicate results are properly processed without data loss
Duplicate Allele Information
- Tests handling of duplicate gene entries in allele information
- Ensures proper deduplication of gene information
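
The grouping and type conversion these tests check can be illustrated with a small sketch. It uses plain dictionaries rather than pandas DataFrames, and every field name, model name, and value below is hypothetical. The actual transform_disease_correlation() may be structured quite differently.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

# Hypothetical input rows; numeric fields arrive as strings on purpose.
rows = [
    {"model": "5xFAD", "age": "4 months", "sex": "Female",
     "cluster": "Cluster A", "correlation": "0.42", "p_value": "0.001"},
    {"model": "5xFAD", "age": "4 months", "sex": "Female",
     "cluster": "Cluster B", "correlation": "-0.10", "p_value": "0.3"},
    {"model": "3xTg-AD", "age": "12 months", "sex": "Male",
     "cluster": "Cluster A", "correlation": "0.05", "p_value": "0.8"},
]


def group_results(
    records: List[Dict[str, str]],
) -> Dict[Tuple[str, str, str], List[Dict[str, Any]]]:
    """Group rows by (model, age, sex) and cast numeric fields to float."""
    grouped: Dict[Tuple[str, str, str], List[Dict[str, Any]]] = defaultdict(list)
    for row in records:
        key = (row["model"], row["age"], row["sex"])
        grouped[key].append(
            {
                "cluster": row["cluster"],
                "correlation": float(row["correlation"]),
                "p_value": float(row["p_value"]),
            }
        )
    return dict(grouped)


grouped = group_results(rows)
for key, results in grouped.items():
    print(key, len(results))
```

A test in this style can then assert on the number of groups, the number of results per group, and the Python type of each converted value.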

Error Test Cases

Dataset Validation
- Tests for missing required datasets (e.g., model_info)
- Verifies detection of inconsistent model information
- Ensures proper error messages for missing datasets
Column Validation
- Tests for missing required columns in input datasets
- Validates proper error handling for incomplete data structures
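
As a rough illustration of the validation behavior these error cases exercise, the sketch below checks for missing datasets and then for missing columns, raising ValueError with a descriptive message. The function name check_required_sketch, its signature, and the message wording are assumptions, not the real check_required_datasets_and_columns(). SimpleNamespace stands in for any object exposing a `columns` attribute, such as a pandas DataFrame.

```python
from types import SimpleNamespace
from typing import Any, Dict, List


def check_required_sketch(
    datasets: Dict[str, Any], required: Dict[str, List[str]]
) -> None:
    """Raise ValueError if a required dataset or column is absent."""
    missing_datasets = [name for name in required if name not in datasets]
    if missing_datasets:
        raise ValueError(
            f"Missing required datasets: {', '.join(missing_datasets)}"
        )
    for name, columns in required.items():
        # Each dataset only needs to expose a `columns` collection.
        missing = [col for col in columns if col not in datasets[name].columns]
        if missing:
            raise ValueError(
                f"Dataset '{name}' is missing required columns: "
                f"{', '.join(missing)}"
            )


# Stand-in dataset with hypothetical column names.
datasets = {"disease_correlation": SimpleNamespace(columns=["name", "age", "sex"])}
required = {"disease_correlation": ["name", "age"], "model_info": ["name"]}

try:
    check_required_sketch(datasets, required)
except ValueError as err:
    print(err)  # Missing required datasets: model_info
```

Checking for missing datasets before columns means the column check never has to guard against a dataset that is not there.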

Utility Function Tests

check_required_datasets_and_columns Tests
- Tests successful validation with all required datasets and columns
- Tests error handling for missing datasets
- Tests error handling for missing columns
flatten_list Tests
- Tests empty list handling
- Tests non-nested list handling
- Tests single-level nested list handling
- Tests multiple-level nested list handling
- Tests mixed data type handling
remove_duplicates_keep_order Tests
- Tests empty list handling
- Tests list with no duplicates
- Tests list with duplicates
- Tests mixed data type handling
- Tests special case of True/1 equality
- Tests order preservation
create_lookup Tests
- Tests creation of lookup dictionary from DataFrame
- Tests handling of multiple values for the same key
- Tests proper grouping of data
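
The True/1 special case mentioned for remove_duplicates_keep_order comes from a general Python property: `True == 1` and both hash to the same value, so any set- or dict-based deduplication treats them as one element. The sketch below shows that behavior, along with one possible type-aware alternative. Which behavior the PR's test pins down is not stated here, and dedupe_type_aware is a hypothetical helper, not part of the PR.

```python
from typing import Any, Hashable, Iterable, List

values = [1, True, "1", 1.0, 0, False]

# Dict keys preserve insertion order; equal keys collapse to the first one seen.
collapsed = list(dict.fromkeys(values))
print(collapsed)  # [1, '1', 0]


def dedupe_type_aware(items: Iterable[Hashable]) -> List[Any]:
    """Deduplicate while treating values of different types as distinct."""
    seen = set()
    unique: List[Any] = []
    for item in items:
        key = (type(item), item)  # the type is part of the identity
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


print(dedupe_type_aware(values))  # [1, True, '1', 1.0, 0, False]
```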

BWMac

Just a few more comments but I'll pre-approve. Great work!

src/agoradatatools/etl/transform/disease_correlation.py

src/agoradatatools/etl/utils.py

tests/transform/test_disease_correlation.py

beatrizsaldana · 2025-05-28T02:09:11Z

I'll wait to merge after @jaclynbeck-sage approves :)

src/agoradatatools/etl/transform/model_details.py

tests/test_assets/disease_correlation/output/disease_correlation.json

sonarqubecloud · 2025-06-03T22:04:52Z

Quality Gate passed

Issues
56 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

jaclynbeck-sage

Looks good! Sorry about the delay in reviewing :)

Beatriz Saldana added 11 commits May 8, 2025 14:10

Created check_required_datasets() and check_required_columns() functions and updated some transform functions to use them

7b18beb

Completed disease_correlation transform

aeb2605

Formatting changes

e986818

Merged the required datasets and columns validation functions into a single function check_required_datasets_and_columns()

24fb53a

Added tests for check_required_datasets_and_columns() in test_utils.py

73b4b1a

Improved transform_disease_correlation() docstring and comments

d24e950

Added more tests for transform_disease_correlation

273265c

Fixed lookup creation for disease_correlation

c171d34

Added test for create_lookup() function

d8036be

Properly managed values of type str that are actually lists

6a76e49

All disease correlation related tests are passing

affee90

beatrizsaldana self-assigned this May 19, 2025

beatrizsaldana requested a review from a team as a code owner May 19, 2025 22:27

Beatriz Saldana added 3 commits May 19, 2025 15:29

Added comment to test_flatten_list_mixed_types() to make behavior more explicit

ccb4bbc

Merge remote-tracking branch 'origin/dev' into beatrizsaldana/MG-44/ADT-Disease-Correlation-ETL

12116a9

Pre-commit

c28fe08

BWMac reviewed May 20, 2025

src/agoradatatools/etl/utils.py (outdated, resolved)

src/agoradatatools/etl/utils.py (outdated, resolved)

src/agoradatatools/etl/transform/disease_correlation.py (resolved)

beatrizsaldana and others added 5 commits May 20, 2025 12:40

Update src/agoradatatools/etl/utils.py

8ee88a0

Replacing not == with !=. Co-authored-by: Brad Macdonald <52762200+BWMac@users.noreply.github.com>

Moved create_lookup() out of utils and into disease_correlation

d9a28ee

Merge branch 'beatrizsaldana/MG-44/ADT-Disease-Correlation-ETL' of https://github.com/Sage-Bionetworks/agora-data-tools into beatrizsaldana/MG-44/ADT-Disease-Correlation-ETL

9cf56ce

Extracted code from transform_disease_correlation and put it in helper functions as recommended by Brad

24205dc

Formatting changes

cc8a481

beatrizsaldana requested a review from BWMac May 23, 2025 04:11

BWMac approved these changes May 23, 2025

Beatriz Saldana added 6 commits May 27, 2025 18:28

Merging dev

88c282b

Cleaned up utils code

7955019

Added tests for input_validation_model_info()

6a50d7c

Cleaned up test formatting

2b0a4ac

Changed required column name from gene to modified_gene

f83daf5

Cleaned up code in model_details

9dfae4e

beatrizsaldana requested a review from jaclynbeck-sage May 28, 2025 01:55

jaclynbeck-sage reviewed May 28, 2025

src/agoradatatools/etl/transform/model_details.py (resolved)

jaclynbeck-sage reviewed May 28, 2025

tests/test_assets/disease_correlation/output/disease_correlation.json (outdated, resolved)

Beatriz Saldana added 4 commits June 2, 2025 18:06

Merging dev

d5614df

Using check_required_datasets_and_columns() in model_details

23c7fdf

Updating required input to match expectations

981fb3b

Addressing PR comments: made disease correlation test assets smaller

2989593

beatrizsaldana requested a review from jaclynbeck-sage June 3, 2025 22:07

jaclynbeck-sage approved these changes Jun 10, 2025

beatrizsaldana merged commit 06bfc5c into dev Jun 10, 2025
11 checks passed
