148 changes: 108 additions & 40 deletions README.md
@@ -17,70 +17,138 @@

## Description
This project aims to demonstrate deep learning methodologies for EHR data.
The use case is to predict different outcomes for patients in the ICU. The dataset is the [MIMIC-IV demo](https://physionet.org/content/mimic-iv-demo/2.2/), containing de-identified health-related data from 140 patients admitted to critical care units.

## Installation
### Install the latest package version with pip
```bash
pip3 install -U deepehrgraph
```

### Commands
Display available subcommands:
```bash
python3 -m deepehrgraph.main --help
```

## Dataset
The dataset is described on the [MIMIC-IV demo dataset](https://physionet.org/content/mimic-iv-demo/2.2/) page.

This dataset contains de-identified health-related data from 140 patients admitted to critical care units. The data includes demographics, vital signs, laboratory tests, medications, and more. The data is stored in a relational database, and the data schema is described in the [MIMIC-IV documentation](https://mimic.mit.edu/docs/iv/).

Note that we only have access to the [hosp](https://mimic.mit.edu/docs/iv/modules/hosp/) and [icu](https://mimic.mit.edu/docs/iv/modules/icu/) compressed files.

### Generate main dataset from compressed files

```bash
python -m deepehrgraph.main dataset
```
This step will download the archive files from physionet and generate the master dataset in the `data` folder (by default) as a CSV file named `mimic_iv_demo_master_dataset.csv`.


Several pre-computation steps are performed to generate this master dataset:
- CCI and ECI indexes are calculated and added to the dataset.
- Outcomes for patients are calculated and added to the dataset.


These pre-computations have been adapted from this [repository](https://github.com/nliulab/mimic4ed-benchmark) specifically for the MIMIC-IV demo dataset.

Categorical features are identified and encoded with [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).
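
As an illustration (a minimal sketch, not the repository's actual code; the column names are hypothetical, not the dataset's real schema), the encoding step looks roughly like this:

```python
# Hypothetical sketch of the categorical encoding step; column names
# are illustrative, not the actual MIMIC-IV demo schema.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"gender": ["F", "M", "M"], "insurance": ["A", "B", "A"]})

# Fit one LabelEncoder per categorical column and replace values in place.
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col])

print(df)  # each category becomes a small integer code
```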

In the context of medical studies, CCI (Charlson Comorbidity Index) and ECI (Elixhauser Comorbidity Index) are tools used to assess the burden of comorbidities in individuals.
Comorbidities refer to the presence of additional health conditions in a patient alongside the primary condition under investigation. Both CCI and ECI are designed to quantify and summarize the impact of comorbidities on patient health. **These features seem to be good candidates for our prediction task.**


### EDA: Features and outcomes analysis
Run a simple EDA with the command below to get:
- basic information about dataset datatypes
- missing values count
- correlation matrix
- outcomes distribution

```bash
python -m deepehrgraph.main eda
```

`Correlation Matrix`
![Correlation Matrix](assets/correlation_matrix.png)


Here are the top 25 correlated feature pairs:
| Variable 1       | Variable 2       | Correlation |
|------------------|------------------|-------------|
| cci_Renal        | eci_Renal        | 1.000000    |
| cci_Rheumatic    | eci_Rheumatic    | 1.000000    |
| n_ed_365d        | n_icu_365d       | 1.000000    |
| cci_Cancer2      | eci_Tumor2       | 1.000000    |
| cci_Paralysis    | eci_Paralysis    | 1.000000    |
| n_ed_30d         | n_icu_30d        | 1.000000    |
| cci_PUD          | eci_PUD          | 1.000000    |
| n_ed_90d         | n_icu_90d        | 1.000000    |
| cci_Pulmonary    | eci_Pulmonary    | 1.000000    |
| cci_CHF          | eci_CHF          | 1.000000    |
| cci_Dementia     | cci_Paralysis    | 1.000000    |
| cci_PVD          | eci_PVD          | 1.000000    |
| cci_Dementia     | eci_Paralysis    | 1.000000    |
| cci_DM1          | eci_DM2          | 0.971825    |
| cci_Cancer1      | eci_Tumor1       | 0.949788    |
| cci_DM2          | eci_DM1          | 0.931891    |
| n_icu_90d        | n_icu_365d       | 0.927516    |
| n_ed_90d         | n_icu_365d       | 0.927516    |
| n_ed_365d        | n_icu_90d        | 0.927516    |
| n_ed_90d         | n_ed_365d        | 0.927516    |
| cci_Liver1       | eci_Liver        | 0.875261    |
| eci_HTN1         | eci_Renal        | 0.815725    |
| cci_Renal        | eci_HTN1         | 0.815725    |
| n_hosp_30d       | n_hosp_90d       | 0.807012    |
| n_ed_30d         | n_ed_90d         | 0.795026    |

Some features are highly correlated, which could lead to poor model performance:
- multicollinearity that makes it difficult to interpret the individual impact of each variable on the target
- model instability and increased sensitivity to small changes in the data
- overfitting

We will try to address this situation by using feature selection techniques, as sketched below.
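
For example, a minimal sketch (using a hypothetical `drop_highly_correlated` helper; the approach we actually explore later is PCA) that drops one feature from each highly correlated pair:

```python
# Sketch: drop one feature of every pair whose absolute correlation
# exceeds a threshold. Not the pipeline's final strategy (PCA is).
import numpy as np
import pandas as pd


def drop_highly_correlated(features: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    corr = features.corr().abs()
    # Keep only the upper triangle so each pair is examined once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)
```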

`Outcomes Repartition`
![Outcomes Repartition](assets/outcomes_repartition.png)

Based on these first results, we will try to predict the following outcome: `in-hospital mortality`.
Note that we face a class imbalance problem for this outcome, which can hurt prediction performance; we will need a pre-processing step to address it.
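
As one possible remedy (a sketch only; whether to use class weights or resampling is still an open choice here), balanced class weights can be derived with scikit-learn:

```python
# Sketch: compute balanced class weights for an imbalanced binary
# outcome. The toy labels stand in for outcome_inhospital_mortality.
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight

y = pd.Series([0] * 90 + [1] * 10)
weights = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)
print(dict(zip(np.unique(y), weights)))  # {0: ~0.56, 1: 5.0} -> minority upweighted
```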

## PCA for Collinear Feature Reduction

### Overview

Principal Component Analysis (PCA) is a technique for reducing the dimensionality of your dataset while retaining most of its original information. It's particularly useful for handling collinear features and improving model performance.

### Understanding Cumulative Variance

- **Explained Variance Ratio:**
- Each principal component explains a certain proportion of the total variance. The cumulative explained variance is the sum of these individual variances, representing the overall information retained.

- **Choosing Components:**
  - Decide on the number of components based on the desired cumulative explained variance (e.g., 95%). This choice balances dimensionality reduction with information preservation, as the sketch below illustrates.
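
A minimal sketch of that choice on synthetic data (the 95% threshold matches the project default):

```python
# Sketch: retain the smallest number of components whose cumulative
# explained variance reaches the threshold. Data here is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = make_classification(n_samples=200, n_features=20, random_state=0)
pca = PCA().fit(StandardScaler().fit_transform(X))

cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cumulative >= 0.95)) + 1
print(n_components)  # number of components to retain
```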

`Cumulative Explained Variance`
![Cumulative Explained Variance](assets/explained_variance.png)

### Expectations for Final Correlation Matrix

- **Correlation Among Features:**
- The final correlation matrix of the PCA result features ideally shows reduced correlations between features. Principal components are designed to be orthogonal, minimizing multicollinearity.

- **Near-Zero Correlations:**
- Aim for near-zero correlations in the PCA result features, indicating that each component captures unique information.
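
As a quick sanity check (a sketch on synthetic data, not the project pipeline), the off-diagonal correlations of PCA scores should be numerically close to zero:

```python
# Sketch: principal component scores are orthogonal by construction,
# so their pairwise correlations should be near zero. Synthetic data.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = make_classification(n_samples=200, n_features=20, random_state=0)
scores = PCA(n_components=5).fit_transform(StandardScaler().fit_transform(X))

corr = pd.DataFrame(scores).corr().to_numpy()
max_off_diag = np.abs(corr - np.eye(corr.shape[0])).max()
print(max_off_diag)  # expect a value close to 0
```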

`Correlation Matrix PCA`
![Correlation Matrix PCA](assets/correlation_pca.png)

**37 features (principal components) will be retained for training the model.** This number is based on a cumulative explained variance threshold of 95%.

## Model architecture


## Resources
Binary file added assets/correlation_matrix.png
Binary file added assets/correlation_pca.png
Binary file added assets/explained_variance.png
Binary file added assets/outcomes_repartition.png
54 changes: 42 additions & 12 deletions deepehrgraph/dataset/eda.py
@@ -7,6 +7,7 @@

from deepehrgraph.dataset.dataset import EHDRDataset
from deepehrgraph.logger import get_logger
from deepehrgraph.training.enums import OutcomeType

logger = get_logger(__name__)

@@ -21,8 +22,7 @@ def _heatmap(correlation_matrix: pd.DataFrame):

def _print_info(dataframe: pd.DataFrame):
    """Print info about the dataframe."""
    # DataFrame.info() writes to stdout and returns None, so logging its
    # return value would only log "None"; call it directly instead.
    dataframe.info()


def _display_correlation_matrix(dataframe: pd.DataFrame):
@@ -31,13 +31,31 @@ def _display_correlation_matrix(dataframe: pd.DataFrame):
    _heatmap(correlation_matrix)


def _get_redundant_pairs(dataframe):
    """Get diagonal and lower triangular pairs of correlation matrix."""
    pairs_to_drop = set()
    cols = dataframe.columns
    for i in range(0, dataframe.shape[1]):
        for j in range(0, i + 1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop


def _get_top_abs_correlations(dataframe, n=5):
    """Get top absolute correlations."""
    au_corr = dataframe.corr().abs().unstack()
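    # Drop self-correlations and lower-triangle duplicates so each
    # feature pair is counted only once.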
    labels_to_drop = _get_redundant_pairs(dataframe)
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    return au_corr[0:n]


def _display_outcomes_class_repartition(outcomes: pd.DataFrame):
    """Display outcomes class repartition."""
    fig, axes = plt.subplots(1, len(list(OutcomeType)), figsize=(15, 4))
    for i, outcome in enumerate(list(OutcomeType)):
        axes[i].hist(outcomes.astype(int)[outcome.value], bins=[0, 0.5, 1.5])
        axes[i].set_title(f"{outcome.value}")
    plt.tight_layout()
    plt.show()


@@ -48,9 +66,21 @@ def eda(namespace: argparse.Namespace) -> None:

logger.info("Features info:")
_print_info(ehr_dataset.features)
_display_correlation_matrix(ehr_dataset.features)

logger.info("Outcomes info:")
_print_info(ehr_dataset.outcomes)

_display_linear_dependency(ehr_dataset.features, "cci_Liver2", "cci_Liver1")
logger.info("Features Missing values count:")
logger.info(ehr_dataset.features.isnull().sum().sum())
logger.info("Outcomes Missing values count:")
logger.info(ehr_dataset.outcomes.isnull().sum().sum())

logger.info("Features Correlation matrix:")
_display_correlation_matrix(ehr_dataset.features)
logger.info("Compute Top of Correlation matrix:")

n = 5
top_abs_correlated = _get_top_abs_correlations(ehr_dataset.features, 25)
logger.info(f"Top {n} Correlated features : \n {top_abs_correlated}")

logger.info("Display outcomes class repartition:")
_display_outcomes_class_repartition(ehr_dataset.outcomes)
78 changes: 77 additions & 1 deletion deepehrgraph/dataset/features_selection.py
@@ -1,11 +1,15 @@
"""Feature selection for the dataset."""
import argparse

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from deepehrgraph.dataset.dataset import EHDRDataset
from deepehrgraph.dataset.eda import _display_correlation_matrix
from deepehrgraph.logger import get_logger

logger = get_logger(__name__)
@@ -35,8 +39,80 @@ def select_kbest_features(
    return selected_feature_names


def _reduce_colinear_features(
    features, desired_explained_variance=0.95, display_plot=True
):
    """
    Reduce colinear features using PCA.

    Parameters:
    - features (pd.DataFrame): Input DataFrame containing
      features and target variable.
    - desired_explained_variance (float): Desired cumulative
      explained variance threshold (default is 0.95).
    - display_plot (bool): Whether to display the cumulative
      explained variance plot (default is True).

    Returns:
    - X_pca_retained (pd.DataFrame): Transformed DataFrame with retained components.
    """

    # Standardize the features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(features)

    # Perform PCA (fit only; the transform happens after component selection)
    pca = PCA()
    pca.fit(X_scaled)

    # Find the smallest number of components whose cumulative explained
    # variance meets the desired threshold (argmax returns the index of
    # the first True entry).
    cumulative_explained_variance = pca.explained_variance_ratio_.cumsum()
    num_components_to_retain = int(
        np.argmax(cumulative_explained_variance >= desired_explained_variance) + 1
    )

    # Retain components
    pca = PCA(n_components=num_components_to_retain)
    X_pca_retained = pca.fit_transform(X_scaled)

    # Display the result (implicit string concatenation avoids the stray
    # whitespace a backslash continuation embeds in the message)
    logger.info(
        "Number of components to retain for "
        f"{desired_explained_variance * 100}% explained "
        f"variance: {num_components_to_retain}"
    )

    # Plotting explained variance ratio if display_plot is True
    if display_plot:
        plt.plot(
            range(1, len(cumulative_explained_variance) + 1),
            cumulative_explained_variance,
            marker="o",
            linestyle="--",
        )
        plt.title("Cumulative Explained Variance")
        plt.xlabel("Number of Principal Components")
        plt.ylabel("Cumulative Explained Variance")
        plt.show()

    return pd.DataFrame(
        X_pca_retained, columns=[f"PC{i+1}" for i in range(num_components_to_retain)]
    )


def features_selection(namespace: argparse.Namespace) -> None:
    """Feature selection for the MIMIC-IV demo dataset."""

    logger.info(f"Features selection phase : {namespace}")

    logger.info("Load EHRDataset:")
    ehr_dataset = EHDRDataset(download=False)

    select_kbest_features(ehr_dataset, 10, "outcome_inhospital_mortality")
    reduced_features = _reduce_colinear_features(
        features=ehr_dataset.features,
        # Honor the CLI flag added in main.py instead of hardcoding 0.95.
        desired_explained_variance=namespace.desired_explained_variance,
        display_plot=True,
    )

    logger.info("Display correlation matrix on selected features:")
    _display_correlation_matrix(reduced_features)
7 changes: 7 additions & 0 deletions deepehrgraph/main.py
@@ -45,6 +45,13 @@ def main() -> None:
default="data",
help="Directory name to store the dataset.",
)
parser_feat_selection.add_argument(
"--desired-explained-variance",
type=float,
default=0.95,
help="Desired explained variance threshold (default is 0.95) \
for features reduction.",
)
parser_feat_selection.set_defaults(func=features_selection)

    parser_train = subparsers.add_parser(