148 changes: 108 additions & 40 deletions README.md
@@ -17,70 +17,138 @@

## Description
This project aims to demonstrate deep learning methodologies for EHR data.
The use case is to predict different outcomes for patients in the ICU. The dataset is the [MIMIC-IV demo](https://physionet.org/content/mimic-iv-demo/2.2/), containing de-identified health-related data from 140 patients admitted to critical care units.

## Installation
### Install the latest package version with pip
```bash
pip3 install -U deepehrgraph
```

### Commands
Display available subcommands:
```bash
python3 -m deepehrgraph.main --help
```

## Dataset
The dataset is described on the [MIMIC-IV demo dataset](https://physionet.org/content/mimic-iv-demo/2.2/) page.

This dataset contains de-identified health-related data from 140 patients admitted to critical care units. The data includes demographics, vital signs, laboratory tests, medications, and more. The data is stored in a relational database, and the data schema is described in the [MIMIC-IV documentation](https://mimic.mit.edu/docs/iv/).

Note that we only have access to the [hosp](https://mimic.mit.edu/docs/iv/modules/hosp/) and [icu](https://mimic.mit.edu/docs/iv/modules/icu/) compressed files.

### Generate main dataset from compressed files

```bash
python -m deepehrgraph.main dataset
```
This step will download the archive files from physionet and generate the master dataset in the `data` folder (by default) as a CSV file named `mimic_iv_demo_master_dataset.csv`.


Several pre-computation steps are performed to generate this master dataset:
- CCI and ECI indexes are calculated and added to the dataset.
- Outcomes for patients are calculated and added to the dataset.


These pre-computations have been adapted from this [repository](https://github.com/nliulab/mimic4ed-benchmark) specifically for the MIMIC-IV demo dataset.

Categorical features are identified and encoded with [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).
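
As an illustration (a minimal sketch, not the repository's actual code; the column names are hypothetical, not the dataset's real schema), the encoding step looks roughly like this:

```python
# Hypothetical sketch of the categorical encoding step; column names
# are illustrative, not the actual MIMIC-IV demo schema.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"gender": ["F", "M", "M"], "insurance": ["A", "B", "A"]})

# Fit one LabelEncoder per categorical column and replace values in place.
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col])

print(df)  # each category becomes a small integer code
```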

In the context of medical studies, CCI (Charlson Comorbidity Index) and ECI (Elixhauser Comorbidity Index) are tools used to assess the burden of comorbidities in individuals.
Comorbidities refer to the presence of additional health conditions in a patient alongside the primary condition under investigation. Both CCI and ECI are designed to quantify and summarize the impact of comorbidities on patient health. **These features seem to be good candidates for our prediction task.**


### EDA: Features and outcomes analysis
Run a simple EDA with the command below to get:
- basic information about dataset datatypes
- missing values count
- correlation matrix
- outcomes distribution

```bash
python -m deepehrgraph.main eda
```

`Correlation Matrix`
![Correlation Matrix](assets/correlation_matrix.png)


Here are the top 25 correlated feature pairs:
| Variable 1       | Variable 2       | Correlation |
|------------------|------------------|-------------|
| cci_Renal        | eci_Renal        | 1.000000    |
| cci_Rheumatic    | eci_Rheumatic    | 1.000000    |
| n_ed_365d        | n_icu_365d       | 1.000000    |
| cci_Cancer2      | eci_Tumor2       | 1.000000    |
| cci_Paralysis    | eci_Paralysis    | 1.000000    |
| n_ed_30d         | n_icu_30d        | 1.000000    |
| cci_PUD          | eci_PUD          | 1.000000    |
| n_ed_90d         | n_icu_90d        | 1.000000    |
| cci_Pulmonary    | eci_Pulmonary    | 1.000000    |
| cci_CHF          | eci_CHF          | 1.000000    |
| cci_Dementia     | cci_Paralysis    | 1.000000    |
| cci_PVD          | eci_PVD          | 1.000000    |
| cci_Dementia     | eci_Paralysis    | 1.000000    |
| cci_DM1          | eci_DM2          | 0.971825    |
| cci_Cancer1      | eci_Tumor1       | 0.949788    |
| cci_DM2          | eci_DM1          | 0.931891    |
| n_icu_90d        | n_icu_365d       | 0.927516    |
| n_ed_90d         | n_icu_365d       | 0.927516    |
| n_ed_365d        | n_icu_90d        | 0.927516    |
| n_ed_90d         | n_ed_365d        | 0.927516    |
| cci_Liver1       | eci_Liver        | 0.875261    |
| eci_HTN1         | eci_Renal        | 0.815725    |
| cci_Renal        | eci_HTN1         | 0.815725    |
| n_hosp_30d       | n_hosp_90d       | 0.807012    |
| n_ed_30d         | n_ed_90d         | 0.795026    |

Some features are highly correlated, which could lead to poor model performance:
- multicollinearity that makes it difficult to interpret the individual impact of each variable on the target
- model instability and increased sensitivity to small changes in the data
- overfitting

We will try to address this situation by using feature selection techniques, as sketched below.
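
For example, a minimal sketch (using a hypothetical `drop_highly_correlated` helper; the approach we actually explore later is PCA) that drops one feature from each highly correlated pair:

```python
# Sketch: drop one feature of every pair whose absolute correlation
# exceeds a threshold. Not the pipeline's final strategy (PCA is).
import numpy as np
import pandas as pd


def drop_highly_correlated(features: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    corr = features.corr().abs()
    # Keep only the upper triangle so each pair is examined once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)
```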

`Outcomes Repartition`
![Outcomes Repartition](assets/outcomes_repartition.png)

Based on these first results, we will try to predict the following outcome: `in-hospital mortality`.
Note that we face a class imbalance problem for this outcome, which can hurt prediction performance; we will need a pre-processing step to address it.
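
As one possible remedy (a sketch only; whether to use class weights or resampling is still an open choice here), balanced class weights can be derived with scikit-learn:

```python
# Sketch: compute balanced class weights for an imbalanced binary
# outcome. The toy labels stand in for outcome_inhospital_mortality.
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight

y = pd.Series([0] * 90 + [1] * 10)
weights = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)
print(dict(zip(np.unique(y), weights)))  # {0: ~0.56, 1: 5.0} -> minority upweighted
```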

## PCA for Collinear Feature Reduction

### Overview

Principal Component Analysis (PCA) is a technique for reducing the dimensionality of your dataset while retaining most of its original information. It's particularly useful for handling collinear features and improving model performance.

### Understanding Cumulative Variance

- **Explained Variance Ratio:**
- Each principal component explains a certain proportion of the total variance. The cumulative explained variance is the sum of these individual variances, representing the overall information retained.

- **Choosing Components:**
  - Decide on the number of components based on the desired cumulative explained variance (e.g., 95%). This choice balances dimensionality reduction with information preservation, as the sketch below illustrates.
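
A minimal sketch of that choice on synthetic data (the 95% threshold matches the project default):

```python
# Sketch: retain the smallest number of components whose cumulative
# explained variance reaches the threshold. Data here is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = make_classification(n_samples=200, n_features=20, random_state=0)
pca = PCA().fit(StandardScaler().fit_transform(X))

cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cumulative >= 0.95)) + 1
print(n_components)  # number of components to retain
```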

`Cumulative Explained Variance`
![Cumulative Explained Variance](assets/explained_variance.png)

### Expectations for Final Correlation Matrix

- **Correlation Among Features:**
- The final correlation matrix of the PCA result features ideally shows reduced correlations between features. Principal components are designed to be orthogonal, minimizing multicollinearity.

- **Near-Zero Correlations:**
- Aim for near-zero correlations in the PCA result features, indicating that each component captures unique information.
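
As a quick sanity check (a sketch on synthetic data, not the project pipeline), the off-diagonal correlations of PCA scores should be numerically close to zero:

```python
# Sketch: principal component scores are orthogonal by construction,
# so their pairwise correlations should be near zero. Synthetic data.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = make_classification(n_samples=200, n_features=20, random_state=0)
scores = PCA(n_components=5).fit_transform(StandardScaler().fit_transform(X))

corr = pd.DataFrame(scores).corr().to_numpy()
max_off_diag = np.abs(corr - np.eye(corr.shape[0])).max()
print(max_off_diag)  # expect a value close to 0
```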

`Correlation Matrix PCA`
![Correlation Matrix PCA](assets/correlation_pca.png)

**37 features (principal components) will be retained for training the model.** This number is based on a cumulative explained variance threshold of 95%.

## Model architecture


## Resources
Binary file added assets/correlation_matrix.png
Binary file added assets/correlation_pca.png
Binary file added assets/explained_variance.png
Binary file added assets/outcomes_repartition.png
54 changes: 42 additions & 12 deletions deepehrgraph/dataset/eda.py
@@ -7,6 +7,7 @@

from deepehrgraph.dataset.dataset import EHDRDataset
from deepehrgraph.logger import get_logger
from deepehrgraph.training.enums import OutcomeType

logger = get_logger(__name__)

@@ -21,8 +22,7 @@ def _heatmap(correlation_matrix: pd.DataFrame):

def _print_info(dataframe: pd.DataFrame):
    """Print info about the dataframe."""
    # DataFrame.info() writes to stdout and returns None, so logging its
    # return value would only log "None"; call it directly instead.
    dataframe.info()


def _display_correlation_matrix(dataframe: pd.DataFrame):
@@ -31,13 +31,31 @@ def _display_correlation_matrix(dataframe: pd.DataFrame):
    _heatmap(correlation_matrix)


def _get_redundant_pairs(dataframe):
    """Get diagonal and lower triangular pairs of correlation matrix."""
    pairs_to_drop = set()
    cols = dataframe.columns
    for i in range(0, dataframe.shape[1]):
        for j in range(0, i + 1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop


def _get_top_abs_correlations(dataframe, n=5):
    """Get top absolute correlations."""
    au_corr = dataframe.corr().abs().unstack()
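    # Drop self-correlations and lower-triangle duplicates so each
    # feature pair is counted only once.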
    labels_to_drop = _get_redundant_pairs(dataframe)
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    return au_corr[0:n]


def _display_outcomes_class_repartition(outcomes: pd.DataFrame):
    """Display outcomes class repartition."""
    fig, axes = plt.subplots(1, len(list(OutcomeType)), figsize=(15, 4))
    for i, outcome in enumerate(list(OutcomeType)):
        axes[i].hist(outcomes.astype(int)[outcome.value], bins=[0, 0.5, 1.5])
        axes[i].set_title(f"{outcome.value}")
    plt.tight_layout()
    plt.show()


@@ -48,9 +66,21 @@ def eda(namespace: argparse.Namespace) -> None:

logger.info("Features info:")
_print_info(ehr_dataset.features)
_display_correlation_matrix(ehr_dataset.features)

logger.info("Outcomes info:")
_print_info(ehr_dataset.outcomes)

_display_linear_dependency(ehr_dataset.features, "cci_Liver2", "cci_Liver1")
logger.info("Features Missing values count:")
logger.info(ehr_dataset.features.isnull().sum().sum())
logger.info("Outcomes Missing values count:")
logger.info(ehr_dataset.outcomes.isnull().sum().sum())

logger.info("Features Correlation matrix:")
_display_correlation_matrix(ehr_dataset.features)
logger.info("Compute Top of Correlation matrix:")

n = 5
top_abs_correlated = _get_top_abs_correlations(ehr_dataset.features, 25)
logger.info(f"Top {n} Correlated features : \n {top_abs_correlated}")

logger.info("Display outcomes class repartition:")
_display_outcomes_class_repartition(ehr_dataset.outcomes)
78 changes: 77 additions & 1 deletion deepehrgraph/dataset/features_selection.py
@@ -1,11 +1,15 @@
"""Feature selection for the dataset."""
import argparse

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from deepehrgraph.dataset.dataset import EHDRDataset
from deepehrgraph.dataset.eda import _display_correlation_matrix
from deepehrgraph.logger import get_logger

logger = get_logger(__name__)
@@ -35,8 +39,80 @@ def select_kbest_features(
    return selected_feature_names


def _reduce_colinear_features(
    features, desired_explained_variance=0.95, display_plot=True
):
    """
    Reduce colinear features using PCA.

    Parameters:
    - features (pd.DataFrame): Input DataFrame containing
      features and target variable.
    - desired_explained_variance (float): Desired cumulative
      explained variance threshold (default is 0.95).
    - display_plot (bool): Whether to display the cumulative
      explained variance plot (default is True).

    Returns:
    - X_pca_retained (pd.DataFrame): Transformed DataFrame with retained components.
    """

    # Standardize the features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(features)

    # Perform PCA (fit only; the transform happens after component selection)
    pca = PCA()
    pca.fit(X_scaled)

    # Find the smallest number of components whose cumulative explained
    # variance meets the desired threshold (argmax returns the index of
    # the first True entry).
    cumulative_explained_variance = pca.explained_variance_ratio_.cumsum()
    num_components_to_retain = int(
        np.argmax(cumulative_explained_variance >= desired_explained_variance) + 1
    )

    # Retain components
    pca = PCA(n_components=num_components_to_retain)
    X_pca_retained = pca.fit_transform(X_scaled)

    # Display the result (implicit string concatenation avoids the stray
    # whitespace a backslash continuation embeds in the message)
    logger.info(
        "Number of components to retain for "
        f"{desired_explained_variance * 100}% explained "
        f"variance: {num_components_to_retain}"
    )

    # Plotting explained variance ratio if display_plot is True
    if display_plot:
        plt.plot(
            range(1, len(cumulative_explained_variance) + 1),
            cumulative_explained_variance,
            marker="o",
            linestyle="--",
        )
        plt.title("Cumulative Explained Variance")
        plt.xlabel("Number of Principal Components")
        plt.ylabel("Cumulative Explained Variance")
        plt.show()

    return pd.DataFrame(
        X_pca_retained, columns=[f"PC{i+1}" for i in range(num_components_to_retain)]
    )


def features_selection(namespace: argparse.Namespace) -> None:
    """Feature selection for the MIMIC-IV demo dataset."""

    logger.info(f"Features selection phase : {namespace}")

    logger.info("Load EHRDataset:")
    ehr_dataset = EHDRDataset(download=False)

    select_kbest_features(ehr_dataset, 10, "outcome_inhospital_mortality")
    reduced_features = _reduce_colinear_features(
        features=ehr_dataset.features,
        # Honor the CLI flag added in main.py instead of hardcoding 0.95.
        desired_explained_variance=namespace.desired_explained_variance,
        display_plot=True,
    )

    logger.info("Display correlation matrix on selected features:")
    _display_correlation_matrix(reduced_features)
7 changes: 7 additions & 0 deletions deepehrgraph/main.py
@@ -45,6 +45,13 @@ def main() -> None:
default="data",
help="Directory name to store the dataset.",
)
parser_feat_selection.add_argument(
"--desired-explained-variance",
type=float,
default=0.95,
help="Desired explained variance threshold (default is 0.95) \
for features reduction.",
)
parser_feat_selection.set_defaults(func=features_selection)

    parser_train = subparsers.add_parser(