Author: Kevin Thomas
License: Apache 2.0
This report presents the development and evaluation of a binary classification model designed to predict heart attack occurrences within a national healthcare dataset. Leveraging data on demographics, health conditions, and preventive measures, the model aims to enhance insights into population health, highlighting key risk factors associated with heart attacks.
1. Continuous Variables Analysis
- Height and Weight Distributions: HeightInMeters approximates a normal distribution, while WeightInKilograms skews slightly towards higher values. BMI exhibits a right-skew, indicating a predominance of healthy to overweight individuals, with fewer in the higher BMI ranges.
- Prevalence of Health Conditions: Binary health conditions, such as HadAngina and HadStroke, are generally rare, with most individuals unaffected. However, conditions like HadArthritis, AlcoholDrinkers, and ChestScan show higher prevalence, indicating they are more commonly observed.
- Vaccine and Preventive Measures: Variables like FluVaxLast12, PneumoVaxEver, HighRiskLastYear, and CovidPos suggest that most individuals have not experienced these events, with a small proportion indicating positive cases.
2. Categorical Variables Analysis
- Geographic Distribution: The “Count of State” chart shows a wide geographic distribution of respondents, with higher counts in populous states like California and Texas.
- Gender Representation: The distribution between “Male” and “Female” is nearly balanced.
- General Health Ratings: Most respondents rate their health as “Very Good” or “Good,” with fewer reporting “Poor” health, suggesting a largely healthy population.
- Age Distribution: Middle-aged and senior groups are well-represented, reflecting a slight skew towards older age categories.
- Diabetes and Smoking Status: A majority report not having diabetes, though a notable segment does. Smoking data shows that most respondents have “Never smoked,” with e-cigarette usage relatively low.
- Race and Preventive Care: “White only, Non-Hispanic” is the predominant racial group. Tetanus vaccination data suggests some gaps in preventive care awareness, as many respondents are unsure about their shot type.
3. Advanced Visualizations
- Demographic and Health Patterns: State-level data shows variation in respondent counts, with larger numbers in states like Washington and Texas. Health ratings skew towards “Very Good” or “Good.” Age distribution favors older groups, particularly those aged 60-79, with diabetes prevalence increasing in these age ranges.
- Smoking and Lifestyle Factors: A majority report “Never smoked,” with low e-cigarette usage. Tetanus vaccination records are mixed, suggesting a need for improved awareness in preventive care. Racial distribution is mostly “White, Non-Hispanic,” with smaller segments representing other groups.
- Self-Reported Health and Health Conditions: Conditions like asthma, COPD, and depressive disorders are associated with poorer self-rated health, as are functional limitations (e.g., difficulty walking, dressing).
Model Evaluation
- Best Model: Achieved an accuracy of 94.74% and an ROC AUC of 0.8861 with 111 coefficients, indicating strong performance and predictive robustness. Key variables include:
- Geographic Influence: States such as South Dakota (0.5115, p = 0.0001) and Maine (0.4423, p = 0.0004) show higher probabilities of positive outcomes.
- Age Factor: Age is a strong predictor, with the oldest groups (80+ with a coefficient of 2.4317, p < 0.0001) having the highest likelihoods.
- Health Conditions: Variables like HadAngina (2.4228, p < 0.0001) and HadStroke (0.8598, p < 0.0001) show strong positive associations.
- Lifestyle and Preventive Measures: Smoking status and flu vaccination last year show negative associations, while chest scans and pneumococcal vaccine uptake contribute positively.
The best model demonstrates high predictive accuracy and effectively highlights associations between age, health conditions, and preventive measures with health outcomes. Its reliability and robust feature set make it suitable for population health monitoring and public health interventions.
- Address Class Imbalance: Implement resampling or algorithmic techniques to mitigate class imbalance.
- Feature Engineering: Investigate potential interactions among demographic and health variables.
- External Validation: Validate the model with external datasets for broader applicability.
- Public Health Deployment: Deploy the model within health systems for monitoring and intervention support.
The development of this model underscores the value of health data analytics in public health strategies. By addressing data challenges and refining feature selection, this model holds promise for enhancing resource allocation, preventive care efforts, and health outcome predictions across diverse populations.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
import statsmodels.formula.api as smf
df = pd.read_excel('Patients Data ( Used for Heart Disease Prediction ).xlsx')
df.shape
(237630, 35)
df.dtypes
PatientID int64
State object
Sex object
GeneralHealth object
AgeCategory object
HeightInMeters float64
WeightInKilograms float64
BMI float64
HadHeartAttack int64
HadAngina int64
HadStroke int64
HadAsthma int64
HadSkinCancer int64
HadCOPD int64
HadDepressiveDisorder int64
HadKidneyDisease int64
HadArthritis int64
HadDiabetes object
DeafOrHardOfHearing int64
BlindOrVisionDifficulty int64
DifficultyConcentrating int64
DifficultyWalking int64
DifficultyDressingBathing int64
DifficultyErrands int64
SmokerStatus object
ECigaretteUsage object
ChestScan int64
RaceEthnicityCategory object
AlcoholDrinkers int64
HIVTesting int64
FluVaxLast12 int64
PneumoVaxEver int64
TetanusLast10Tdap object
HighRiskLastYear int64
CovidPos int64
dtype: object
_ = [print(f'{df[column].value_counts()}\n') for column in df.columns]
PatientID
1 1
158425 1
158413 1
158414 1
158415 1
..
79215 1
79216 1
79217 1
79218 1
237630 1
Name: count, Length: 237630, dtype: int64
State
Washington 14241
Maryland 8817
Minnesota 8712
Ohio 8700
New York 8625
Texas 7267
Florida 7124
Kansas 6000
Wisconsin 5890
Maine 5709
Iowa 5492
Indiana 5393
South Carolina 5360
Virginia 5358
Arizona 5302
Hawaii 5262
Utah 5212
Michigan 5206
Massachusetts 5164
Nebraska 5008
Colorado 4973
Georgia 4860
California 4801
Connecticut 4765
Vermont 4569
South Dakota 4280
Montana 4155
Missouri 4042
New Jersey 3833
New Hampshire 3564
Puerto Rico 3550
Idaho 3394
Alaska 3100
Rhode Island 3006
Louisiana 2958
Oregon 2907
Oklahoma 2899
West Virginia 2898
New Mexico 2894
Arkansas 2881
Tennessee 2659
Pennsylvania 2623
North Carolina 2531
Illinois 2516
North Dakota 2435
Mississippi 2416
Kentucky 2381
Wyoming 2345
Delaware 2079
Alabama 1880
Nevada 1693
District of Columbia 1648
Guam 1521
Virgin Islands 732
Name: count, dtype: int64
Sex
Female 123293
Male 114337
Name: count, dtype: int64
GeneralHealth
Very good 83520
Good 74950
Excellent 39911
Fair 29965
Poor 9284
Name: count, dtype: int64
AgeCategory
Age 65 to 69 27547
Age 60 to 64 25685
Age 70 to 74 24946
Age 55 to 59 21422
Age 50 to 54 19154
Age 75 to 79 17679
Age 80 or older 17544
Age 40 to 44 16228
Age 45 to 49 16095
Age 35 to 39 14982
Age 30 to 34 12825
Age 18 to 24 12777
Age 25 to 29 10746
Name: count, dtype: int64
HeightInMeters
1.68 20909
1.63 20104
1.70 19356
1.78 18828
1.65 18197
...
1.92 1
1.44 1
1.18 1
1.09 1
2.21 1
Name: count, Length: 101, dtype: int64
WeightInKilograms
90.720001 12677
81.650002 11603
68.040001 10026
72.570000 9987
77.110001 9370
...
166.470001 1
205.020004 1
178.259995 1
205.479996 1
30.389999 1
Name: count, Length: 513, dtype: int64
BMI
26.629999 2612
27.459999 1965
27.440001 1896
24.410000 1861
27.120001 1831
...
27.420000 1
20.570000 1
29.219999 1
52.340000 1
45.279999 1
Name: count, Length: 3503, dtype: int64
HadHeartAttack
0 224429
1 13201
Name: count, dtype: int64
HadAngina
0 223013
1 14617
Name: count, dtype: int64
HadStroke
0 227702
1 9928
Name: count, dtype: int64
HadAsthma
0 202338
1 35292
Name: count, dtype: int64
HadSkinCancer
0 217378
1 20252
Name: count, dtype: int64
HadCOPD
0 219028
1 18602
Name: count, dtype: int64
HadDepressiveDisorder
0 188734
1 48896
Name: count, dtype: int64
HadKidneyDisease
0 226601
1 11029
Name: count, dtype: int64
HadArthritis
0 155281
1 82349
Name: count, dtype: int64
HadDiabetes
No 197463
Yes 33055
No, pre-diabetes or borderline diabetes 5211
Yes, but only during pregnancy (female) 1901
Name: count, dtype: int64
DeafOrHardOfHearing
0 217090
1 20540
Name: count, dtype: int64
BlindOrVisionDifficulty
0 225643
1 11987
Name: count, dtype: int64
DifficultyConcentrating
0 212129
1 25501
Name: count, dtype: int64
DifficultyWalking
0 202239
1 35391
Name: count, dtype: int64
DifficultyDressingBathing
0 229426
1 8204
Name: count, dtype: int64
DifficultyErrands
0 221574
1 16056
Name: count, dtype: int64
SmokerStatus
Never smoked 142390
Former smoker 66193
Current smoker - now smokes every day 21148
Current smoker - now smokes some days 7899
Name: count, dtype: int64
ECigaretteUsage
Never used e-cigarettes in my entire life 183446
Not at all (right now) 41963
Use them some days 6468
Use them every day 5753
Name: count, dtype: int64
ChestScan
0 136176
1 101454
Name: count, dtype: int64
RaceEthnicityCategory
White only, Non-Hispanic 179369
Hispanic 22023
Black only, Non-Hispanic 19053
Other race only, Non-Hispanic 11802
Multiracial, Non-Hispanic 5383
Name: count, dtype: int64
AlcoholDrinkers
1 129576
0 108054
Name: count, dtype: int64
HIVTesting
0 156195
1 81435
Name: count, dtype: int64
FluVaxLast12
1 126397
0 111233
Name: count, dtype: int64
PneumoVaxEver
0 140885
1 96745
Name: count, dtype: int64
TetanusLast10Tdap
No, did not receive any tetanus shot in the past 10 years 79370
Yes, received tetanus shot but not sure what type 71538
Yes, received Tdap 67418
Yes, received tetanus shot, but not Tdap 19304
Name: count, dtype: int64
HighRiskLastYear
0 227454
1 10176
Name: count, dtype: int64
CovidPos
0 167306
1 70324
Name: count, dtype: int64
df.nunique()
PatientID 237630
State 54
Sex 2
GeneralHealth 5
AgeCategory 13
HeightInMeters 101
WeightInKilograms 513
BMI 3503
HadHeartAttack 2
HadAngina 2
HadStroke 2
HadAsthma 2
HadSkinCancer 2
HadCOPD 2
HadDepressiveDisorder 2
HadKidneyDisease 2
HadArthritis 2
HadDiabetes 4
DeafOrHardOfHearing 2
BlindOrVisionDifficulty 2
DifficultyConcentrating 2
DifficultyWalking 2
DifficultyDressingBathing 2
DifficultyErrands 2
SmokerStatus 4
ECigaretteUsage 4
ChestScan 2
RaceEthnicityCategory 5
AlcoholDrinkers 2
HIVTesting 2
FluVaxLast12 2
PneumoVaxEver 2
TetanusLast10Tdap 4
HighRiskLastYear 2
CovidPos 2
dtype: int64
df.drop(['PatientID'],
axis=1,
inplace=True)
df.isna().sum()
State 0
Sex 0
GeneralHealth 0
AgeCategory 0
HeightInMeters 0
WeightInKilograms 0
BMI 0
HadHeartAttack 0
HadAngina 0
HadStroke 0
HadAsthma 0
HadSkinCancer 0
HadCOPD 0
HadDepressiveDisorder 0
HadKidneyDisease 0
HadArthritis 0
HadDiabetes 0
DeafOrHardOfHearing 0
BlindOrVisionDifficulty 0
DifficultyConcentrating 0
DifficultyWalking 0
DifficultyDressingBathing 0
DifficultyErrands 0
SmokerStatus 0
ECigaretteUsage 0
ChestScan 0
RaceEthnicityCategory 0
AlcoholDrinkers 0
HIVTesting 0
FluVaxLast12 0
PneumoVaxEver 0
TetanusLast10Tdap 0
HighRiskLastYear 0
CovidPos 0
dtype: int64
cat_input_vars = ['State',
'Sex',
'GeneralHealth',
'AgeCategory',
'HadDiabetes',
'SmokerStatus',
'ECigaretteUsage',
'RaceEthnicityCategory',
'TetanusLast10Tdap']
cont_input_vars = ['HeightInMeters',
'WeightInKilograms',
'BMI',
'HadAngina',
'HadStroke',
'HadAsthma',
'HadSkinCancer',
'HadCOPD',
'HadDepressiveDisorder',
'HadKidneyDisease',
'HadArthritis',
'DeafOrHardOfHearing',
'BlindOrVisionDifficulty',
'DifficultyConcentrating',
'DifficultyWalking',
'DifficultyDressingBathing',
'DifficultyErrands',
'ChestScan',
'AlcoholDrinkers',
'HIVTesting',
'FluVaxLast12',
'PneumoVaxEver',
'HighRiskLastYear',
'CovidPos']
target= 'HadHeartAttack'
- The dataset reveals expected patterns in both continuous and binary variables. Continuous variables like HeightInMeters and WeightInKilograms follow typical human distributions, with Height being nearly normal and Weight slightly skewed towards higher values; BMI shows a right skew, indicating a prevalence of healthy to overweight individuals with fewer in higher BMI ranges. Most binary health conditions, including HadAngina, HadStroke, and others, are rare, with the majority of individuals unaffected, suggesting a generally healthy population with occasional cases. However, conditions like HadArthritis, AlcoholDrinkers, and ChestScan appear more evenly distributed, reflecting higher prevalence. Similarly, indicators such as FluVaxLast12, PneumoVaxEver, HighRiskLastYear, and CovidPos show that most individuals did not experience these events, though a small proportion did, indicating occasional positive cases within the population. Overall, the data suggests a generally healthy population, with specific health issues and vaccine uptake showing variability.
for cont in cont_input_vars:
sns.displot(data=df,
x=cont,
kind='hist',
bins=20,
kde=True,
aspect=2.0,
height=6)
plt.xticks(rotation=90)
plt.title(f'Marginal Distribution of {cont}')
plt.show()
- The categorical data visualizations reveal various demographic and health-related insights. The “Count of State” chart indicates a wide geographic distribution of respondents, with higher counts in states like California and Texas. Gender distribution between “Male” and “Female” is nearly equal, showing balanced representation. General health ratings lean towards “Very Good” and “Good,” suggesting a largely healthy population, with fewer individuals reporting “Poor” health. Age categories are well-distributed, with higher counts in middle-aged and senior groups. In the “HadDiabetes” category, the majority have not been diagnosed, though there is a notable segment with diabetes. Smoking status shows a high proportion of “Never smoked” individuals, followed by “Former smokers,” while e-cigarette usage remains relatively low. The “RaceEthnicityCategory” chart is predominantly “White only, Non-Hispanic,” with smaller representations of other racial groups. The “TetanusLast10Tdap” data reveals that many respondents either did not receive a recent shot or are unsure about the type received, highlighting potential gaps in preventive care awareness.
for cat in cat_input_vars:
sns.catplot(
data=df,
x=cat,
kind='count',
height=6,
aspect=2.0,
legend=False
)
plt.xticks(rotation=90)
plt.title(f'Count of {cat}')
plt.show()
- The visualizations provide a comprehensive analysis of demographic distributions, health conditions, and lifestyle factors across multiple categories, revealing significant geographic, gender, age, racial, and health-related patterns. State-level data shows variation in respondent counts, with states like Washington and Texas having higher numbers. Gender distribution is balanced, and general health ratings are mostly “Very Good” or “Good,” with fewer respondents reporting “Poor” health. Age distribution skews towards older adults, particularly those aged 60-79, and diabetes prevalence increases with age, especially in older groups. Smoking data indicates a majority of respondents have “Never smoked,” while e-cigarette usage remains low. Racially, “White, Non-Hispanic” is the most represented group, with smaller segments from other racial and ethnic backgrounds. Tetanus vaccination records vary, showing a mix of Tdap recipients, other shot types, or no recent shots, suggesting potential gaps in preventive care awareness. Overall, the data points to a generally healthy population with diverse lifestyle habits and health awareness, alongside observed disparities across demographic segments, particularly in health status, diabetes prevalence, and preventive care measures.
pairwise_comparisons = list(itertools.combinations(cat_input_vars, 2))
for cat1, cat2 in pairwise_comparisons:
sns.catplot(data=df,
x=cat1,
hue=cat2,
kind='count',
aspect=2,
height=6)
plt.title(f'{cat1} vs. {cat2}')
plt.xticks(rotation=90)
plt.show()
pairwise_comparisons = list(itertools.combinations(cat_input_vars, 2))
for cat1, cat2 in pairwise_comparisons:
crosstab = pd.crosstab(df[cat1], df[cat2])
plt.figure(figsize=(12, 10))
sns.heatmap(crosstab,
annot=True,
annot_kws={'size': 10},
fmt='d',
cbar=False)
plt.title(f'{cat1} vs. {cat2}')
plt.xticks(rotation=90)
plt.show()
Categorical-to-Continuous Relationships or Conditional Distributions: Box Plots, Violin Plots and Point Plots
- The visualizations illustrate relationships between health conditions, demographic attributes, and self-reported health across U.S. states. Depressive disorder and COVID-19 positivity rates show consistent prevalence across states, while depressive disorders are notably more frequent among females and younger adults (ages 25-39) than in older age groups. Health conditions such as asthma, COPD, and depression are strongly associated with poor self-rated health, as are functional difficulties (concentration, walking, dressing, and errands). In contrast, lifestyle factors like chest scans and alcohol consumption are evenly distributed across health categories, indicating no strong association with self-rated health. The data also reveal state-level variations in preventive care measures (flu and tetanus vaccinations), with differences across racial and ethnic groups, highlighting disparities in high-risk behaviors and health outcomes. This comprehensive overview emphasizes complex associations between demographic factors, preventive care, and health outcomes, pinpointing gaps and areas for improvement in health awareness and care access across diverse U.S. populations.
for cat_input_var in cat_input_vars:
for cont_input_var in cont_input_vars:
sns.catplot(data=df,
x=cat_input_var,
y=cont_input_var,
kind='box',
aspect=2.5,
height=5)
plt.title(f'{cont_input_var} vs. {cat_input_var}')
plt.xticks(rotation=90)
plt.show()
for cat_input_var in cat_input_vars:
for cont_input_var in cont_input_vars:
sns.catplot(data=df,
x=cat_input_var,
y=cont_input_var,
kind='violin',
aspect=2.5,
height=5)
plt.title(f'{cont_input_var} vs. {cat_input_var}')
plt.xticks(rotation=90)
plt.show()
for cat_input_var in cat_input_vars:
for cont_input_var in cont_input_vars:
sns.catplot(data=df,
x=cat_input_var,
y=cont_input_var,
kind='point',
linestyles='',
aspect=2.5,
height=5)
plt.title(f'{cont_input_var} vs. {cat_input_var}')
plt.xticks(rotation=90)
plt.show()
- The correlation matrix reveals relationships between health conditions, lifestyle factors, and demographic attributes, with a strong positive correlation between height and weight (0.47), both also moderately associated with BMI (weight at 0.40). Depressive disorders show correlations with difficulty concentrating (0.34), difficulty walking (0.24), and difficulty with errands (0.23), highlighting the link between mental and physical limitations. Chronic conditions like arthritis, asthma, and COPD correlate with physical difficulties, such as arthritis with difficulty walking (0.32) and difficulty dressing (0.38), indicating that these conditions impact daily activities. Depressive disorders and kidney disease have a moderate correlation (0.20), suggesting a potential link between mental health and overall health decline. Preventive health behaviors, including HIV testing and flu vaccination, show low correlations with most health conditions, suggesting these are applied broadly across health statuses. Alcohol consumption has minimal correlations with other variables, indicating no significant associations with health conditions or physical metrics. Overall, the matrix underscores significant associations between mental and physical health challenges and daily functional difficulties, particularly with chronic conditions affecting mobility and daily tasks, while preventive measures and lifestyle factors like vaccination and alcohol use show independence from these health conditions.
fig, ax = plt.subplots(figsize=(12, 10))
sns.heatmap(data=df.drop(columns=[target]).\
corr(numeric_only=True),
vmin=-1,
vmax=1,
center=0,
cbar=False,
annot=True,
annot_kws={'size': 7},
fmt='.2f',
ax=ax)
plt.show()
formula_additive = """
HadHeartAttack ~
C(State) +
C(Sex) +
C(GeneralHealth) +
C(AgeCategory) +
C(HadDiabetes) +
C(SmokerStatus) +
C(ECigaretteUsage) +
C(RaceEthnicityCategory) +
C(TetanusLast10Tdap) +
HeightInMeters +
WeightInKilograms +
BMI +
HadAngina +
HadStroke +
HadAsthma +
HadSkinCancer +
HadCOPD +
HadDepressiveDisorder +
HadKidneyDisease +
HadArthritis +
DeafOrHardOfHearing +
BlindOrVisionDifficulty +
DifficultyConcentrating +
DifficultyWalking +
DifficultyDressingBathing +
DifficultyErrands +
ChestScan +
AlcoholDrinkers +
HIVTesting +
FluVaxLast12 +
PneumoVaxEver +
HighRiskLastYear +
CovidPos
"""
formula_additive_and_interactive = """
HadHeartAttack ~
C(State) +
C(Sex) * C(AgeCategory) +
C(GeneralHealth) +
C(HadDiabetes) * C(SmokerStatus) +
C(ECigaretteUsage) +
C(RaceEthnicityCategory) +
C(TetanusLast10Tdap) +
HeightInMeters +
WeightInKilograms +
BMI +
HadAngina +
HadStroke +
HadAsthma +
HadSkinCancer +
HadCOPD +
HadDepressiveDisorder +
HadKidneyDisease +
HadArthritis +
DeafOrHardOfHearing +
BlindOrVisionDifficulty +
DifficultyConcentrating +
DifficultyWalking +
DifficultyDressingBathing +
DifficultyErrands +
ChestScan +
AlcoholDrinkers +
HIVTesting +
FluVaxLast12 +
PneumoVaxEver +
HighRiskLastYear +
CovidPos
"""
formulas = [formula_additive,
formula_additive_and_interactive]
from sklearn.model_selection import StratifiedKFold
kf = StratifiedKFold(n_splits=5,
shuffle=True,
random_state=101)
input_names = df.drop(columns=[target]).\
copy().\
columns.\
to_list()
output_name = target
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
def my_coefplot(model, figsize_default=(10, 4), figsize_expansion_factor=0.5, max_default_vars=10):
"""
Function that plots a coefficient plot with error bars for a given statistical model
and prints out which variables are statistically significant and whether they are positive or negative.
The graph height dynamically adjusts based on the number of variables.
Params:
model: object
figsize_default: tuple, optional
figsize_expansion_factor: float, optional
max_default_vars: int, optional
"""
# cap the standard errors (bse) to avoid overly large error bars, upper bound set to 2
capped_bse = model.bse.clip(upper=2)
# calculate the minimum and maximum coefficient values adjusted by the standard errors
coef_min = (model.params - 2 * capped_bse).min()
coef_max = (model.params + 2 * capped_bse).max()
# define buffer space for the x-axis limits
buffer_space = 0.5
xlim_min = coef_min - buffer_space
xlim_max = coef_max + buffer_space
# dynamically calculate figure height based on the number of variables
num_vars = len(model.params)
if num_vars > max_default_vars:
height = figsize_default[1] + figsize_expansion_factor * (num_vars - max_default_vars)
else:
height = figsize_default[1]
# create the plot
fig, ax = plt.subplots(figsize=(figsize_default[0], height))
# identify statistically significant and non-significant variables based on p-values
significant_vars = model.pvalues[model.pvalues < 0.05].index
not_significant_vars = model.pvalues[model.pvalues >= 0.05].index
# plot non-significant variables with grey error bars
ax.errorbar(y=not_significant_vars,
x=model.params[not_significant_vars],
xerr=2 * capped_bse[not_significant_vars],
fmt='o',
color='grey',
ecolor='grey',
elinewidth=2,
ms=10,
label='not significant')
# plot significant variables with red error bars
ax.errorbar(y=significant_vars,
x=model.params[significant_vars],
xerr=2 * capped_bse[significant_vars],
fmt='o',
color='red',
ecolor='red',
elinewidth=2,
ms=10,
label='significant (p < 0.05)')
# add a vertical line at 0 to visually separate positive and negative coefficients
ax.axvline(x=0, linestyle='--', linewidth=2.5, color='grey')
# adjust the x-axis limits to add some buffer space on either side
ax.set_xlim(min(-0.5, coef_min - 0.2), max(0.5, coef_max + 0.2))
ax.set_xlabel('coefficient value')
# add legend to distinguish between significant and non-significant variables
ax.legend()
# show the plot
plt.show()
# print the summary of statistically significant variables
print('\n--- statistically significant variables ---')
# check if there are any significant variables, if not, print a message
if significant_vars.empty:
print('No statistically significant variables found.')
else:
# for each significant variable, print its coefficient, standard error, p-value, and direction
for var in significant_vars:
coef_value = model.params[var]
std_err = model.bse[var]
p_val = model.pvalues[var]
direction = 'positive' if coef_value > 0 else 'negative'
print(f'variable: {var}, coefficient: {coef_value:.4f}, std err: {std_err:.4f}, p-value: {p_val:.4f}, direction: {direction}')
def train_and_test_logistic_with_cv(model, formula, df, x_names, y_name, cv, threshold=0.5, use_scaler=True):
"""
Function to train and test a logistic binary classification model with Cross-Validation,
including accuracy and ROC AUC score calculations.
Params:
model: object
formula: str
df: object
x_names: list
y_name: str
cv: object
threshold: float, optional
use_scaler: bool, optional
Returns:
object
"""
# separate the inputs and output
input_df = df.loc[:, x_names].copy()
# initialize the performance metric storage lists
train_res = []
test_res = []
train_auc_scores = []
test_auc_scores = []
# split the data and iterate over the folds
for train_id, test_id in cv.split(input_df.to_numpy(), df[y_name].to_numpy()):
# subset the training and test splits within each fold
train_data = df.iloc[train_id, :].copy()
test_data = df.iloc[test_id, :].copy()
# if the use_scaler flag is set, standardize the numeric features within each fold
if use_scaler:
scaler = StandardScaler()
# identify numeric columns to scale, excluding the target variable
columns_to_scale = train_data.select_dtypes(include=[np.number]).columns.tolist()
columns_to_scale = [col for col in columns_to_scale if col != y_name]
# fit scaler on training data
scaler.fit(train_data[columns_to_scale])
# transform training and test data
train_data[columns_to_scale] = scaler.transform(train_data[columns_to_scale])
test_data[columns_to_scale] = scaler.transform(test_data[columns_to_scale])
# fit the model on the training data within the current fold
a_model = smf.logit(formula=formula, data=train_data).fit()
# predict the training within each fold
train_copy = train_data.copy()
train_copy['pred_probability'] = a_model.predict(train_data)
train_copy['pred_class'] = np.where(train_copy.pred_probability > threshold, 1, 0)
# predict the testing within each fold
test_copy = test_data.copy()
test_copy['pred_probability'] = a_model.predict(test_data)
test_copy['pred_class'] = np.where(test_copy.pred_probability > threshold, 1, 0)
# calculate the performance metric (accuracy) on the training set within the fold
train_res.append(np.mean(train_copy[y_name] == train_copy.pred_class))
# calculate the performance metric (accuracy) on the testing set within the fold
test_res.append(np.mean(test_copy[y_name] == test_copy.pred_class))
# calculate the roc_auc_score for the training set
train_auc_scores.append(roc_auc_score(train_copy[y_name], train_copy['pred_probability']))
# calculate the roc_auc_score for the testing_set
test_auc_scores.append(roc_auc_score(test_copy[y_name], test_copy['pred_probability']))
# book keeping to store the results (accuracy)
train_df = pd.DataFrame({'accuracy': train_res, 'roc_auc': train_auc_scores})
train_df['from_set'] = 'training'
train_df['fold_id'] = train_df.index + 1
test_df = pd.DataFrame({'accuracy': test_res, 'roc_auc': test_auc_scores})
test_df['from_set'] = 'testing'
test_df['fold_id'] = test_df.index + 1
# combine the splits together
res_df = pd.concat([train_df, test_df], ignore_index=True)
# add information about the model
res_df['model'] = model
res_df['formula'] = formula
res_df['num_coefs'] = len(a_model.params)
res_df['threshold'] = threshold
# return the results DataFrame
return res_df
import os
import contextlib
res_list = []
error_log = []
with contextlib.redirect_stdout(open(os.devnull, 'w')), contextlib.redirect_stderr(open(os.devnull, 'w')):
for model in range(len(formulas)):
try:
res_list.append(train_and_test_logistic_with_cv(model,
formula=formulas[model],
df=df,
x_names=input_names,
y_name=output_name,
cv=kf))
except Exception as e:
error_log.append(f'Formula ID {model} failed: {str(e)}')
cv_results = pd.concat(res_list, ignore_index=True)
cv_results
accuracy | roc_auc | from_set | fold_id | model | formula | num_coefs | threshold | |
---|---|---|---|---|---|---|---|---|
0 | 0.947802 | 0.888335 | training | 1 | 0 | \n HadHeartAttack ~ \n C(State) + \n ... | 111 | 0.5 |
1 | 0.947592 | 0.888224 | training | 2 | 0 | \n HadHeartAttack ~ \n C(State) + \n ... | 111 | 0.5 |
2 | 0.947665 | 0.887070 | training | 3 | 0 | \n HadHeartAttack ~ \n C(State) + \n ... | 111 | 0.5 |
3 | 0.947534 | 0.886954 | training | 4 | 0 | \n HadHeartAttack ~ \n C(State) + \n ... | 111 | 0.5 |
4 | 0.947481 | 0.886485 | training | 5 | 0 | \n HadHeartAttack ~ \n C(State) + \n ... | 111 | 0.5 |
5 | 0.947229 | 0.881833 | testing | 1 | 0 | \n HadHeartAttack ~ \n C(State) + \n ... | 111 | 0.5 |
6 | 0.947587 | 0.883013 | testing | 2 | 0 | \n HadHeartAttack ~ \n C(State) + \n ... | 111 | 0.5 |
7 | 0.947292 | 0.886866 | testing | 3 | 0 | \n HadHeartAttack ~ \n C(State) + \n ... | 111 | 0.5 |
8 | 0.946892 | 0.888745 | testing | 4 | 0 | \n HadHeartAttack ~ \n C(State) + \n ... | 111 | 0.5 |
9 | 0.947860 | 0.889945 | testing | 5 | 0 | \n HadHeartAttack ~ \n C(State) + \n ... | 111 | 0.5 |
10 | 0.947823 | 0.888659 | training | 1 | 1 | \n HadHeartAttack ~\n C(State) +\n C(... | 132 | 0.5 |
11 | 0.947597 | 0.888564 | training | 2 | 1 | \n HadHeartAttack ~\n C(State) +\n C(... | 132 | 0.5 |
12 | 0.947729 | 0.887387 | training | 3 | 1 | \n HadHeartAttack ~\n C(State) +\n C(... | 132 | 0.5 |
13 | 0.947502 | 0.887271 | training | 4 | 1 | \n HadHeartAttack ~\n C(State) +\n C(... | 132 | 0.5 |
14 | 0.947597 | 0.886813 | training | 5 | 1 | \n HadHeartAttack ~\n C(State) +\n C(... | 132 | 0.5 |
15 | 0.947187 | 0.881816 | testing | 1 | 1 | \n HadHeartAttack ~\n C(State) +\n C(... | 132 | 0.5 |
16 | 0.947439 | 0.882991 | testing | 2 | 1 | \n HadHeartAttack ~\n C(State) +\n C(... | 132 | 0.5 |
17 | 0.947292 | 0.887122 | testing | 3 | 1 | \n HadHeartAttack ~\n C(State) +\n C(... | 132 | 0.5 |
18 | 0.947208 | 0.888707 | testing | 4 | 1 | \n HadHeartAttack ~\n C(State) +\n C(... | 132 | 0.5 |
19 | 0.948113 | 0.889904 | testing | 5 | 1 | \n HadHeartAttack ~\n C(State) +\n C(... | 132 | 0.5 |
cv_results.loc[(cv_results['from_set'] == 'testing') &
(cv_results['accuracy'] < 1.0) &
(cv_results['roc_auc'] < 1.0)].\
groupby('model').\
aggregate({'accuracy': 'mean',
'roc_auc': 'mean',
'num_coefs': 'first'}).\
reset_index().\
sort_values(by='accuracy', ascending=False)
model | accuracy | roc_auc | num_coefs | |
---|---|---|---|---|
1 | 1 | 0.947448 | 0.886108 | 132 |
0 | 0 | 0.947372 | 0.886080 | 111 |
- The best logistic regression model achieves an accuracy of 0.9474 and an ROC AUC of 0.8861, with 111 coefficients, indicating strong predictive performance and robustness across multiple predictors. The intercept is highly negative (-6.1308, p < 0.0001), suggesting a low baseline probability. Among states, South Dakota (0.5115, p = 0.0001), Maine (0.4423, p = 0.0004), and several others show positive coefficients, reflecting higher probabilities in these locations. Age is a strong positive predictor, with coefficients increasing progressively for older groups, notably ages 80+ (2.4317, p < 0.0001) and 75-79 (2.2763, p < 0.0001), underscoring age as a crucial factor. General health ratings also correlate, with “Poor” health having a substantial positive effect (1.0895, p < 0.0001), followed by “Fair” (0.9304, p < 0.0001). Health conditions like angina (2.4228, p < 0.0001), stroke (0.8598, p < 0.0001), and diabetes (0.3343, p < 0.0001) contribute positively, highlighting their association with the outcome. Difficulty-related variables, such as difficulty walking (0.0897, p = 0.0015) and concentrating (0.0697, p = 0.0346), show moderate positive correlations. Smoking status has a negative impact, with “Former smoker” (-0.2530, p < 0.0001) and “Never smoked” (-0.5112, p < 0.0001) associated with lower probabilities, while e-cigarette usage “Not at all (right now)” (0.0570, p = 0.0484) has a positive influence. Preventive measures show mixed effects, with flu vaccination last year (-0.1420, p < 0.0001) and tetanus shot (Tdap, -0.1020, p = 0.0006) negatively correlated, while pneumococcal vaccine (0.0846, p = 0.0010) is positive. Health behaviors, like chest scans (0.5830, p < 0.0001), indicate an association, while alcohol consumption shows a negative correlation (-0.2104, p < 0.0001). This model suggests that older age, poor health, and specific chronic conditions or disabilities strongly increase the likelihood of the outcome, with some geographic and lifestyle factors contributing as well.
best_model = smf.logit(formula=formulas[0],
data=df).fit()
Optimization terminated successfully.
Current function value: 0.147331
Iterations 10
best_model.params
Intercept -6.130787
C(State)[T.Alaska] 0.265161
C(State)[T.Arizona] 0.310066
C(State)[T.Arkansas] 0.253471
C(State)[T.California] 0.189936
...
HIVTesting 0.065535
FluVaxLast12 -0.142049
PneumoVaxEver 0.084587
HighRiskLastYear 0.100571
CovidPos 0.027686
Length: 111, dtype: float64
best_model.pvalues < 0.05
Intercept True
C(State)[T.Alaska] False
C(State)[T.Arizona] True
C(State)[T.Arkansas] False
C(State)[T.California] False
...
HIVTesting True
FluVaxLast12 True
PneumoVaxEver True
HighRiskLastYear False
CovidPos False
Length: 111, dtype: bool
best_model.params[best_model.pvalues < 0.05].sort_values(ascending=False)
C(AgeCategory)[T.Age 80 or older] 2.431717
HadAngina 2.422814
C(AgeCategory)[T.Age 75 to 79] 2.276268
C(AgeCategory)[T.Age 70 to 74] 2.150908
C(AgeCategory)[T.Age 65 to 69] 2.036877
C(AgeCategory)[T.Age 60 to 64] 1.869937
C(AgeCategory)[T.Age 55 to 59] 1.779459
C(AgeCategory)[T.Age 50 to 54] 1.572669
C(AgeCategory)[T.Age 45 to 49] 1.383461
C(GeneralHealth)[T.Poor] 1.089501
C(GeneralHealth)[T.Fair] 0.930421
C(AgeCategory)[T.Age 40 to 44] 0.906325
HadStroke 0.859761
C(AgeCategory)[T.Age 35 to 39] 0.732504
C(Sex)[T.Male] 0.696715
C(GeneralHealth)[T.Good] 0.680010
ChestScan 0.583000
C(State)[T.South Dakota] 0.511517
C(State)[T.Maine] 0.442309
C(AgeCategory)[T.Age 30 to 34] 0.423445
C(RaceEthnicityCategory)[T.Multiracial, Non-Hispanic] 0.407343
C(State)[T.New Hampshire] 0.341477
C(GeneralHealth)[T.Very good] 0.338416
C(HadDiabetes)[T.Yes] 0.334325
C(HadDiabetes)[T.Yes, but only during pregnancy (female)] 0.333506
C(State)[T.Nebraska] 0.332872
C(State)[T.Idaho] 0.328569
C(State)[T.Colorado] 0.324521
C(State)[T.Vermont] 0.311294
C(State)[T.Massachusetts] 0.310719
C(State)[T.Arizona] 0.310066
C(RaceEthnicityCategory)[T.Other race only, Non-Hispanic] 0.309601
C(State)[T.Montana] 0.292848
C(State)[T.New Mexico] 0.280003
C(State)[T.Ohio] 0.277925
C(State)[T.Florida] 0.265763
C(RaceEthnicityCategory)[T.Hispanic] 0.207655
BlindOrVisionDifficulty 0.132919
C(RaceEthnicityCategory)[T.White only, Non-Hispanic] 0.124063
DifficultyWalking 0.089650
DifficultyErrands 0.089172
PneumoVaxEver 0.084587
HadKidneyDisease 0.069949
DifficultyConcentrating 0.069724
HIVTesting 0.065535
C(ECigaretteUsage)[T.Not at all (right now)] 0.057042
C(TetanusLast10Tdap)[T.Yes, received Tdap] -0.101976
HadSkinCancer -0.124903
FluVaxLast12 -0.142049
AlcoholDrinkers -0.210428
C(SmokerStatus)[T.Former smoker] -0.253021
C(SmokerStatus)[T.Never smoked] -0.511246
Intercept -6.130787
dtype: float64
my_coefplot(best_model)
--- statistically significant variables ---
variable: Intercept, coefficient: -6.1308, std err: 0.7241, p-value: 0.0000, direction: negative
variable: C(State)[T.Arizona], coefficient: 0.3101, std err: 0.1275, p-value: 0.0150, direction: positive
variable: C(State)[T.Colorado], coefficient: 0.3245, std err: 0.1362, p-value: 0.0172, direction: positive
variable: C(State)[T.Florida], coefficient: 0.2658, std err: 0.1220, p-value: 0.0294, direction: positive
variable: C(State)[T.Idaho], coefficient: 0.3286, std err: 0.1433, p-value: 0.0219, direction: positive
variable: C(State)[T.Maine], coefficient: 0.4423, std err: 0.1259, p-value: 0.0004, direction: positive
variable: C(State)[T.Massachusetts], coefficient: 0.3107, std err: 0.1338, p-value: 0.0202, direction: positive
variable: C(State)[T.Montana], coefficient: 0.2928, std err: 0.1353, p-value: 0.0304, direction: positive
variable: C(State)[T.Nebraska], coefficient: 0.3329, std err: 0.1288, p-value: 0.0098, direction: positive
variable: C(State)[T.New Hampshire], coefficient: 0.3415, std err: 0.1354, p-value: 0.0117, direction: positive
variable: C(State)[T.New Mexico], coefficient: 0.2800, std err: 0.1416, p-value: 0.0480, direction: positive
variable: C(State)[T.Ohio], coefficient: 0.2779, std err: 0.1211, p-value: 0.0217, direction: positive
variable: C(State)[T.South Dakota], coefficient: 0.5115, std err: 0.1314, p-value: 0.0001, direction: positive
variable: C(State)[T.Vermont], coefficient: 0.3113, std err: 0.1341, p-value: 0.0202, direction: positive
variable: C(Sex)[T.Male], coefficient: 0.6967, std err: 0.0309, p-value: 0.0000, direction: positive
variable: C(GeneralHealth)[T.Fair], coefficient: 0.9304, std err: 0.0513, p-value: 0.0000, direction: positive
variable: C(GeneralHealth)[T.Good], coefficient: 0.6800, std err: 0.0478, p-value: 0.0000, direction: positive
variable: C(GeneralHealth)[T.Poor], coefficient: 1.0895, std err: 0.0595, p-value: 0.0000, direction: positive
variable: C(GeneralHealth)[T.Very good], coefficient: 0.3384, std err: 0.0490, p-value: 0.0000, direction: positive
variable: C(AgeCategory)[T.Age 30 to 34], coefficient: 0.4234, std err: 0.1789, p-value: 0.0180, direction: positive
variable: C(AgeCategory)[T.Age 35 to 39], coefficient: 0.7325, std err: 0.1658, p-value: 0.0000, direction: positive
variable: C(AgeCategory)[T.Age 40 to 44], coefficient: 0.9063, std err: 0.1600, p-value: 0.0000, direction: positive
variable: C(AgeCategory)[T.Age 45 to 49], coefficient: 1.3835, std err: 0.1538, p-value: 0.0000, direction: positive
variable: C(AgeCategory)[T.Age 50 to 54], coefficient: 1.5727, std err: 0.1507, p-value: 0.0000, direction: positive
variable: C(AgeCategory)[T.Age 55 to 59], coefficient: 1.7795, std err: 0.1491, p-value: 0.0000, direction: positive
variable: C(AgeCategory)[T.Age 60 to 64], coefficient: 1.8699, std err: 0.1484, p-value: 0.0000, direction: positive
variable: C(AgeCategory)[T.Age 65 to 69], coefficient: 2.0369, std err: 0.1482, p-value: 0.0000, direction: positive
variable: C(AgeCategory)[T.Age 70 to 74], coefficient: 2.1509, std err: 0.1486, p-value: 0.0000, direction: positive
variable: C(AgeCategory)[T.Age 75 to 79], coefficient: 2.2763, std err: 0.1496, p-value: 0.0000, direction: positive
variable: C(AgeCategory)[T.Age 80 or older], coefficient: 2.4317, std err: 0.1498, p-value: 0.0000, direction: positive
variable: C(HadDiabetes)[T.Yes], coefficient: 0.3343, std err: 0.0249, p-value: 0.0000, direction: positive
variable: C(HadDiabetes)[T.Yes, but only during pregnancy (female)], coefficient: 0.3335, std err: 0.1493, p-value: 0.0255, direction: positive
variable: C(SmokerStatus)[T.Former smoker], coefficient: -0.2530, std err: 0.0359, p-value: 0.0000, direction: negative
variable: C(SmokerStatus)[T.Never smoked], coefficient: -0.5112, std err: 0.0368, p-value: 0.0000, direction: negative
variable: C(ECigaretteUsage)[T.Not at all (right now)], coefficient: 0.0570, std err: 0.0289, p-value: 0.0484, direction: positive
variable: C(RaceEthnicityCategory)[T.Hispanic], coefficient: 0.2077, std err: 0.0627, p-value: 0.0009, direction: positive
variable: C(RaceEthnicityCategory)[T.Multiracial, Non-Hispanic], coefficient: 0.4073, std err: 0.0811, p-value: 0.0000, direction: positive
variable: C(RaceEthnicityCategory)[T.Other race only, Non-Hispanic], coefficient: 0.3096, std err: 0.0691, p-value: 0.0000, direction: positive
variable: C(RaceEthnicityCategory)[T.White only, Non-Hispanic], coefficient: 0.1241, std err: 0.0440, p-value: 0.0049, direction: positive
variable: C(TetanusLast10Tdap)[T.Yes, received Tdap], coefficient: -0.1020, std err: 0.0298, p-value: 0.0006, direction: negative
variable: HadAngina, coefficient: 2.4228, std err: 0.0232, p-value: 0.0000, direction: positive
variable: HadStroke, coefficient: 0.8598, std err: 0.0314, p-value: 0.0000, direction: positive
variable: HadSkinCancer, coefficient: -0.1249, std err: 0.0323, p-value: 0.0001, direction: negative
variable: HadKidneyDisease, coefficient: 0.0699, std err: 0.0345, p-value: 0.0426, direction: positive
variable: BlindOrVisionDifficulty, coefficient: 0.1329, std err: 0.0371, p-value: 0.0003, direction: positive
variable: DifficultyConcentrating, coefficient: 0.0697, std err: 0.0330, p-value: 0.0346, direction: positive
variable: DifficultyWalking, coefficient: 0.0897, std err: 0.0282, p-value: 0.0015, direction: positive
variable: DifficultyErrands, coefficient: 0.0892, std err: 0.0371, p-value: 0.0163, direction: positive
variable: ChestScan, coefficient: 0.5830, std err: 0.0246, p-value: 0.0000, direction: positive
variable: AlcoholDrinkers, coefficient: -0.2104, std err: 0.0224, p-value: 0.0000, direction: negative
variable: HIVTesting, coefficient: 0.0655, std err: 0.0249, p-value: 0.0084, direction: positive
variable: FluVaxLast12, coefficient: -0.1420, std err: 0.0237, p-value: 0.0000, direction: negative
variable: PneumoVaxEver, coefficient: 0.0846, std err: 0.0258, p-value: 0.0010, direction: positive
import pickle
with open('patients_data_logit_model.pkl', 'wb') as file:
pickle.dump(best_model, file)
with open('patients_data_logit_model.pkl', 'rb') as file:
loaded_model = pickle.load(file)
sample_data = pd.DataFrame({
'State': ['California'],
'Sex': ['Male'],
'GeneralHealth': ['Good'],
'AgeCategory': ['Age 18 to 24'],
'HeightInMeters': [1.75],
'WeightInKilograms': [70.0],
'BMI': [22.86],
'HadHeartAttack': [0],
'HadAngina': [0],
'HadStroke': [0],
'HadAsthma': [1],
'HadSkinCancer': [0],
'HadCOPD': [0],
'HadDepressiveDisorder': [0],
'HadKidneyDisease': [0],
'HadArthritis': [0],
'HadDiabetes': ['No'],
'DeafOrHardOfHearing': [0],
'BlindOrVisionDifficulty': [0],
'DifficultyConcentrating': [0],
'DifficultyWalking': [0],
'DifficultyDressingBathing': [0],
'DifficultyErrands': [0],
'SmokerStatus': ['Never smoked'],
'ECigaretteUsage': ['Not at all (right now)'],
'ChestScan': [0],
'RaceEthnicityCategory': ['White only, Non-Hispanic'],
'AlcoholDrinkers': [1],
'HIVTesting': [0],
'FluVaxLast12': [1],
'PneumoVaxEver': [0],
'TetanusLast10Tdap': ['Yes, received Tdap'],
'HighRiskLastYear': [0],
'CovidPos': [0]
})
loaded_model.predict(sample_data)
0 0.003134
dtype: float64