This tool supports the analysis and visualization of medical datasets in Python. The project was created and tested in Jupyter Notebook.
- Data Cleaning: Handles missing values and outliers in medical datasets.
- Exploratory Data Analysis (EDA): Generates summary statistics, correlations, and visualizations.
- Custom Visualizations: Creates detailed plots such as histograms, heatmaps, and categorical plots.
- Correlation Analysis: Calculates and visualizes correlation matrices to identify relationships between variables.
- Data Validation: Filters out inconsistent data, such as cases where diastolic pressure is higher than systolic.
- Disease Analysis: Differentiates between the presence and absence of cardiovascular disease.
The analysis is based on the following parameters:
- id
- age
- sex
- height
- weight
- ap_hi (systolic blood pressure)
- ap_lo (diastolic blood pressure)
- cholesterol
- gluc (glucose level)
- smoke
- alco (alcohol consumption)
- active (physical activity)
- cardio (cardiovascular disease indicator)
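Before running the analysis, it can help to confirm that a dataset actually contains these columns. The following is a minimal sketch, assuming the CSV header uses exactly the names listed above; `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names chosen for illustration and are not part of the notebook.

```python
import pandas as pd

# Column names as listed above; assumed to match the CSV header exactly.
EXPECTED_COLUMNS = [
    "id", "age", "sex", "height", "weight", "ap_hi", "ap_lo",
    "cholesterol", "gluc", "smoke", "alco", "active", "cardio",
]


def missing_columns(df: pd.DataFrame) -> list:
    """Return the expected columns that are absent from df (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


# Made-up, in-memory rows for illustration only (not real patient data).
sample = pd.DataFrame({"id": [0], "age": [50], "ap_hi": [110], "ap_lo": [80]})
print(missing_columns(sample))
```

An empty list means every expected column is present.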
Ensure you have the following installed on your system:
- Python 3.7+
- Jupyter Notebook
- Required Python libraries: pandas, numpy, matplotlib, seaborn, scikit-learn
Install these libraries using pip:
pip install pandas numpy matplotlib seaborn scikit-learn
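To check that the installation succeeded, a short script along these lines can be run in a notebook cell. It is only a sketch: `REQUIRED_MODULES` and `missing_modules` are hypothetical names, and the check only confirms that each module can be located, not that its version is compatible.

```python
import importlib.util

# Import names for the libraries listed above. Note that scikit-learn
# is imported as "sklearn", not "scikit-learn".
REQUIRED_MODULES = ["pandas", "numpy", "matplotlib", "seaborn", "sklearn"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required libraries are available.")
```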
- Clone the repository:
git clone https://github.com/AyobamiMichael/medicaldata_analysis.git
- Navigate to the project directory:
cd medicaldata_analysis
- Open the project in Jupyter Notebook:
jupyter notebook
- Launch the main analysis notebook: medical_data_visualizer.ipynb.
- Load Data: Upload your medical dataset in .csv format.
- Clean Data:
  - Filter out patient segments where diastolic pressure (ap_lo) is higher than systolic pressure (ap_hi).
  - Remove or handle other inconsistent or missing data as needed.
- Run Analysis:
  - Perform exploratory data analysis to examine relationships between parameters.
  - Calculate the correlation matrix.
  - Generate a heatmap to visualize correlations.
  - Differentiate between the presence and absence of cardiovascular disease (cardio).
- Save Results: Export cleaned datasets and visualizations.
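The Clean Data step could be sketched as follows. This is one possible approach, not necessarily what the notebook does: it assumes rows with missing values are dropped rather than imputed, and that "outliers" means height or weight values outside a chosen percentile range. The function name `clean_records` and the default 2.5th/97.5th percentile bounds are assumptions made for illustration.

```python
import pandas as pd


def clean_records(df, lower=0.025, upper=0.975):
    """Drop incomplete, inconsistent, and outlying rows (hypothetical helper)."""
    # Assumption: rows with missing values in these columns are dropped.
    df = df.dropna(subset=["ap_hi", "ap_lo", "height", "weight"])

    # Keep only rows where diastolic pressure does not exceed systolic.
    df = df[df["ap_lo"] <= df["ap_hi"]]

    # Assumption: outliers are height/weight values outside the given
    # percentiles. Bounds for both columns are computed before filtering.
    mask = pd.Series(True, index=df.index)
    for col in ("height", "weight"):
        low, high = df[col].quantile(lower), df[col].quantile(upper)
        mask &= df[col].between(low, high)
    return df[mask]


# Made-up rows for illustration only (not real patient data).
raw = pd.DataFrame({
    "ap_hi": [120, 110, 130, None],
    "ap_lo": [80, 150, 85, 70],
    "height": [170, 165, 180, 175],
    "weight": [70.0, 65.0, 90.0, 80.0],
})

# With so few rows, use the full range so the percentile filter keeps everything.
cleaned = clean_records(raw, lower=0.0, upper=1.0)
print(cleaned)
```

Dropping rows is the simplest choice; imputing missing values instead may be preferable when the dataset is small.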
Here's a snippet of the data cleaning and analysis process:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load the dataset
data = pd.read_csv('medical_examination.csv')
# Filter out invalid data (diastolic > systolic)
data = data[data['ap_lo'] <= data['ap_hi']]
# Calculate the correlation matrix
corr_matrix = data.corr()
# Generate a heatmap
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation Matrix Heatmap')
plt.show()
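The snippet above stops at the heatmap. The remaining steps, comparing records with and without cardiovascular disease and saving results, might look like the sketch below. It assumes `cardio` is coded as 0 (absent) and 1 (present); the helper `summarize_by_cardio` and the output file names are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd


def summarize_by_cardio(df, columns):
    """Mean of each selected column for each value of cardio (hypothetical helper)."""
    return df.groupby("cardio")[columns].mean()


# Made-up rows for illustration only (not real patient data).
data = pd.DataFrame({
    "ap_hi": [120, 140, 110, 150],
    "ap_lo": [80, 90, 70, 100],
    "cardio": [0, 1, 0, 1],
})

summary = summarize_by_cardio(data, ["ap_hi", "ap_lo"])
print(summary)

# Export the cleaned data and the summary; a temporary directory is used
# here so the example does not overwrite anything.
out_dir = Path(tempfile.mkdtemp())
data.to_csv(out_dir / "cleaned_data.csv", index=False)
summary.to_csv(out_dir / "summary_by_cardio.csv")
```

A figure can be saved in the same way by calling `plt.savefig(...)` before `plt.show()`.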
Contributions are welcome! Follow these steps to contribute:
- Fork the repository.
- Create a new branch:
git checkout -b feature-name
- Commit your changes:
git commit -m "Add new feature"
- Push to the branch:
git push origin feature-name
- Submit a pull request.
This project is licensed under the MIT License.
For any inquiries or feedback, please contact:
- Author: Ayobami Michael Opefeyijimi
- Email: ayobamiwealth@gmail.com
Thank you.