This repository contains a machine learning algorithm designed to identify profiles of patients who attend Portuguese Emergency Departments (EDs), particularly patients classified as non-urgent. The goal is to improve healthcare planning by supporting Primary Health Care (PHC) and ED resource management.
Student: Mafalda Moreira
Supervisor: Francisco Couto
Co-supervisor: Patrícia Moura Rosa
This project is part of a Master’s thesis focused on improving the coordination between PHC and EDs using data-driven methods. It includes:
- Data preprocessing and filtering
- Clustering analysis with K-Means
- Evaluation using the Silhouette score, Davies-Bouldin index, and Calinski-Harabasz index
- PCA for cluster visualization
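
The steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in clustering_ED_patients.py: the data is synthetic (generated with `make_blobs`), and the feature count, the range of `k`, and the choice of the Silhouette score to pick the final `k` are all assumptions made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the confidential data (assumption: numeric features only).
X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=42)

# K-Means is distance-based, so features are standardized first.
X_scaled = StandardScaler().fit_transform(X)

# Score several candidate values of k with the three evaluation metrics.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    scores[k] = {
        "silhouette": silhouette_score(X_scaled, labels),  # higher is better
        "davies_bouldin": davies_bouldin_score(X_scaled, labels),  # lower is better
        "calinski_harabasz": calinski_harabasz_score(X_scaled, labels),  # higher is better
    }

# Assumption for this sketch: pick the k with the highest Silhouette score.
best_k = max(scores, key=lambda k: scores[k]["silhouette"])
final_labels = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit_predict(X_scaled)

# Project the scaled features onto two principal components for a 2D scatter plot.
coords = PCA(n_components=2).fit_transform(X_scaled)

for k, s in scores.items():
    print(k, {name: round(float(value), 3) for name, value in s.items()})
print("selected k:", best_k)
```

The three metrics do not always agree on the best `k`, so in practice it may be worth inspecting all of them together rather than relying on a single score. The `coords` array can be passed to matplotlib or seaborn, colored by `final_labels`, to draw the cluster plot.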

Files used by the project:
- clustering_ED_patients.py: main script for data loading, clustering, visualization, and exporting.
- fact_table.csv: contains information on each ED episode.
- dim_table.csv: contains information on healthcare activity recorded across primary care settings, including aggregated clinical indicators that support statistical analysis of service utilization patterns.

Note: These files are not included in the repository due to data confidentiality. You must prepare your own CSV files with appropriate structure and variable names as described in the script comments.
- python: 3.13.3
- pandas: 2.2.3
- scikit-learn: 1.6.1
- matplotlib: 3.10.1
- seaborn: 0.13.2
Due to the sensitive nature of the healthcare data used in this study, the dataset cannot be made publicly available. The use of this information is strictly confined to statistical purposes within the scope of public health research, monitoring, and strategic planning, in full compliance with the General Data Protection Regulation (GDPR) and all other applicable legal and ethical standards. Technical identifiers were omitted to safeguard confidentiality and prevent reidentification.