
This report presents a comprehensive analysis of a dataset related to Alzheimer's disease, aiming to investigate the relationship between various characteristics and the diagnosis status (Demented vs. Nondemented). Statistical methods and machine learning algorithms were employed in R to uncover significant patterns and predictors of Alzheimer's disease.
- R: Statistical programming language used for data analysis and visualization.
- Libraries:
  - ggplot2: For creating visualizations such as scatter plots and box plots.
  - dplyr: For data manipulation tasks like filtering and summarizing.
  - tidyr: For reshaping data frames.
  - gridExtra: For arranging multiple plots on a single page.
  - GGally: For creating scatterplot matrices.
  - corrplot: For visualizing correlation matrices.
  - factoextra: For visualizing clustering results.
  - Boruta: For feature selection using the Boruta algorithm.
  - caret: For machine learning model training and evaluation.
  - glmnet: For fitting logistic regression models with regularization.
  - knitr: For generating formatted tables.
The dataset comprises demographic information, cognitive assessments, and brain measurements of individuals, along with their diagnosis status (Demented or Nondemented). Before proceeding with the analysis, we standardized the numerical variables and encoded the target variable, assigning a value of 1 for Demented and 0 for Nondemented.
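The preprocessing described above can be sketched in base R as follows. This is a minimal illustration on a toy data frame: the column names (`Group`, `Age`, `MMSE`, `nWBV`, `Diagnosis`) are assumptions made for the example and may not match the columns of the actual dataset.

```r
# Toy data with hypothetical column names; the real dataset's columns may differ.
df <- data.frame(
  Group = c("Demented", "Nondemented", "Nondemented", "Demented"),
  Age   = c(75, 68, 80, 71),
  MMSE  = c(22, 29, 30, 18),
  nWBV  = c(0.70, 0.76, 0.74, 0.68)
)

# Encode the target: 1 = Demented, 0 = Nondemented.
df$Diagnosis <- ifelse(df$Group == "Demented", 1L, 0L)

# Standardize the numeric predictors to mean 0 and standard deviation 1.
num_cols <- c("Age", "MMSE", "nWBV")
df[num_cols] <- scale(df[num_cols])

print(df)
```

After this step each numeric column is on the same scale, which matters for distance-based methods such as K-means and for regularized regression.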
We started the analysis by computing descriptive statistics to summarize the dataset. Visual representations such as boxplots and histograms were created to gain insights into the distribution and potential outliers of each variable.
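A descriptive pass of this kind might look like the sketch below. It uses simulated scores as a stand-in for a single variable (a hypothetical `mmse` vector, with one deliberately low value added), since the report does not show the underlying data.

```r
set.seed(1)

# Simulated stand-in for one numeric variable (hypothetical MMSE scores, capped at 30),
# plus one deliberately extreme value to illustrate outlier detection.
mmse <- c(pmin(30, round(rnorm(50, mean = 27, sd = 2))), 5)

# Five-number summary plus the mean.
print(summary(mmse))

# Points lying more than 1.5 * IQR beyond the hinges, i.e. the points a boxplot draws individually.
outliers <- boxplot.stats(mmse)$out
print(outliers)

# Distribution and outlier views.
hist(mmse, main = "Distribution of simulated MMSE scores", xlab = "MMSE")
boxplot(mmse, horizontal = TRUE, main = "Simulated MMSE scores")
```

The same calls can be repeated per column; ggplot2 and gridExtra, listed above, would give the equivalent plots arranged on one page.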
Two clustering algorithms, K-means and hierarchical clustering, were used to explore the underlying structure of the data. K-means revealed two distinct clusters, while hierarchical clustering showed how the data points group together at successive levels of similarity.
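As a rough illustration of the two clustering steps, the sketch below runs base R's `kmeans()` and `hclust()` on simulated two-group data. The feature names, the Ward linkage, and the choice of `nstart` are assumptions for the example; the report does not state which settings were used.

```r
set.seed(42)

# Two simulated groups of 20 points each, with hypothetical feature names.
x <- rbind(
  matrix(rnorm(40, mean = -2), ncol = 2),
  matrix(rnorm(40, mean =  2), ncol = 2)
)
colnames(x) <- c("feat1", "feat2")
x <- scale(x)  # standardize so both features contribute equally to distances

# K-means with two clusters; several random starts reduce the risk of a poor local optimum.
km <- kmeans(x, centers = 2, nstart = 25)

# Agglomerative hierarchical clustering on Euclidean distances (Ward linkage assumed here).
hc <- hclust(dist(x), method = "ward.D2")
hc_groups <- cutree(hc, k = 2)  # cut the dendrogram into two groups

# Cross-tabulate the two assignments to see how far they agree.
print(table(kmeans = km$cluster, hierarchical = hc_groups))
```

With factoextra loaded, `fviz_cluster(km, data = x)` and `fviz_dend(hc, k = 2)` would draw the corresponding cluster plot and dendrogram.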

We applied the Boruta feature selection technique to identify the most significant features associated with Alzheimer's disease. This helped us understand which variables play a crucial role in predicting the diagnosis status.
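A minimal sketch of a Boruta run is shown below, assuming the Boruta package is installed. The data are simulated, with hypothetical columns in which `MMSE` and `nWBV` carry signal and the two `noise` columns do not; `maxRuns = 50` is an arbitrary choice for the example.

```r
library(Boruta)

set.seed(123)
n <- 200

# Simulated predictors with hypothetical names: two informative, two pure noise.
toy <- data.frame(
  MMSE   = rnorm(n),
  nWBV   = rnorm(n),
  noise1 = rnorm(n),
  noise2 = rnorm(n)
)

# The outcome depends on MMSE and nWBV only.
score <- toy$MMSE + 0.5 * toy$nWBV + rnorm(n, sd = 0.5)
toy$Diagnosis <- factor(ifelse(score < 0, "Demented", "Nondemented"))

# Boruta compares each feature's importance against shuffled "shadow" copies.
bor <- Boruta(Diagnosis ~ ., data = toy, maxRuns = 50)
print(bor)

# Features confirmed as important, and the per-feature importance statistics.
selected <- getSelectedAttributes(bor, withTentative = FALSE)
print(selected)
print(attStats(bor))
```

Features that Boruta leaves as tentative can be included by passing `withTentative = TRUE`.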

A logistic regression model was developed to predict the diagnosis status of Alzheimer's disease based on the significant predictor variables identified. The model's performance was evaluated using the following metrics:
- Accuracy: The percentage of correctly classified instances out of the total instances. The logistic regression model achieved an accuracy of 92.6%.
- Precision: The proportion of true positive predictions out of all positive predictions made by the model. The precision of the model was 90%.
- Recall: The proportion of true positive predictions out of all actual positive instances in the dataset. The recall rate was 80%.
- F1 Score: The harmonic mean of precision and recall, providing a balance between the two metrics. The F1 score of the model was 88.5%.
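The four metrics above follow directly from the counts in a confusion matrix. The sketch below uses hypothetical counts, chosen only so that accuracy, precision, and recall round to the values reported; the actual confusion matrix is not given in the report.

```r
# Hypothetical confusion-matrix counts (not taken from the report).
tp <- 36   # Demented, predicted Demented
fp <- 4    # Nondemented, predicted Demented
fn <- 9    # Demented, predicted Nondemented
tn <- 127  # Nondemented, predicted Nondemented

accuracy  <- (tp + tn) / (tp + fp + fn + tn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

print(round(c(accuracy = accuracy, precision = precision,
              recall = recall, f1 = f1), 3))
```

With a precision of 90% and a recall of 80%, the harmonic mean works out to roughly 84.7%, so the 88.5% F1 score reported above may come from unrounded inputs, a different evaluation split, or a different averaging scheme; the report does not say which.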
These performance metrics indicate that the logistic regression model successfully classified individuals into the Demented or Nondemented categories with reasonable accuracy. The significant predictor variables, including gender, age, MMSE scores, and brain volume, played crucial roles in predicting the diagnosis status of Alzheimer's disease.
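For illustration, a logistic regression of this kind can be fitted and evaluated with base R's `glm()`, as sketched below on simulated data. The predictor names, the 70/30 split, and the 0.5 threshold are assumptions; the report lists caret and glmnet but does not specify the exact fitting call, so plain `glm()` is used here as a stand-in.

```r
set.seed(7)
n <- 300

# Simulated, standardized predictors with hypothetical names.
toy <- data.frame(
  Age  = rnorm(n),
  MMSE = rnorm(n),
  nWBV = rnorm(n),
  Male = rbinom(n, 1, 0.5)
)

# Simulated outcome: lower MMSE and brain volume raise the probability of dementia.
lin <- -0.5 + 0.4 * toy$Age - 1.5 * toy$MMSE - 0.8 * toy$nWBV + 0.6 * toy$Male
toy$Diagnosis <- rbinom(n, 1, plogis(lin))

# Hold out 30% of the rows for evaluation.
train_idx <- sample(n, size = round(0.7 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Fit the logistic regression and inspect the coefficient estimates and p-values.
fit <- glm(Diagnosis ~ Age + MMSE + nWBV + Male, data = train, family = binomial)
print(summary(fit)$coefficients)

# Predict on the held-out rows and build a confusion matrix at a 0.5 threshold.
prob <- predict(fit, newdata = test, type = "response")
pred <- ifelse(prob > 0.5, 1L, 0L)
conf_mat <- table(
  Predicted = factor(pred, levels = 0:1),
  Actual    = factor(test$Diagnosis, levels = 0:1)
)
print(conf_mat)

accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(accuracy)
```

The sign and p-value of each coefficient indicate the direction and strength of its association with the diagnosis, which is how predictors such as MMSE or brain volume would be judged significant.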
The boxplots and histograms showed the distribution of each variable and flagged potential outliers. Clustering analysis revealed two distinct groups, one representing individuals at higher risk of dementia and the other exhibiting better cognitive function. Feature selection highlighted the importance of Clinical Dementia Rating (CDR) and MMSE in predicting Alzheimer's disease.
Several variables, including age, gender, education, socioeconomic status, cognitive function (MMSE), and brain volume, were found to be associated with the diagnosis of dementia. The logistic regression model identified gender, age, MMSE scores, and brain volume as significant predictors of the diagnosis. Early detection and intervention based on these predictors could potentially improve patient outcomes and quality of life.
In conclusion, this analysis contributes to our understanding of the potential risk factors and characteristics associated with Alzheimer's disease. By leveraging various statistical techniques and machine learning algorithms, we gained insights into the dataset and identified key predictors of Alzheimer's disease.