Applying data mining techniques to predict the depression level of Vietnamese students
Report Bug · Request Feature
No. | Full Name | Student's ID | Email | Github account | Roles | Contribution |
---|---|---|---|---|---|---|
1 | Nguyen Hoang Anh Tu | ITDSIU20090 | ITDSIU20090@student.hcmiu.edu.vn | nghganhtu | TEAM LEADER with Model evaluation, Hyper-parameters tuning | 20% |
2 | Nguyen Quang Dieu | ITDSIU20031 | ITDSIU20031@student.hcmiu.edu.vn | itzmealvin | Data preprocessing, Optimize performance | 20% |
3 | Duong Nguyen Gia Khanh | ITDSIU20100 | ITDSIU20100@student.hcmiu.edu.vn | GiaKhanhs | Implement classification/prediction algorithms in Java, Bug fixing | 20% |
4 | Hoang Tuan Kiet | ITDSIU21055 | ITDSIU21055@student.hcmiu.edu.vn | meiskiet | Implement pre-processing code, Workspace set-up | 20% |
5 | Nguyen Hai Ngoc | ITDSIU21057 | ITDSIU21057@student.hcmiu.edu.vn | haingocnguyen | Implement classification/prediction algorithms in Java, Bug fixing | 20% |
The project investigates depression in Vietnamese students aged 15 to 25 using modern data mining methods, highlighting the need for culturally sensitive early detection strategies to reduce mental health issues.
- Use machine learning models to accurately predict depression levels among Vietnamese students.
- Test a range of machine learning approaches to identify the most successful ones.
- Increase the robustness and effectiveness of these classification models by adding stronger algorithms.
- Apply data mining techniques to real-world datasets to bridge the gap between theoretical research and practical application.
- Empower healthcare practitioners to make more informed clinical decisions and enhance the mental health of Vietnamese students through early detection and intervention.
Depression awareness in Vietnam has increased due to the Internet, but the public's willingness to seek screenings remains low. Self-assessment programs like "15minutes4me" offer hope, but they often rely on predetermined symptoms, overlooking the complex interplay between living environment and stress. This data mining project aims to predict depression levels among Vietnamese students aged 15 to 25 by investigating environmental factors and pressures. Building on prior research, the study uses a classification model and analytic techniques to provide a more nuanced understanding of depression among Vietnamese students.
The project uses machine learning models built on classification algorithms to improve prediction accuracy when measuring depression levels among Vietnamese students. Algorithms such as Support Vector Machine, Naive Bayes, and K-Nearest Neighbors are used to identify patterns in the data and make predictions.
- Data collecting
- Data pre-processing
- Build model to classify
- Deploy AI Chatbot based on the result
- More to come...
Please see the open issues for a full list of proposed features (and known issues).
This study focuses on predicting depression levels among Vietnamese students using data mining techniques. It investigates the accuracy of machine learning algorithms and their practical implications for early detection and intervention, aiming to improve students' well-being and academic performance.
The research used a systematic approach to gather data on stress and pressure among Vietnamese students. A Google Form survey was created based on various studies and conversations with university specialists. The survey identified five main causes of stress: employment, studies, self, family, and love. The Patient Health Questionnaire (PHQ-9) was used to assess depression risk. The survey was distributed to a diverse group of Vietnamese students, ensuring confidentiality and anonymity. The data collected was evaluated for quality and completeness, aiming to improve knowledge and early detection of depression among Vietnamese students.
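As a rough illustration of how PHQ-9 answers can be turned into a depression level, the sketch below sums the nine item scores (0 to 3 each) and maps the total to the commonly published PHQ-9 severity bands. The class and method names are hypothetical, and the exact labels used in this project's datasets may differ.

```java
import java.util.Arrays;

/** Sketch only: scores one PHQ-9 response; not taken from the project source. */
final class Phq9Scorer {

    /** Sums the nine item scores; each item must be in the range 0..3. */
    static int totalScore(int[] answers) {
        if (answers.length != 9) {
            throw new IllegalArgumentException("PHQ-9 has exactly 9 items");
        }
        for (int a : answers) {
            if (a < 0 || a > 3) {
                throw new IllegalArgumentException("Item scores must be 0..3");
            }
        }
        return Arrays.stream(answers).sum();
    }

    /** Maps a total score (0..27) to the standard PHQ-9 severity bands. */
    static String severity(int total) {
        if (total < 0 || total > 27) {
            throw new IllegalArgumentException("Total must be 0..27");
        }
        if (total <= 4) return "minimal";
        if (total <= 9) return "mild";
        if (total <= 14) return "moderate";
        if (total <= 19) return "moderately severe";
        return "severe";
    }

    public static void main(String[] args) {
        int[] answers = {1, 2, 0, 3, 1, 1, 2, 0, 1}; // hypothetical respondent
        int total = totalScore(answers);
        System.out.println(total + " -> " + severity(total)); // prints "11 -> moderate"
    }
}
```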
In the pre-processing stage, responses were categorized into five groups based on depression-related factors. Python's `shuffle` function was used to randomize the row order so that the data carried no inherent order or bias. Python was used for data processing because of Weka's limitations and the difficulty of doing the same work in Java. Missing data was filled using an imputation method, and string-type responses were converted into numerical form using label encoding. The original dataset had long column names, so the questions were encoded with alphabetical letters for clarity and quick access. This pre-processing made the dataset suitable for further analysis and machine learning methods.
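The imputation and label-encoding steps can be sketched as follows. The project did this work in Python; the Java version below is only an illustration, and it assumes mode imputation (the README does not name the imputation method) and assigns codes in order of first appearance. `ColumnEncoder` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: fills missing answers with the column mode, then label-encodes. */
final class ColumnEncoder {

    /** Returns the most frequent non-null value (the first one seen wins ties). */
    static String mode(List<String> column) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String v : column) {
            if (v != null) counts.merge(v, 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    /** Replaces nulls with the mode, then maps each distinct string to an integer code. */
    static int[] imputeAndEncode(List<String> column) {
        String fill = mode(column);
        Map<String, Integer> codes = new LinkedHashMap<>();
        int[] encoded = new int[column.size()];
        for (int i = 0; i < column.size(); i++) {
            String v = column.get(i) == null ? fill : column.get(i);
            Integer code = codes.get(v);
            if (code == null) {
                code = codes.size(); // next unused code
                codes.put(v, code);
            }
            encoded[i] = code;
        }
        return encoded;
    }

    public static void main(String[] args) {
        List<String> answers = Arrays.asList("Often", null, "Never", "Often", "Sometimes");
        System.out.println(Arrays.toString(imputeAndEncode(answers))); // prints [0, 0, 1, 0, 2]
    }
}
```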
Machine learning algorithms are the core of the classification models, as they identify trends and make data-driven judgments. The project uses IBk, J48, Logistic Regression, Naive Bayes, OneR, SVM, AdaBoostM1, RandomForest, and ExtraTree. IBk classifies an instance based on its closest neighbors, which suits pattern recognition tasks; J48 builds a decision tree; OneR learns a rule from a single attribute; Logistic Regression, Naive Bayes, and SVM cover probabilistic and margin-based classification. Ensemble approaches like AdaBoostM1 and RandomForest increase classification performance by combining multiple classifiers, while ExtraTree increases resilience by creating decision trees with random splits. Weka supports these models and offers full modeling and training capabilities, which simplifies development. Every Weka classifier exposes the same buildClassifier function, so the training code looks nearly identical across classifiers.
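To make the nearest-neighbor idea behind IBk concrete, here is a minimal k-nearest-neighbors sketch in plain Java. It is not Weka's implementation and is not taken from the project source; `TinyKnn`, the Euclidean distance, and the toy data are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

/** Sketch only: majority vote among the k closest training points. */
final class TinyKnn {

    /** Euclidean distance between two feature vectors of equal length. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Predicts the label that occurs most often among the k nearest neighbors. */
    static int predict(double[][] trainX, int[] trainY, double[] query, int k) {
        Integer[] order = new Integer[trainX.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        // Sort training indices by distance to the query (stable sort keeps index order on ties).
        Arrays.sort(order, Comparator.comparingDouble(i -> distance(trainX[i], query)));

        Map<Integer, Integer> votes = new HashMap<>();
        int best = -1;
        int bestVotes = 0;
        for (int n = 0; n < Math.min(k, order.length); n++) {
            int label = trainY[order[n]];
            int v = votes.merge(label, 1, Integer::sum);
            if (v > bestVotes) {
                bestVotes = v;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] x = {{0, 0}, {1, 0}, {0, 1}, {5, 5}, {6, 5}, {5, 6}}; // toy data
        int[] y = {0, 0, 0, 1, 1, 1};
        System.out.println(predict(x, y, new double[]{0.5, 0.5}, 3)); // prints 0
        System.out.println(predict(x, y, new double[]{5.5, 5.5}, 3)); // prints 1
    }
}
```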
The final step evaluates each machine learning model on a test dataset to assess how well it identifies and forecasts depression levels. The evaluation uses accuracy, precision, recall, and F1 score to check the model's dependability and its ability to generalize. This step is crucial for enabling proactive early depression detection and management, enhancing student well-being. The results are reported with Weka functions like toSummaryString and toMatrixString, together with the individual metrics.
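For reference, the four metrics named above can be computed from confusion-matrix counts as in the sketch below. It treats one class as "positive" (for a multi-class problem such as depression levels, the same formulas apply per class in a one-vs-rest fashion). `BinaryMetrics` and the counts in `main` are hypothetical and not project results.

```java
/** Sketch only: accuracy, precision, recall and F1 from confusion-matrix counts. */
final class BinaryMetrics {
    final int tp; // true positives
    final int fp; // false positives
    final int fn; // false negatives
    final int tn; // true negatives

    BinaryMetrics(int tp, int fp, int fn, int tn) {
        this.tp = tp;
        this.fp = fp;
        this.fn = fn;
        this.tn = tn;
    }

    /** Share of all instances that were classified correctly. */
    double accuracy() {
        return (double) (tp + tn) / (tp + fp + fn + tn);
    }

    /** Share of predicted positives that are truly positive (0 if none were predicted). */
    double precision() {
        return tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
    }

    /** Share of actual positives that were found (0 if there are none). */
    double recall() {
        return tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
    }

    /** Harmonic mean of precision and recall. */
    double f1() {
        double p = precision();
        double r = recall();
        return p + r == 0.0 ? 0.0 : 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        BinaryMetrics m = new BinaryMetrics(8, 2, 4, 6); // made-up counts
        System.out.printf("acc=%.4f p=%.4f r=%.4f f1=%.4f%n",
                m.accuracy(), m.precision(), m.recall(), m.f1());
    }
}
```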
- `.idea` folder: provides project-specific parameters and configurations for IntelliJ IDEA
- `.vscode` folder: contains project-specific settings and configuration files for Visual Studio Code
- `data` folder: holds the datasets organized by family, love, self, study, and job subfolders. Each contains the full training, testing, and validation dataset in both CSV and ARFF format supported by Weka
- `demo` folder: comprises screenshots for the UI and results produced in CSV format
- `lib` folder: contains external libraries and dependencies needed for the project (weka.jar, extraTree.jar, and JavaFX SDK)
- `out` folder: stores compiled output files, such as class files
- `src` folder: the primary source directory for the project's code
  - `model` subfolder:
    - `based_models` subfolder: provides foundational classifier Java classes for this project. That includes IBk, J48, Logistic Regression, Naïve Bayes, OneR, and SVM as default models, and J48, Logistic Regression, and SVM as tuning models.
    - `ensemble_models` subfolder: contains the ensemble model classes that incorporate many classifiers. That includes AdaBoostM1, ExtraTree, and RandomForest as default and tuning models.
    - These folders further contain the models and hyperparameter_tuning subfolders for each.
    - `Command.java`: the interface for each classifier model class to implement the `void exec(DataSource trainSource, DataSource testSource)` function
  - `pre-processing` folder:
    - `FS_output` folder: contains output files for JavaFX visualization chart
    - `helloFX` folder: contains the source code for a JavaFX application
    - `DataImporter.java`: class to import data from the specified path
    - `DataProcessor.java`: the class for processing and preparing data
    - `FeatureSelection.java`: class to enable feature selection algorithms
    - `RemoveAttributes.java`: a class that has a function for deleting certain features from the datasets
    - `SplitData.java`: the class that separates datasets into training, testing, and validation sets
  - `Main.java`: the project driver code; contains the Main class
- `.gitignore`: to allow Git VCS to ignore certain files and folders
- `DM_Project.iml`: to configure the IntelliJ IDEA project file
- `README.md`: a document to outline and explain the project
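The listing above mentions a class that separates a dataset into training, testing, and validation sets. A minimal sketch of that idea is shown below; it is not the project's `SplitData.java`. The `SplitSketch` name, the seeded shuffle, and the 70/15/15 proportions in `main` are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Sketch only: shuffles rows with a fixed seed and cuts them into three parts. */
final class SplitSketch {

    /** Returns [train, test, validation]; validation receives whatever remains. */
    static <T> List<List<T>> split(List<T> rows, double trainFrac, double testFrac, long seed) {
        if (trainFrac < 0 || testFrac < 0 || trainFrac + testFrac > 1.0) {
            throw new IllegalArgumentException("Fractions must be non-negative and sum to at most 1");
        }
        List<T> shuffled = new ArrayList<>(rows);
        Collections.shuffle(shuffled, new Random(seed)); // fixed seed keeps the split reproducible

        int n = shuffled.size();
        int trainEnd = (int) Math.round(n * trainFrac);
        int testEnd = Math.min(n, trainEnd + (int) Math.round(n * testFrac));

        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, trainEnd)));
        parts.add(new ArrayList<>(shuffled.subList(trainEnd, testEnd)));
        parts.add(new ArrayList<>(shuffled.subList(testEnd, n)));
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 100; i++) rows.add(i);
        List<List<Integer>> parts = split(rows, 0.70, 0.15, 42L);
        System.out.println(parts.get(0).size() + " / " + parts.get(1).size() + " / " + parts.get(2).size());
        // prints "70 / 15 / 15"
    }
}
```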
- Java Development Kit 17 or later (e.g. OpenJDK)
- Any Java IDE (e.g. JetBrains IntelliJ IDEA)
- Clone the repo: `git clone https://github.com/GiaKhanhs/DM_Project.git`
- Open it in a Java IDE (preferably JetBrains IntelliJ IDEA)
- Click on File > Project Structure... > Libraries. Click on the (+) button, point to the `lib` folder, and add the following libraries:
  - `lib/weka.jar` for the Weka main library
  - `lib/extraTrees.jar` for the ExtraTrees library
  - `lib/javafx-sdk-osx-arm64/lib` for JavaFX on Apple Silicon Macs, or `lib/javafx-sdk-win64/lib` for JavaFX on Intel/AMD Windows PCs. For other platforms, please refer to here and download JavaFX v22.0.1 for your platform.
- Open `src/preprocessing/dataImporter.java` and, in the try block, set the paths to the dataset you want to explore:

  ```java
  // macOS/Unix
  trainSource = new DataSource("data/family/training_data.arff");
  testSource = new DataSource("data/family/test_data.arff");
  validSource = new DataSource("data/family/validation_data.arff");

  // Windows
  trainSource = new DataSource("data\\family\\training_data.arff");
  testSource = new DataSource("data\\family\\test_data.arff");
  validSource = new DataSource("data\\family\\validation_data.arff");
  ```
- Open `src/Main.java` and click on the RUN button to see the result
- To see the visualization, open either `src/preprocessing/helloFX/PieChartAll.java` or `src/preprocessing/helloFX/PieChartFactors.java`, then click on Run > Edit Configurations and add the VM options for your platform:

  ```
  FOR macOS/Unix: --module-path /absolute/path/to/javafx-sdk-22.0.1/lib --add-modules javafx.controls,javafx.fxml
  FOR Windows: --module-path "\path\to\javafx-sdk-22.0.1\lib" --add-modules javafx.controls,javafx.fxml
  ```

  then click on Run to see the result
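The dataset paths used in the import step differ only in their separator, so one option is to let Java build them for the current platform. The helper below is a sketch, not part of the project: `DatasetPaths` is a hypothetical class, and it assumes the `data/<factor>/<file>.arff` layout shown in the steps above. The resulting string could be passed to `new DataSource(...)` in the same way as the hard-coded paths.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Sketch only: builds dataset paths with the separator of the current platform. */
final class DatasetPaths {

    /** Joins "data", the factor folder (e.g. "family") and a file name. */
    static Path of(String factor, String fileName) {
        return Paths.get("data", factor, fileName);
    }

    static String training(String factor) {
        return of(factor, "training_data.arff").toString();
    }

    static String test(String factor) {
        return of(factor, "test_data.arff").toString();
    }

    static String validation(String factor) {
        return of(factor, "validation_data.arff").toString();
    }

    public static void main(String[] args) {
        // On macOS/Unix this prints data/family/training_data.arff,
        // on Windows data\family\training_data.arff.
        System.out.println(training("family"));
    }
}
```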
The study analyzed data from 1,500 Vietnamese students, with 739 being female and 761 being male. It revealed that females are more prone to severe depression and have a higher risk of self-harm. Family and work-related stressors were significant contributors to depression among female students.
Physical activity was associated with lower levels of depression, and students who were spoiled by their parents were more prone to develop depression later in life. Academic stress was a major concern for high school and first-year college students, while graduates and fourth-year college students were more concerned with employment and finances. The LGBT community also experienced significant stress due to familial and personal concerns. Workplace stress was a major problem, particularly among females. Academic stress, school bullying, parental expectations, instructor partiality, and peer pressure were also significant contributors. Romantic relationships were a major stressor for Vietnamese students, contributing to anxiety and emotional illnesses. Family-related stress was a major concern, driven by high parental expectations, misguided care, and financial issues.
The following table lists the best model parameters found with Weka, using ten-fold cross-validation to compare parameter combinations on each dataset. The table also describes each parameter.
MODEL | DATASET | PARAMETERS | DESCRIPTIONS |
---|---|---|---|
SVM | family | -C 0.2222222222222222 -C 250007 -E 2.0 -K weka.classifiers.functions.supportVector.PolyKernel -L 0.001 -M -1 -N 0 -P 1.0E-12 -R 1.0E-8 -V -1 -W 1 -calibrator weka.classifiers.functions.Logistic -num-decimal-places 4 | -C (first value): Complexity constant (C) of the SVM. -C 250007: Cache size of the polynomial kernel. -E: Exponent for the polynomial kernel. -K: Kernel function for SVM. -L: Tolerance parameter. -M: Maximum number of iterations of the Logistic calibrator (-1 means until convergence). -N: Normalization mode (0 normalizes the data). -P: Epsilon for round-off error. -R: Ridge value of the Logistic calibrator. -V: Number of folds for internal cross-validation (-1 uses the training data). -W: Random number seed. -calibrator: Calibrator function for probability estimation. -num-decimal-places: Number of decimal places in the output. |
SVM | love | -C 0.3333333333333333 -C 250007 -E 2.0 -K weka.classifiers.functions.supportVector.PolyKernel -L 0.001 -M -1 -N 0 -P 1.0E-12 -R 1.0E-8 -V -1 -W 1 -calibrator weka.classifiers.functions.Logistic -num-decimal-places 4 | Already described |
Random Forest | self | -I 670 -K 0 -M 1.0 -P 100 -S 1 -V 0.001 -num-slots 1 | -I: Number of iterations (trees) in Random Forest. -K: Number of attributes to randomly investigate in each tree node (0 lets Weka choose). -M: Minimum number of instances per leaf. -P: Size of each bag as a percentage of the training set. -S: Random seed for reproducibility. -V: Minimum variance proportion required for a split (numeric classes). -num-slots: Number of execution slots (threads) to use. |
SVM | study | -C 0.4444444444444444 -C 250007 -E 2.0 -K weka.classifiers.functions.supportVector.PolyKernel -L 0.001 -M -1 -N 0 -P 1.0E-12 -R 1.0E-8 -V -1 -W 1 -calibrator weka.classifiers.functions.Logistic -num-decimal-places 4 | Already described |
Random Forest | work | -I 230 -K 0 -M 1.0 -P 100 -S 1 -V 0.001 -num-slots 1 | Already described |
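The tuning procedure relies on ten-fold cross-validation to compare parameter values. The sketch below shows the two ingredients in plain Java: cutting row indices into k folds and picking the candidate value with the best mean score. It is not Weka's tuning code; `KFoldSketch`, the candidate values, and the scoring function are hypothetical, and unlike a stratified scheme it does not balance classes across folds (rows are assumed to be shuffled beforehand).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.DoubleUnaryOperator;

/** Sketch only: k-fold index partitioning plus a one-parameter grid search. */
final class KFoldSketch {

    /** Returns, for each of the k folds, the row indices held out for evaluation. */
    static List<int[]> folds(int n, int k) {
        if (k < 2 || k > n) {
            throw new IllegalArgumentException("k must be between 2 and n");
        }
        List<int[]> out = new ArrayList<>();
        int start = 0;
        for (int f = 0; f < k; f++) {
            int size = n / k + (f < n % k ? 1 : 0); // spread the remainder over the first folds
            int[] idx = new int[size];
            for (int i = 0; i < size; i++) idx[i] = start + i;
            out.add(idx);
            start += size;
        }
        return out;
    }

    /** Returns the candidate whose mean cross-validation score is highest. */
    static double bestCandidate(double[] candidates, DoubleUnaryOperator meanCvScore) {
        double best = candidates[0];
        double bestScore = Double.NEGATIVE_INFINITY;
        for (double c : candidates) {
            double score = meanCvScore.applyAsDouble(c);
            if (score > bestScore) {
                bestScore = score;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(folds(34, 10).size()); // prints 10
        // Made-up score that peaks at 0.3, standing in for a real cross-validated accuracy.
        double best = bestCandidate(new double[]{0.1, 0.2, 0.3, 0.4}, c -> -(c - 0.3) * (c - 0.3));
        System.out.println(best); // prints 0.3
    }
}
```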
The next three tables show the models' accuracy (%) on the testing datasets before and after hyperparameter tuning, followed by the accuracy on the validation datasets, using Weka with ten-fold cross-validation. The results highlight gains in performance and the best-performing model for each dataset, with bold numbers representing the best accuracy.
Testing accuracy (%) before hyperparameter tuning:

DATASET | IBk | Logistic Regression | NaiveBayes | Random Forest | SVM |
---|---|---|---|---|---|
family | 70.5882 | 70.5882 | 52.9412 | 61.7647 | 82.3529 |
love | 78.5714 | 53.5714 | 64.2857 | 64.2857 | 75.0000 |
self | 76.5625 | 87.5000 | 87.5000 | 79.6875 | 85.9375 |
study | 82.2785 | 94.9367 | 81.0127 | 86.0759 | 87.3418 |
work | 79.1667 | 93.7500 | 83.3333 | 84.3750 | 90.6250 |
Testing accuracy (%) after hyperparameter tuning:

DATASET | IBk | Logistic Regression | NaiveBayes | Random Forest | SVM |
---|---|---|---|---|---|
family | N/A | 70.5882 | N/A | 67.6471 | 88.2353 |
love | N/A | 53.5714 | N/A | 64.2857 | 78.5714 |
self | N/A | 89.0625 | N/A | 81.2500 | 84.3750 |
study | N/A | 94.9367 | N/A | 87.3418 | 93.6709 |
work | N/A | 93.7500 | N/A | 85.4167 | 88.5417 |
Validation accuracy (%):

DATASET | IBk | Logistic Regression | NaiveBayes | Random Forest | SVM |
---|---|---|---|---|---|
family | N/A | 74.0741 | N/A | 81.4815 | 77.7778 |
love | N/A | 77.2727 | N/A | 72.7273 | 81.8182 |
self | N/A | 94.1146 | N/A | 80.3922 | 90.1961 |
study | N/A | 88.8889 | N/A | 82.5397 | 88.8889 |
work | N/A | 94.7368 | N/A | 85.5263 | 96.0526 |
The full version of the results can be found here.
(and there is more to explore if you would like to contribute to our project...)
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Nguyen Hoang Anh Tu by Email HERE
Project Link: GitHub HERE
We want to express our sincerest thanks to our lecturer and the people who have helped us to achieve this project's goals:
- Dr. Nguyen Thi Thanh Sang
- MSc. Nguyen Quang Phu
- The README.md template from othneildrew