Team project for the course “Architecture and Technologies of Big Data Systems” at the National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”.
The goal of the project is to build a prediction model that determines whether a person uses drugs (such as cocaine, crack, or marijuana).
The dataset used for training is the National Survey on Drug Use and Health (2015-2019).
- LGBMRegressor: Uses gradient boosting based on LightGBM
- XGBRegressor: Uses gradient boosting based on XGBoost
- Ridge Regressor: Linear regression with Ridge regularization
- Gradient Boosting Regressor: Uses gradient boosting to improve model accuracy
- Random Forest Regressor: Uses an ensemble of decision trees for prediction
- StackingCVRegressor: A model that combines the predictions of several underlying models using cross-validation
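The stacking idea from the list above can be sketched as follows. This is a minimal illustration, not the project's pipeline: it assumes scikit-learn's `StackingRegressor` as a stand-in for `StackingCVRegressor`, uses synthetic data instead of the survey dataset, includes only three of the base models, and assumes that continuous regressor outputs are turned into class labels with a 0.5 threshold. All hyperparameters are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey data (assumption: binary 0/1 target).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Base regressors; their out-of-fold predictions feed the meta-model.
base_models = [
    ("ridge", Ridge(alpha=1.0)),
    ("gbr", GradientBoostingRegressor(n_estimators=30, random_state=0)),
    ("rf", RandomForestRegressor(n_estimators=30, random_state=0)),
]

# cv=5 means the meta-model is trained on 5-fold cross-validated predictions.
stack = StackingRegressor(estimators=base_models, final_estimator=Ridge(), cv=5)
stack.fit(X_train, y_train)

# Continuous scores -> class labels (the 0.5 threshold is an assumption).
scores = stack.predict(X_test)
labels = (scores >= 0.5).astype(int)
accuracy = float(np.mean(labels == y_test))
print(f"Accuracy on the synthetic hold-out set: {accuracy:.3f}")
```

Training the meta-model on out-of-fold predictions keeps it from simply trusting whichever base model overfits the training data most.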
- data/: The folder with the preprocessed datasets
- doc/: Information about the dataset, taken from the SAMHSA site
- demo/: Images with the results of model testing
- models/: Saved models and pipeline
- notebooks/: A Jupyter notebook for data visualization, model training, and analysis of results
- requirements.txt: List of required Python packages for installation
Model | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|
LightGBM | 0.912226 | 0.992937 | 0.830966 | 0.904760 |
XGBoost | 0.909194 | 0.999785 | 0.819188 | 0.900521 |
Ridge Regression | 0.909039 | 0.999677 | 0.818968 | 0.900344 |
Gradient Boosting | 0.912580 | 0.992683 | 0.831892 | 0.905203 |
Random Forest | 0.908441 | 1.000000 | 0.817512 | 0.899595 |
Stacking | 0.913221 | 0.989709 | 0.835730 | 0.906225 |
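
The metrics in the table can be related to each other with a short sketch. It assumes that regressor outputs are converted to class labels with a 0.5 threshold (the README does not state the rule actually used); the function names and the toy data are hypothetical. The final check recomputes the Random Forest F1-score from its precision and recall in the table.

```python
def confusion_counts(y_true, scores, threshold=0.5):
    """Count TP, FP, TN, FN after thresholding continuous scores."""
    tp = fp = tn = fn = 0
    for truth, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }


# Toy labels and scores, purely illustrative.
y_true = [1, 1, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.7, 0.2, 0.6, 0.8]
metrics = classification_metrics(*confusion_counts(y_true, scores))
print(metrics)

# Consistency check against the Random Forest row of the table.
rf_f1 = f1_from(1.000000, 0.817512)
print(round(rf_f1, 6))
```

High precision combined with lower recall, as in every row of the table, means the models rarely flag a non-user but miss a share of actual users.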
git clone https://github.com/TokenRR/Bigdata_university_course.git
cd Bigdata_university_course
pip install -r requirements.txt
You can use the notebooks in the notebooks/ folder to explore the data and analyze the results.
If you would like to contribute to this project, please create a pull request or open a new issue.