Repository for storing code for my MS in Data Science course CS675 Introduction To Data Science at Pace University.
Course description: This course introduces students to Machine Learning and Deep Learning technologies, Data Analytics at scale, and Data-driven Science systems in order to extract insights from data in various forms. These scientific processes include phases and techniques such as Data Preparation, Model Building and Prediction, Clustering, Association, Regression (Linear and Logistic), Classification, Decision Trees, Textual Data Analysis, and Data Presentation. The basic concepts are covered with examples that can be tried in R or Python using RStudio and/or Jupyter Notebooks (aka IPython Notebooks). These miniaturized examples of real-world problems are designed in such a way that the student gains a clear understanding and a firm foundation in the methods covered in the course. In addition, the course gives an introduction to the R Statistical Language, Apache (Databricks) Spark, and the Anaconda Analytics platform.
In this project, the task was to perform an EDA (Exploratory Data Analysis) on a dataset of customer churn in the telecommunications industry (churn is when customers leave the company). I inspected the raw dataset, cleaned it, and examined each variable and its relationships to the others in order to identify the variables that affect customer churn.
- Class presentation of this project: https://www.youtube.com/watch?v=0U4XsjbPn8U
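
The inspect, clean, and examine steps described for this project can be sketched with pandas as follows. This is a minimal illustration, not the project's actual code: the toy data and the column names (`contract`, `monthly_charges`, `churn`) are hypothetical, since the real dataset's schema is not listed here.

```python
import pandas as pd

# Hypothetical toy data standing in for the raw churn dataset.
raw = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "contract": ["monthly", "monthly", "yearly", "yearly", "monthly", None],
    "monthly_charges": ["70.5", "89.1", "45.0", " ", "99.9", "20.0"],
    "churn": ["Yes", "Yes", "No", "No", "No", "No"],
})


def clean(df):
    """Convert text columns to usable types and drop incomplete rows."""
    out = df.copy()
    # Blank or malformed numbers become NaN instead of raising an error.
    out["monthly_charges"] = pd.to_numeric(out["monthly_charges"], errors="coerce")
    # Encode the target as 1 (churned) / 0 (stayed).
    out["churn"] = out["churn"].map({"Yes": 1, "No": 0})
    return out.dropna()


# Inspect: how many values are missing per column before cleaning?
missing_per_column = raw.isna().sum()

cleaned = clean(raw)

# Examine: churn rate per category, and correlation with a numeric variable.
churn_by_contract = cleaned.groupby("contract")["churn"].mean()
charges_vs_churn = cleaned["monthly_charges"].corr(cleaned["churn"])

print(missing_per_column)
print(churn_by_contract)
print(charges_vs_churn)
```

Grouping the encoded target by a categorical column gives the churn rate per category directly, which is a quick way to spot variables worth a closer look.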
This project was a continuation of Project #1. In this project, I performed various stages of machine learning analysis on the same dataset and used four models to generate predictions: Naive Bayes, Logistic Regression, Random Forest, and XGBoost. I applied SMOTE (Synthetic Minority Over-sampling Technique) and hyperparameter tuning to these models, then compared the results to determine the best model for predicting churn.
- Class presentation of this project: https://www.youtube.com/watch?v=M1PMJYq2hhI
- Full code walkthrough of this project: https://www.youtube.com/watch?v=VVIC3dSqqk8&t=25s
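
To show what SMOTE does to an imbalanced dataset such as churn data, here is a simplified from-scratch sketch in NumPy. It is an assumption-laden illustration, not the code used in this project: the function name `smote_oversample` and the toy data are hypothetical, and a real analysis would more likely use a library implementation.

```python
import numpy as np


def smote_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours. Requires at least two minority samples."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    # Pick a base sample and one of its neighbours for each new point.
    base = rng.integers(0, n, size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    # Place the new point somewhere on the segment between the two.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Toy imbalanced data: 90 majority samples (label 0), 10 minority (label 1).
rng = np.random.default_rng(42)
X_major = rng.normal(0.0, 1.0, size=(90, 2))
X_minor = rng.normal(3.0, 1.0, size=(10, 2))

synthetic = smote_oversample(X_minor, n_new=len(X_major) - len(X_minor))
X_balanced = np.vstack([X_major, X_minor, synthetic])
y_balanced = np.array([0] * len(X_major) + [1] * (len(X_minor) + len(synthetic)))

print(np.bincount(y_balanced))
```

Oversampling is normally applied to the training split only, so that the test set keeps the original class balance and the evaluation stays honest.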
In this project, the task was to perform time-series forecasting on a dataset of electricity consumption across New York City's five boroughs. I extracted the data from the City of New York website and used the FB Prophet package for Python to generate the forecasts.
- Class presentation of this project: https://youtu.be/58s0qYSVGaQ
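
Prophet expects a two-column DataFrame with `ds` (dates) and `y` (values), so the borough data has to be reshaped before fitting. The sketch below shows one way to do that; the helper names and the column names `borough`, `month`, and `kwh` are hypothetical, since the real dataset's schema is not listed here. The import assumes the package name `prophet` used by release 1.0 and later (older releases were published as `fbprophet`).

```python
import pandas as pd


def to_prophet_frame(df, borough, date_col="month", value_col="kwh",
                     borough_col="borough"):
    """Return the `ds`/`y` frame Prophet expects for a single borough."""
    sub = df[df[borough_col] == borough]
    # Sum all readings that fall in the same period.
    monthly = sub.groupby(date_col, as_index=False)[value_col].sum()
    out = monthly.rename(columns={date_col: "ds", value_col: "y"})
    out["ds"] = pd.to_datetime(out["ds"])
    return out.sort_values("ds").reset_index(drop=True)


def forecast_borough(frame, periods=12):
    """Fit Prophet with default settings and forecast `periods` months ahead."""
    # Imported lazily so the reshaping step works without Prophet installed.
    from prophet import Prophet

    model = Prophet()
    model.fit(frame)
    future = model.make_future_dataframe(periods=periods, freq="MS")
    return model.predict(future)[["ds", "yhat", "yhat_lower", "yhat_upper"]]


# Hypothetical toy rows; real data would span several years per borough.
toy = pd.DataFrame({
    "borough": ["BRONX", "BRONX", "BRONX", "QUEENS"],
    "month": ["2020-02-01", "2020-01-01", "2020-01-01", "2020-01-01"],
    "kwh": [120.0, 100.0, 50.0, 999.0],
})

bronx = to_prophet_frame(toy, "BRONX")
print(bronx)
```

With a realistic history in place, `forecast_borough(bronx)` would return the point forecast (`yhat`) together with its uncertainty interval; fitting one model per borough keeps each series' seasonality separate.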