
jinysong/data-engineering-projects


Amazon Face Mask Rating Prediction with Neural Network Model

Online retailers saw unprecedented demand for face masks in 2020 with the rise of COVID-19. Analyzing product reviews and ratings with sentiment analysis, using the Natural Language Toolkit (NLTK), can help businesses understand consumer behavior and guide product development.

In this machine learning project, I built a neural network (NN) model that predicts the star rating, on a scale of 1 to 5, from a review written by a real Amazon customer.

This project demonstrates a complete machine learning workflow:

  • data scraping using Selenium web automation
  • data cleaning and EDA with Python, NumPy, and pandas
  • data visualization with seaborn and matplotlib
  • text data processing and encoding with NLTK
  • neural network modeling, training, and testing with scikit-learn, Keras, and TensorFlow
  • model evaluation using various metrics
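The text encoding step can be illustrated with a minimal sketch. It uses only the Python standard library rather than NLTK, and the stop-word list, the function names (`tokenize`, `build_vocabulary`, `encode`), and the sample reviews are all assumptions made for illustration, not taken from the notebooks. It shows one common way to turn a review into a sequence of integer ids that a neural network can consume.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; NLTK ships a much larger one).
STOP_WORDS = {"the", "a", "an", "is", "it", "and", "this", "to", "of", "i"}


def tokenize(review):
    """Lowercase a review, keep alphabetic tokens, and drop stop words."""
    words = re.findall(r"[a-z']+", review.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_vocabulary(reviews, max_words=1000):
    """Map the most frequent tokens to ids 1..N; id 0 is reserved for unknown words."""
    counts = Counter(tok for review in reviews for tok in tokenize(review))
    # Sort by descending frequency, then alphabetically, so ids are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: i + 1 for i, (word, _) in enumerate(ordered[:max_words])}


def encode(review, vocab):
    """Replace each token with its vocabulary id (0 if the token is unseen)."""
    return [vocab.get(tok, 0) for tok in tokenize(review)]


# Hypothetical reviews, made up for this example.
reviews = [
    "The mask is comfortable and fits well",
    "This mask broke, terrible quality",
]
vocab = build_vocabulary(reviews)
encoded = [encode(r, vocab) for r in reviews]
```

In a real pipeline the encoded sequences would typically be padded to a fixed length before being fed to the model.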

Summary | Notebooks


Sale Volume Prediction with XGBoost

The goal of this project is to work with time-series data and use XGBoost to forecast sales volume for each store. A unique aspect of the dataset is that the list of stores and products changes every month, and the testing dataset contains new items that are not present in the training dataset.

Project workflow summary:

  • process outliers
  • impute missing data
  • discover data duplication
  • encode features
  • time-series analysis
  • feature engineering using target lags
  • generate trend features
  • modeling with XGBoost
  • model evaluation
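As a rough illustration of feature engineering with target lags, the sketch below attaches the sales from previous months to each row. The row layout (`store`, `month`, `sales`), the function name `add_target_lags`, and the sample numbers are assumptions for this example; the notebook's own column names and lag choices may differ. Rows with no history, such as a store that first appears in the current month, receive `None` for the missing lag.

```python
def add_target_lags(rows, lags=(1, 2)):
    """Return copies of the rows with sales_lag_<k> columns added.

    rows: list of dicts with keys "store", "month" (integer month index), "sales".
    A lag is None when the store has no record for that earlier month.
    """
    # Index sales by (store, month) so each lag lookup is a dictionary access.
    sales_by_key = {(r["store"], r["month"]): r["sales"] for r in rows}
    result = []
    for r in rows:
        new_row = dict(r)
        for lag in lags:
            key = (r["store"], r["month"] - lag)
            new_row[f"sales_lag_{lag}"] = sales_by_key.get(key)
        result.append(new_row)
    return result


# Hypothetical data: store "B" first appears in month 2, so it has no history.
rows = [
    {"store": "A", "month": 0, "sales": 10},
    {"store": "A", "month": 1, "sales": 12},
    {"store": "A", "month": 2, "sales": 15},
    {"store": "B", "month": 2, "sales": 7},
]
lagged = add_target_lags(rows)
```

Tree-based models such as XGBoost can handle missing values natively, so leaving the lag empty for new stores is one reasonable option; filling it with zero is another.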

Summary | Notebook


House Price Prediction with Stacked Regression

The goal of this notebook is to predict house prices using stacked regression models.

Project workflow summary:

  • process outliers
  • process missing data, impute values from other features
  • perform logarithmic transformation on skewed data
  • shuffle and split the data for training, validation, and testing
  • produce base models using lasso regression, elastic net regression, kernel ridge regression, gradient boosting regression, XGBoost, and LightGBM
  • stack models using the meta-model method, where out-of-fold predictions made on the held-out data are used as training data for a meta-model
  • evaluate results using root mean squared log error (RMSLE), which is more robust to outliers than traditional RMSE
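To make the evaluation metric concrete, here is a minimal sketch comparing RMSE with root mean squared log error. The function names and the sample prices are assumptions for illustration. Because RMSLE works on the difference of logarithms, it measures relative error, so a 10% miss on an expensive house counts about the same as a 10% miss on a cheap one, whereas RMSE is dominated by the largest absolute error.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error on the raw values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def rmsle(y_true, y_pred):
    """Root mean squared error on log(1 + value); inputs must be non-negative."""
    squared = [(math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical prices: every prediction is 10% too high.
prices = [100_000, 200_000, 1_000_000]
predictions = [110_000, 220_000, 1_100_000]

error_rmse = rmse(prices, predictions)    # driven mostly by the 1,000,000 house
error_rmsle = rmsle(prices, predictions)  # close to log(1.1) for every house
```

Training on log-transformed prices and scoring with plain RMSE in log space is equivalent in spirit to this metric.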

Summary | Notebook


Credit Card Fraud Detection with CNN

It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase. In this notebook, we will compare the different ways we can handle an imbalanced dataset in machine learning.

Most machine learning algorithms work best when the number of samples in each class is roughly equal. However, in anomaly detection problems, the positive class is almost always a small portion of the overall data. For example, in this credit card dataset, only 0.17% of transactions are classified as fraudulent.
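A short worked example shows why this imbalance matters. The sketch below uses synthetic labels with the same 0.17% fraud rate (the 100,000-row size and the function name are assumptions, not the real dataset) and scores a trivial classifier that labels every transaction as legitimate.

```python
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall on the positive class) for binary 0/1 labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    actual_positives = sum(y_true)
    accuracy = correct / len(y_true)
    recall = true_positives / actual_positives if actual_positives else 0.0
    return accuracy, recall


# Synthetic labels: 170 frauds out of 100,000 transactions, i.e. 0.17%.
n_total = 100_000
n_fraud = 170
y_true = [1] * n_fraud + [0] * (n_total - n_fraud)

# A useless model that never flags fraud.
y_pred = [0] * n_total

accuracy, recall = accuracy_and_recall(y_true, y_pred)
```

The model scores 99.83% accuracy while catching no fraud at all, which is why rebalancing the data and looking at metrics beyond accuracy are both worth considering.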

The goal of this notebook is to explore uneven data distributions and use a CNN model to detect anomalies.

Project workflow summary:

  • sample from positive class to balance the data
  • split data into train and test sets
  • use StandardScaler() to standardize features
  • create a model using a convolutional neural network (CNN)
  • evaluate model accuracy
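The balancing, splitting, and scaling steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the notebook's code: it assumes balancing is done by randomly undersampling the majority class, uses an 80/20 split, and reimplements the zero-mean, unit-variance scaling that scikit-learn's `StandardScaler` performs. All function names and the synthetic data are hypothetical.

```python
import random
import statistics


def undersample(rows, labels, seed=0):
    """Randomly drop majority-class rows until both classes have equal counts."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)  # shuffle so a later positional split mixes both classes
    return [rows[i] for i in kept], [labels[i] for i in kept]


def fit_standardizer(train_rows):
    """Compute per-feature mean and population standard deviation on training data."""
    columns = list(zip(*train_rows))
    means = [statistics.fmean(c) for c in columns]
    # Fall back to 1.0 for constant features to avoid dividing by zero.
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return means, stds


def standardize(rows, means, stds):
    """Scale each feature to zero mean and unit variance using fitted statistics."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Synthetic data: 1,000 rows with 2 features, 10 of them labeled positive.
rows = [[float(i), float(i % 7)] for i in range(1000)]
labels = [1 if i % 100 == 0 else 0 for i in range(1000)]

balanced_rows, balanced_labels = undersample(rows, labels)

# 80/20 train/test split on the balanced, shuffled data.
cut = int(0.8 * len(balanced_rows))
train_rows, test_rows = balanced_rows[:cut], balanced_rows[cut:]

# Fit the scaler on the training set only, then apply it to both sets.
means, stds = fit_standardizer(train_rows)
train_scaled = standardize(train_rows, means, stds)
test_scaled = standardize(test_rows, means, stds)
```

Fitting the scaling statistics on the training set alone keeps information from the test set out of the model inputs.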

Notebook
