Projects for Data Analysis (Python: Pandas, NumPy, Sklearn; R: ggplot, Shiny)

joyceft/Projects

Projects

Multiple data analysis projects across various industries (retail, e-commerce, real estate, manufacturing, transportation, etc.)

Twitch API Analysis (Python)

  • Gathered the project manager's requirements for finding important developer/API endpoints
  • Cleaned daily_logs from Nov 2017 to Feb 2018, extracted features, and mapped them to application_metadata
  • Analyzed and visualized daily_log for trends and correlations between features
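
As a rough illustration of the correlation step above, the following sketch computes a Pearson correlation between two numeric features. It is not the project's code: the function name `pearson` and the sample values are hypothetical, and the real analysis presumably used Pandas on the actual log columns.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and the two root sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant feature")
    return cov / (dev_x * dev_y)


# Hypothetical daily counts for two log features
daily_requests = [120, 150, 180, 210]
daily_errors = [12, 15, 18, 21]
print(pearson(daily_requests, daily_errors))
```

A value near 1 or -1 indicates a strong linear relationship between the two features; a value near 0 indicates little linear relationship.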

Zillow House Price Prediction (R/Python)

  • Performed exploratory data analysis, data mining, imputation of missing data, and feature selection and generation
  • Built machine learning models (Linear Regression, Random Forest, XGBoost) to predict next-season Zillow prices; tuned parameters and refined the models
  • Used R Shiny to visualize transaction and geometric data (how do properties and their prices vary from city to city in CA?)

Instacart Online Grocery Store Customer Reorder Prediction (MySQL, R, Python)

  • Built a relational database and ERD to clarify relationships between customers, retailers, and products; normalized raw data (MySQL)
  • Performed EDA and customer segmentation (demographics, historical purchase behavior analysis, product-based segments)
  • Queried data in MySQL and visualized it in Tableau to provide insights and recommendations to the Instacart team

Implementation of Mixed Base-Learner AdaBoost, Modified by Genetic Algorithm (R)

Self-written AdaBoost with mixed weak learners and feature selection using a Genetic Algorithm (GA), applied to real-world binary classification problems:

  • Selected 4 base weak learners from 12 candidates through grid search, trials, and evaluation
  • Implemented the AdaBoost algorithm, updating the weights of both the training samples and the learners in each iteration
  • Applied GA to select the weak learner combination, tuning parameters such as crossover rate, mutation rate, and elitism to optimize final model performance
  • Reduced overall model complexity by 75% without decreasing model performance, while increasing the model's interpretability
  • Conducted parameter tuning and feature engineering, increasing prediction accuracy by 6%
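
The weight updates described above can be sketched as follows. This is a minimal, textbook discrete AdaBoost written in Python for illustration, not the project's R implementation: the names `adaboost` and `predict` are hypothetical, labels are assumed to be -1/+1, and the "mixed" pool is modeled simply as a list of arbitrary callables. The GA-based selection step is not shown.

```python
import math


def adaboost(X, y, learners, rounds=5):
    """Discrete AdaBoost over a fixed pool of candidate weak learners.

    X: list of samples; y: list of labels in {-1, +1};
    learners: callables mapping a sample to -1 or +1.
    Returns a list of (learner_weight, learner) pairs.
    """
    n = len(X)
    w = [1.0 / n] * n  # sample weights start uniform
    ensemble = []
    for _ in range(rounds):
        # Pick the candidate with the lowest weighted training error
        best, best_err = None, None
        for h in learners:
            err = sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi)
            if best_err is None or err < best_err:
                best, best_err = h, err
        if best_err >= 0.5:
            break  # no candidate beats chance on the current weights
        err = max(best_err, 1e-10)  # clamp to avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)  # learner weight
        ensemble.append((alpha, best))
        if best_err == 0:
            break  # a perfect learner leaves nothing to reweight
        # Raise the weight of misclassified samples, lower the rest, renormalize
        w = [wi * math.exp(-alpha * yi * best(xi)) for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, x):
    """Sign of the weighted vote of all selected learners."""
    score = sum(alpha * h(x) for alpha, h in ensemble)
    return 1 if score >= 0 else -1
```

Each round touches both sets of weights: `alpha` is the weight assigned to the chosen learner, and `w` holds the per-sample weights that steer the next round toward the examples the ensemble still gets wrong.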

Uber Rider Behavior Analysis (Python)

  • Data cleaning, extraction, and EDA of NYC Uber rider/driver behavior
  • Time series analysis and feature engineering
  • Set the 'Churn label' based on different requirements
  • Built and modified rider churn prediction models (Logistic Regression, Random Forest) using Sklearn
  • Performed cost-benefit analysis of methods for new user acquisition and retention of potentially churning users
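
One plausible way to set a churn label under different requirements is an inactivity window, sketched below. The rule itself, the function name `churn_label`, and the 30-day default are assumptions for illustration; the project's actual definitions are not stated in this README.

```python
from datetime import date


def churn_label(last_trip, cutoff, inactivity_days=30):
    """Return 1 if the rider has been inactive longer than the window, else 0.

    last_trip: date of the rider's most recent trip
    cutoff: date on which the label is evaluated
    inactivity_days: hypothetical threshold; varies with the business requirement
    """
    return int((cutoff - last_trip).days > inactivity_days)


# The same rider can be labeled differently under different requirements
print(churn_label(date(2018, 2, 15), date(2018, 3, 1)))                      # 14 days idle, 30-day window
print(churn_label(date(2018, 2, 15), date(2018, 3, 1), inactivity_days=10))  # 14 days idle, 10-day window
```

Parameterizing the window makes it easy to regenerate labels and retrain the models whenever the definition of churn changes.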

Real Estate Estimation of Best Investment Area in NYC (Python)

  • Conducted data mining and feature extraction on Zillow historical house price estimates and Airbnb short-term rental prices, based on an ad-hoc business target: find the best investment area for short-term leasing
  • Performed data munging on features recorded in different units across datasets; wrote functions to link the data in a scalable way so that new data can be appended
  • Created specific metadata and metrics, such as Cap Rate and Occupancy Rate, to refine and better understand the business goal
  • Successfully targeted the best investment area in NYC based on the defined metrics and trend prediction
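
The two metrics named above can be sketched using their common definitions: occupancy rate as booked nights over available nights, and cap rate as annual net operating income over property value. The function names, parameters, and sample figures are hypothetical, and the project may have defined these metrics differently.

```python
def occupancy_rate(booked_nights, available_nights):
    """Share of available nights that were actually booked."""
    if available_nights <= 0:
        raise ValueError("available_nights must be positive")
    return booked_nights / available_nights


def cap_rate(nightly_price, occupancy, annual_expenses, property_value):
    """Annual net operating income divided by the property's value."""
    if property_value <= 0:
        raise ValueError("property_value must be positive")
    annual_revenue = nightly_price * 365 * occupancy
    net_operating_income = annual_revenue - annual_expenses
    return net_operating_income / property_value


# Hypothetical listing: $200 per night, booked 219 of 365 nights,
# $13,800 in yearly expenses, property valued at $600,000
occ = occupancy_rate(219, 365)
print(cap_rate(200, occ, 13_800, 600_000))
```

Computing the metric per neighborhood gives a single comparable number for ranking candidate investment areas.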

Water Usage Capacity Analysis and Prediction (SAS)

  • Feature generation/selection and transformation (Box-Cox)
  • Feature selection using various techniques/criteria (C_p, stepwise) in building a Linear Regression model
  • Checked assumptions and ran diagnostics using metrics such as studentized residuals, Cook's D, hat matrix diagonals, tolerance, VIF, etc.
  • Made predictions based on the selected model
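
Two of the quantities mentioned above have compact closed forms, sketched here in Python for illustration only (the project itself used SAS). The Box-Cox transform is (x^λ - 1)/λ, with log(x) as the λ = 0 case; tolerance is 1 - R² and VIF is its reciprocal, where R² comes from regressing one predictor on all the others. The function names are hypothetical.

```python
import math


def boxcox(x, lam):
    """Box-Cox transform of a single positive value for a given lambda."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam


def tolerance(r_squared):
    """Tolerance of a predictor, given R^2 from regressing it on the others."""
    return 1 - r_squared


def vif(r_squared):
    """Variance inflation factor, the reciprocal of tolerance."""
    if r_squared >= 1:
        raise ValueError("VIF is undefined when R^2 is 1 (perfect collinearity)")
    return 1 / tolerance(r_squared)


print(boxcox(4, 0.5))  # square-root-like transform
print(vif(0.8))        # predictor largely explained by the others
```

A large VIF (equivalently, a small tolerance) flags a predictor that is nearly a linear combination of the others.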
