Multiple Projects for Data Analysis in various industries (Retail, e-Commerce, Real Estate, Manufacturing, Transportation, etc.)
- Understood project manager's requirement for finding important developer/API endpoints
- Cleaned and extracted features of daily_logs from Nov 2017 to Feb 2018 and mapped them to application_metadata
- Analyzed and visualized daily_log for trends and correlations between features
- Exploratory data analysis; Data Mining; Imputation of Missing Data; Feature Selection & Generation
- Built machine learning models (Linear Regression, Random Forest, XGBoost) to predict next season's Zillow price; tuned parameters and modified models.
- R Shiny for visualization of transaction and geometric data (how do properties and their prices vary from city to city in CA?).
- Built a relational database and ERD to clarify relationships between customers, retailers and products; normalized raw data (MySQL)
- Performed EDA and customer segmentation (demographic, historical purchase behavior analysis, product-based segments)
- Queried in MySQL and visualized in Tableau to provide insights and recommendations to the Instacart team
Self-written mixed weak learner of AdaBoost with feature selection using a Genetic Algorithm on real-world binary classification problems:
- Selected 4 base weak learners among 12 learners by grid search, trials and evaluations
- Implemented the AdaBoost algorithm, updating the weights of both the training dataset and the learners in each iteration
- Applied GA to weak learner combination selection, tuning parameters such as crossover rate, mutation rate, elitism, etc. to optimize the final model performance.
- Reduced overall model complexity by 75% without decreasing model performance, while increasing the model's interpretability
- Conducted parameter tuning and feature engineering, increasing prediction accuracy by 6%
- Data cleaning, extraction, EDA of NYC Uber rider/driver behavior
- Time series analysis, feature engineering
- Set the churn label based on different requirements
- Built and modified rider churn prediction models (Logistic Regression, Random Forest) using scikit-learn
- Performed cost-benefit analysis of methods for new user acquisition and retention of potentially churning users
- Conducted data mining and feature extraction on Zillow historical house price estimates and Airbnb short-term rental prices, based on an ad-hoc business target: find the best investment area for short-term leasing
- Performed data munging on particular features with multiple units in different datasets; wrote functions to link data together in a scalable way so that new data can be appended
- Created specific metadata and metrics, such as Cap Rate and Occupancy Rate, to refine and better understand the business goal
- Successfully targeted the best investment area in NYC based on defined metrics and trend prediction
- Including feature generation/selection and transformation (Box-Cox);
- Feature selection using various techniques/criteria (C_p, stepwise) in building a Linear Regression model
- Checking assumptions and running diagnostics using metrics such as studentized residuals, Cook's D, hat matrix diagonals, tolerance, VIF, etc.
- Making predictions based on the selected model.
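
As an illustration of the diagnostics named in the regression bullets above, here is a minimal sketch in plain Python. It assumes a single-predictor model to keep the arithmetic short; the function name `simple_ols_diagnostics` and the sample data are hypothetical and are not taken from the project. Tolerance and VIF are left out because they only make sense with several predictors.

```python
import math


def simple_ols_diagnostics(x, y):
    """Fit y = b0 + b1*x by least squares and return per-point diagnostics.

    Hypothetical helper for illustration only (single predictor assumed).
    """
    n = len(x)
    p = 2  # number of fitted coefficients: intercept and slope
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean

    # Raw residuals and the residual mean square (estimate of error variance).
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)

    # Hat matrix diagonals (leverage); closed form for one predictor.
    leverage = [1 / n + (xi - x_mean) ** 2 / sxx for xi in x]

    # Internally studentized residuals: e_i / sqrt(mse * (1 - h_i)).
    studentized = [
        e / math.sqrt(mse * (1 - h)) for e, h in zip(residuals, leverage)
    ]

    # Cook's D: (r_i^2 / p) * h_i / (1 - h_i), with r_i studentized as above.
    cooks_d = [(r ** 2 / p) * h / (1 - h) for r, h in zip(studentized, leverage)]

    return {
        "intercept": b0,
        "slope": b1,
        "residuals": residuals,
        "leverage": leverage,
        "studentized": studentized,
        "cooks_d": cooks_d,
    }


# Made-up data: the last point sits far from the others in x.
x = [1, 2, 3, 4, 5, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
diag = simple_ols_diagnostics(x, y)
most_influential = max(range(len(x)), key=lambda i: diag["cooks_d"][i])
print("index with largest Cook's D:", most_influential)
```

Points with high leverage or a large Cook's D are candidates for closer inspection before the selected model is used for prediction. In practice a library such as statsmodels reports the same quantities for multi-predictor models.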