A collection of Data Science and Data Analysis projects to demonstrate my skill set.
-
Using Python and Pandas:
- Bank Marketing Prediction: Trained several machine learning models to predict whether a client will subscribe to a term deposit based on marketing campaign data. Models include logistic regression, k-nearest neighbors, decision tree, random forest (optimized using cross-validation), and neural networks
- Predicting Housing Prices: Built a linear model to predict house sale prices from a dataset with over 500,000 data points and 61 variables
Using R:
- Cities: Used a public dataset with aggregated data for each of Brazil's cities, such as population, GDP, and number of cars, to create a linear model that predicts the population of each city
- Baseball Analysis: Used libraries such as infer, ggplot2, and tidyverse to perform exploratory data analysis and create a linear model that predicts baseball wins from several independent variables
-
Using Python, Pandas, Matplotlib, Seaborn:
- Bike Sharing: Analyzed bike-sharing data from Washington, D.C. to gain insight into user behavior
- Text Analysis Using Twitter: Computed sentiment scores for tweets using the VADER lexicon to measure how positive or negative each tweet is
- Tuscan RFM Marketing Analysis: Took a merchant's dataset and split customers into deciles to identify the most profitable customers based on their recency, frequency, and monetary values. Calculated gross profit and ROI across all customer segments
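
As a rough illustration of the decile split used in RFM analysis, here is a minimal sketch in plain Python. It is not the project's code: the customer records, the `assign_deciles` helper, and the reference date are all made up for the example, and decile 1 is assumed to mean "best".

```python
from datetime import date

def assign_deciles(values, higher_is_better=True):
    """Return a decile (1 = best, 10 = worst) for each value, based on rank."""
    # Sort positions so that the "best" value comes first.
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=higher_is_better)
    deciles = [0] * len(values)
    for rank, idx in enumerate(order):
        deciles[idx] = rank * 10 // len(values) + 1
    return deciles

# Hypothetical customers: (id, last purchase date, number of orders, total spend)
today = date(2024, 1, 31)
customers = [(i, date(2024, 1, 1 + i), i + 1, 10.0 * (i + 1)) for i in range(20)]

recency = [(today - last).days for _, last, _, _ in customers]
frequency = [orders for _, _, orders, _ in customers]
monetary = [spend for _, _, _, spend in customers]

# A recent purchase is better, so lower recency ranks higher.
r_dec = assign_deciles(recency, higher_is_better=False)
f_dec = assign_deciles(frequency)
m_dec = assign_deciles(monetary)

# One (R, F, M) decile triple per customer id.
rfm = {cid: (r, f, m)
       for (cid, *_), r, f, m in zip(customers, r_dec, f_dec, m_dec)}
```

Customers whose three deciles are all low are the most recent, most frequent, and highest-spending buyers, which is the segment an RFM analysis typically targets first.
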
Using SQL:
- IMDb Analysis: Analyzed an IMDb dataset of 4 tables with over 121k rows and 23 variables. Leveraged advanced SQL features such as JOIN, WITH, and CASE to combine information from multiple tables and answer questions such as "Who are the top 10 most prolific movie actors?" and "How does film length relate to ratings?"
Using R:
- Flights: Analyzed a dataset with 113k rows and 19 variables on departing flights in the US to assess the impact of COVID-19 on flights and to compute some general statistics
- People's Park: Summarized anonymized survey responses from 1,250 students, provided by the Chancellor's Office at the University of California, Berkeley. The survey was designed to capture students' perspectives on the ongoing controversy over the People's Park project
- Hypothesis Testing Using P-Values: Used the infer library to permute datasets, recompute the statistic of interest, and compute p-values to decide whether to reject or fail to reject the null hypothesis for different datasets
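
The permutation approach to p-values can be sketched in a few lines. The project itself uses R's infer library; the version below is a plain-Python stand-in, and the function name, the choice of statistic (absolute difference in means), and the sample data are assumptions made for illustration.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are exchangeable,
        # so shuffle the pooled values and split them again.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Share of shuffles at least as extreme as the observed difference.
    return extreme / n_permutations

# Hypothetical samples with clearly different means
p_value = permutation_p_value([10, 11, 12, 13, 14, 15, 16, 17],
                              [1, 2, 3, 4, 5, 6, 7, 8])
reject_null = p_value < 0.05
```

A small p-value means few shuffles reproduce a difference as large as the observed one, which is evidence against the null hypothesis; a large p-value means the test fails to reject it.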