Last update: 25th July 2017
This is a brief introduction to some of my Machine Learning and Deep Learning projects. In most cases the data sets are public, so the source is given in the corresponding README file. Enjoy!
Anonymized credit card transactions labeled as fraudulent or genuine. The anonymization was achieved by applying Principal Component Analysis. There are 492 frauds out of 284,807 transactions.
- Features: 30
- Observations: 284,807
- Tuples: 8,544,210
Challenges: Imbalanced data | Understanding PCA
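To see why the imbalance is a challenge, the counts above can be plugged into a quick calculation: a trivial model that labels every transaction as genuine already looks almost perfect on accuracy while catching no fraud at all. This is only a sketch; the function name `baseline_metrics` is illustrative and not part of the project.

```python
# Counts taken from the data set description above.
N_TRANSACTIONS = 284_807
N_FRAUDS = 492


def baseline_metrics(n_total, n_positive):
    """Metrics of a trivial model that predicts 'genuine' for every row."""
    fraud_rate = n_positive / n_total
    accuracy = (n_total - n_positive) / n_total  # every genuine row is right
    recall = 0.0  # no fraud is ever flagged
    return fraud_rate, accuracy, recall


fraud_rate, accuracy, recall = baseline_metrics(N_TRANSACTIONS, N_FRAUDS)
print(f"fraud rate: {fraud_rate:.4%}")
print(f"accuracy of the always-genuine baseline: {accuracy:.4%}")
print(f"recall on frauds: {recall:.0%}")
```

This is why metrics such as recall or precision on the fraud class tend to be more informative than plain accuracy for this kind of data.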
Why are our best and most experienced employees leaving prematurely?
- Features: 9
- Observations: 14,999
- Tuples: 134,991
Challenges: Detect outliers
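One common way to flag outliers is the interquartile range (IQR) rule. The sketch below is an assumption about how it could be done, not necessarily the method used in the project; the sample values are made up.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical monthly working hours; 310 is deliberately extreme.
hours = [150, 155, 160, 162, 165, 170, 172, 175, 180, 310]
print(iqr_outliers(hours))
```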
Many real-world applications need to know a user's location to provide their services, so automatic user localization has been a hot research topic in recent years. It consists of estimating the position of the user (latitude, longitude and altitude) with an electronic device, usually a mobile phone. Outdoor localization can be solved very accurately thanks to the GPS sensors built into mobile devices. Indoor localization, however, is still an open problem, mainly because the GPS signal is lost in indoor environments. Although there are several indoor positioning technologies and methodologies, this database focuses on WLAN fingerprint-based ones (also known as WiFi Fingerprinting).
- Features: 529
- Observations: 19,937
- Tuples: 10,546,673
Challenges: Reduce the data set to cut download and computation time | Indoor localization
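The idea behind WiFi fingerprinting can be sketched as a nearest-neighbour lookup: each known location stores a vector of signal strengths, one per access point, and a new reading is assigned to the location with the closest stored vector. The values, room labels and the function `nearest_fingerprint` below are hypothetical and only illustrate the principle.

```python
import math


def nearest_fingerprint(query, database):
    """Return the label whose stored signal vector is closest to `query`."""
    best = min(database, key=lambda entry: math.dist(entry[0], query))
    return best[1]


# Hypothetical fingerprints: signal strength (dBm) from three access points.
database = [
    ([-40, -70, -90], "room A"),
    ([-85, -45, -60], "room B"),
]
print(nearest_fingerprint([-42, -72, -88], database))
```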
It is your job to predict if a passenger survived the sinking of the Titanic or not.
- Features: 11
- Observations: 891
- Tuples: 10,692
Challenges: Missing value treatment | Working with text
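Both challenges can be illustrated with a small sketch. It assumes missing numeric values are filled with the median and that passenger names follow a "Surname, Title. Given names" pattern; the helper names are illustrative and the age values are made up.

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def extract_title(name):
    """Pull the honorific out of a 'Surname, Title. Given names' string."""
    return name.split(",")[1].split(".")[0].strip()


# Hypothetical ages with two gaps.
ages = [22, None, 38, 26, None, 35]
print(impute_median(ages))
print(extract_title("Braund, Mr. Owen Harris"))
```

Median imputation is only one option; dropping rows or predicting the missing value from other features are alternatives with different trade-offs.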
The main purpose is to analyze basket composition from purchase tickets to study which products consumers buy together. This analysis might be the foundation for a customer cluster analysis or a product recommendation system.
- Type: list of lists
- Observations: 9,835 ticket lists
Challenges: Association Analysis per se
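Association analysis usually rests on three measures for a rule A -> B: support, confidence and lift. The sketch below computes them on a handful of made-up tickets stored as a list of lists; the function names are illustrative.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(basket) for basket in baskets) / len(baskets)


def rule_metrics(baskets, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    s_both = support(baskets, set(antecedent) | set(consequent))
    s_ante = support(baskets, antecedent)
    s_cons = support(baskets, consequent)
    confidence = s_both / s_ante
    lift = confidence / s_cons
    return s_both, confidence, lift


# Hypothetical tickets, one inner list per purchase.
tickets = [
    ["milk", "bread"],
    ["milk", "bread", "butter"],
    ["bread", "butter"],
    ["milk"],
]
print(rule_metrics(tickets, ["milk"], ["bread"]))
```

A lift above 1 suggests the two item sets appear together more often than independence would predict, and a lift below 1 suggests the opposite.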
The intention here is just to explore the data set and put some data visualization libraries and tools into action.
- Features: 9
- Observations: 205,580
- Tuples: 1,850,220
Challenges: Work with different visualization tools | work with Python
For this project I explore publicly available data from LendingClub.com. Lending Club is a peer-to-peer lending platform connecting people who need money (borrowers) with people who have money (investors). As an investor I would want to invest in people who showed a profile with a high probability of paying me back. I am going to create a model that helps me predict this.
I am going to use data from 2007 to 2010 to classify and predict whether or not the borrower paid back their loan in full. The web repository with this data set is here
- Features: 18
- Observations: 9,578
- Tuples: 172,404
Challenges: ML with Python
This data set is a compendium, from different sources, of SMS messages classified as spam or ham. The goal is to build a model that can easily detect whether an SMS is relevant or not. NLP tools and techniques help to do this, much as today's spam filters do.
- Observations: 5,574
- One label + one string as feature
Challenges: Natural Language Processing with Python
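A typical first NLP step for this kind of task is to turn each message into a bag of words. The sketch below assumes a simple regex tokenizer and uses made-up messages; it is not necessarily the preprocessing used in the project.

```python
import re
from collections import Counter


def tokenize(message):
    """Lowercase the message and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", message.lower())


def bag_of_words(messages):
    """Map each message to a Counter of its token frequencies."""
    return [Counter(tokenize(m)) for m in messages]


# Hypothetical messages, one per class.
messages = ["WIN a FREE prize now!", "Are we still on for lunch?"]
for label, bag in zip(["spam", "ham"], bag_of_words(messages)):
    print(label, dict(bag))
```

The resulting counts can then be fed to a classifier, for example Naive Bayes.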
This project was done in the context of Udacity's Deep Learning Nanodegree. It is my first neural network, so the challenges were multiple. The data set consists of information about a bike rental business. I need to build a NN to predict daily bike rental ridership.
The dataset is from UCI repository and can be downloaded here
- Features: 17
- Observations: 17,380
- Tuples: 295,460
Challenges: My first neural network. Understanding the concepts (back propagation, forward pass, gradient descent) and programming the math without using any deep learning package.
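The three concepts can be sketched in plain Python with a toy network: one scalar input, one sigmoid hidden layer and a linear output, trained by stochastic gradient descent on squared error. This is a minimal illustration under those assumptions, with made-up data and hyperparameters, and not the network built for the project.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train(samples, n_hidden=3, lr=0.1, epochs=2000, seed=0):
    """Fit scalar y from scalar x and return a predict(x) function."""
    rng = random.Random(seed)
    w_h = [rng.uniform(-1, 1) for _ in range(n_hidden)]  # input -> hidden
    b_h = [0.0] * n_hidden                               # hidden biases
    w_o = [rng.uniform(-1, 1) for _ in range(n_hidden)]  # hidden -> output
    b_o = 0.0                                            # output bias

    def forward(x):
        # Forward pass: sigmoid hidden activations, then a linear output.
        h = [sigmoid(w * x + b) for w, b in zip(w_h, b_h)]
        y_hat = sum(w * a for w, a in zip(w_o, h)) + b_o
        return h, y_hat

    for _ in range(epochs):
        for x, y in samples:
            h, y_hat = forward(x)
            err = y_hat - y  # derivative of 0.5 * (y_hat - y)**2 w.r.t. y_hat
            for j in range(n_hidden):
                # Back propagation: chain rule through the sigmoid unit j,
                # computed before w_o[j] is updated.
                grad_h = err * w_o[j] * h[j] * (1.0 - h[j])
                # Gradient descent step on each parameter.
                w_o[j] -= lr * err * h[j]
                w_h[j] -= lr * grad_h * x
                b_h[j] -= lr * grad_h
            b_o -= lr * err

    return lambda x: forward(x)[1]


def mse(predict, samples):
    return sum((predict(x) - y) ** 2 for x, y in samples) / len(samples)


# Hypothetical data: y = 2x on five points in [0, 1].
samples = [(i / 4, 2 * i / 4) for i in range(5)]
print("MSE before training:", mse(train(samples, epochs=0), samples))
print("MSE after training: ", mse(train(samples), samples))
```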
This data is the result of a wavelet transformation on pictures of banknotes. The class to be predicted is whether the banknote has been forged or, on the contrary, is authentic.
- Features: 3
- Observations: 1,372
- Tuples: 4,116
Challenges: Use of TensorFlow
In this project, I'll classify images from the CIFAR-10 dataset. The dataset consists of airplanes, dogs, cats, and other objects.
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
Challenges: Preprocess the images and build convolutional, max pooling, dropout, and fully connected layers. Cloud computing using FloydHub.
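Some of these steps can be illustrated without any framework. The sketch below assumes 8-bit pixel values, works on a single-channel image stored as a nested list for simplicity, and uses illustrative function names; a real CIFAR-10 pipeline would operate on 32x32x3 arrays.

```python
def normalize(image):
    """Scale 8-bit pixel values from [0, 255] to [0.0, 1.0]."""
    return [[pixel / 255 for pixel in row] for row in image]


def one_hot(label, n_classes=10):
    """Encode a class index as a one-hot list (CIFAR-10 has 10 classes)."""
    return [1 if i == label else 0 for i in range(n_classes)]


def max_pool_2x2(image):
    """2x2 max pooling with stride 2 on a 2D list with even height and width."""
    return [
        [
            max(image[r][c], image[r][c + 1],
                image[r + 1][c], image[r + 1][c + 1])
            for c in range(0, len(image[0]), 2)
        ]
        for r in range(0, len(image), 2)
    ]


# Hypothetical 4x4 single-channel image.
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
print(max_pool_2x2(image))
print(one_hot(3))
```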
In this project, I'll generate my own Simpsons TV script using RNNs. I'll be using part of the Simpsons dataset of scripts from 27 seasons. The Neural Network I'll build will generate a new TV script for a scene at Moe's Tavern.
Full dataset can be found on Kaggle's database here
Challenges: Preprocess text (tokenization, embedding), build a recurrent neural network, work with LSTM and Word2Vec architectures.
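Before text reaches an embedding layer, each word is usually mapped to an integer id. The sketch below builds the two lookup tables under the assumption that ids are assigned by descending word frequency; the function name and the sample line are made up.

```python
from collections import Counter


def create_lookup_tables(words):
    """Build word -> id and id -> word dictionaries, most frequent word first."""
    counts = Counter(words)
    vocab = sorted(counts, key=counts.get, reverse=True)
    vocab_to_int = {word: i for i, word in enumerate(vocab)}
    int_to_vocab = {i: word for word, i in vocab_to_int.items()}
    return vocab_to_int, int_to_vocab


# Hypothetical script fragment, already tokenized on whitespace.
words = "moe pours a beer and moe sighs".split()
vocab_to_int, int_to_vocab = create_lookup_tables(words)
encoded = [vocab_to_int[w] for w in words]
print(encoded)
print([int_to_vocab[i] for i in encoded])
```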
In this project, I am going to take a peek into the realm of neural machine translation. I'll be training a sequence-to-sequence model on a dataset of English and French sentences so that it can translate new sentences from English to French.
Challenges: build a sequence to sequence architecture
In this project, I am going to use Generative Adversarial Networks to generate new images of faces. The input will be a set of celebrity images that my generator will try to imitate, so that a new face is created and seen by the discriminator as "Real".
Challenges: GAN