This repo illustrates how to build a machine learning classifier to predict the flair of posts on r/india.
Go to r/india and open a post
Copy its url and paste it into the app
Live web app is here: Website
The following installation has been tested on MacOSX 10.13.6 and Ubuntu 16.04.
This project requires Python 3 and the following Python libraries installed (plus a few others depending on the task):
- Clone the repo
git clone https://github.com/gauravchopracg/Reddit-Flair-Detection.git
cd Reddit-Flair-Detection/
- Install Dependencies
pip install -r requirements.txt
In this part, I collected two datasets:
- 1-year dataset: posts from 1st January 2019 to 1st January 2020 with the features title, flair and body, collected using Pushshift's API
- Balanced dataset: 100 posts from 9 flairs, collected using the praw module.
The two datasets were collected to test different machine learning algorithms and deep learning models, one on a subset and the other on a full year of data; later they were used as the train and test sets.
For detailed notes, please look here.
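As a rough sketch of the two collection targets described above, the snippet below computes the epoch-second bounds of the one-year window and caps a list of posts at a fixed number per flair. The helper names (`year_window`, `balance_by_flair`), the dict layout of a post and the sample flair names are assumptions made for illustration, not the repository's code; the actual download through Pushshift or praw is left out.

```python
from collections import defaultdict
from datetime import datetime, timezone


def year_window(start_year):
    """Return (after, before) epoch seconds for 1 Jan start_year to 1 Jan of the next year, UTC."""
    after = int(datetime(start_year, 1, 1, tzinfo=timezone.utc).timestamp())
    before = int(datetime(start_year + 1, 1, 1, tzinfo=timezone.utc).timestamp())
    return after, before


def balance_by_flair(posts, per_flair=100):
    """Keep at most per_flair posts for each flair, preserving input order."""
    kept = []
    counts = defaultdict(int)
    for post in posts:
        flair = post["flair"]
        if counts[flair] < per_flair:
            counts[flair] += 1
            kept.append(post)
    return kept


# Hypothetical posts with the three features named above: title, flair and body.
sample = [
    {"title": "Budget 2019", "flair": "Politics", "body": ""},
    {"title": "Election results", "flair": "Politics", "body": ""},
    {"title": "Street food in Delhi", "flair": "Food", "body": ""},
]
after, before = year_window(2019)
balanced = balance_by_flair(sample, per_flair=1)
print(after, before, [p["title"] for p in balanced])
```

Capping per flair keeps any single flair from dominating the small dataset, while the yearly window keeps whatever class distribution the subreddit actually had.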
In this part, we try to understand the data, build intuition about it and find insights in it. It consists of:
- Univariate Analysis
- Bivariate Analysis
- Feature Engineering
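The three steps above can be illustrated with a minimal, standard-library sketch. It assumes each post is a dict with `title` and `flair` keys, and the title-length feature is only an example of an engineered feature; the source does not say which features were actually analysed.

```python
from collections import Counter, defaultdict
from statistics import mean


def flair_counts(posts):
    """Univariate view: number of posts per flair."""
    return Counter(post["flair"] for post in posts)


def mean_title_length_by_flair(posts):
    """Bivariate view: average title length in words for each flair."""
    lengths = defaultdict(list)
    for post in posts:
        lengths[post["flair"]].append(len(post["title"].split()))
    return {flair: mean(values) for flair, values in lengths.items()}


def add_title_length(posts):
    """Feature engineering: return copies of the posts with a title_words field."""
    return [dict(post, title_words=len(post["title"].split())) for post in posts]


# Hypothetical sample data, for illustration only.
sample = [
    {"title": "Budget 2019", "flair": "Politics"},
    {"title": "Election results declared today", "flair": "Politics"},
    {"title": "Street food in Delhi", "flair": "Food"},
]
print(flair_counts(sample))
print(mean_title_length_by_flair(sample))
print(add_title_length(sample)[0])
```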
This part includes:
- Data Preprocessing
- Hyperparameter Optimization
- Choosing a Validation Strategy
- Trying both machine learning and deep learning frameworks
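The source does not spell out which preprocessing steps were applied, so the following is only one plausible sketch of text cleaning for post titles: lowercasing, dropping URLs and punctuation, and removing a few stop words. The function name `preprocess` and the tiny stop-word list are assumptions for illustration.

```python
import re

# A tiny illustrative stop-word list; a real project would likely use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "to", "and", "for"}


def preprocess(text):
    """Lowercase, drop URLs and non-alphanumeric characters, remove stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove links
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # remove punctuation and symbols
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


cleaned = preprocess("The Budget of 2019: what is in it for YOU? https://example.com")
print(cleaned)
```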
Machine Learning Algorithm | Train Accuracy | Validation Accuracy | Test Accuracy |
---|---|---|---|
Logistic Regression (Title only) | 0.615 | 0.623 | 0.402 |
Logistic Regression (Title only + Preprocessing) | 0.546 | 0.493 | 0.621 |
BERT (Title + Body + Preprocessing) | 0.671 | 0.546 | 0.651 |
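For reference, the accuracy columns can be read as the fraction of posts whose predicted flair matches the true flair on each split. That reading is an assumption, since the source does not define the metric; a minimal sketch with made-up labels:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels, for illustration only.
true_flairs = ["Politics", "Food", "Sports", "Politics"]
predicted_flairs = ["Politics", "Food", "Politics", "Politics"]
score = accuracy(true_flairs, predicted_flairs)
print(score)
```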
The web application was developed with Python and the Flask framework, following the Flask Mega-Tutorial for Python 3.6.
To run the app on your computer:
- Clone the repo
$ git clone https://github.com/gauravchopracg/Reddit-Flair-Detection.git
$ cd "Reddit-Flair-Detection/Web Application"
- Install Dependencies
$ pip install -r requirements.txt
- Set the FLASK_APP environment variable
$ export FLASK_APP=rfd.py
If you are using Microsoft Windows, use set instead of export in the command above
- Run
$ flask run
* Serving Flask app "rfd"
* Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
The web application is deployed to the Heroku cloud platform. A developer API has been implemented with Flask; it returns JSON containing a dictionary in which each key is the URL of a post and each value is the predicted flair.
It can be accessed by sending a POST request:
import requests

# Upload test.txt and read back the JSON mapping of post URL to predicted flair
with open('test.txt', 'rb') as f:
    r = requests.post("http://rdflair.herokuapp.com/automated_testing", files={'upload_file': f})
print(r.json())
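To make the response shape concrete, here is a minimal sketch of how such an endpoint might turn an uploaded file into the JSON described above. It assumes the file holds one post URL per line, which the source does not state; `predict_flair` is a placeholder for the trained model and `build_response` is a hypothetical helper, neither taken from the repository.

```python
import json


def predict_flair(url):
    """Placeholder standing in for the trained model; always returns the same label."""
    return "Politics"


def build_response(file_bytes, predictor=predict_flair):
    """Map each non-empty line (assumed to be a post URL) to its predicted flair."""
    urls = [line.strip() for line in file_bytes.decode("utf-8").splitlines()]
    predictions = {url: predictor(url) for url in urls if url}
    return json.dumps(predictions)


# Hypothetical upload contents: one made-up post URL followed by a blank line.
uploaded = b"https://www.reddit.com/r/india/comments/abc123/example_post/\n\n"
body = build_response(uploaded)
print(body)
```

On the client side, `r.json()` would then give back the same dictionary, keyed by post URL.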