A Flask app for flair identification on the r/india subreddit: it takes the URL of an r/india post and predicts the post's flair. The web application is hosted on Heroku at https://redditflairid.herokuapp.com/.
Python packages used
- PRAW
- Scikit-learn
- NLTK
- Numpy
- Pandas
- Flask
The requirements.txt file lists all the dependencies used in the notebooks and in the Flask app.
- model: Contains the trained ML model which makes the prediction.
- notebooks: Contains the ipynb notebooks for data scraping, preprocessing, EDA and classification.
- static: Contains the main.css file used for the frontend.
- support: Contains the scripts for prediction and preprocessing of the text data and config.json.
- templates: Contains HTML files for the web-application.
- app.py: File to run to start the web application.
- requirements.txt: Lists the project dependencies.
Edit "config.json" and add your PRAW credentials.
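As a minimal sketch of how these credentials might be loaded, the snippet below reads a JSON file and checks for the fields that PRAW's read-only authentication needs. The key names (`client_id`, `client_secret`, `user_agent`) and the helper names are assumptions for illustration; the project's actual config.json may use different keys.

```python
import json
from pathlib import Path

# Assumed key names; check support/config.json for the ones the project uses.
REQUIRED_KEYS = ("client_id", "client_secret", "user_agent")


def load_praw_config(path):
    """Read the JSON config and fail early if a credential is missing or empty."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError(f"config is missing: {', '.join(missing)}")
    # Pass on only the keys PRAW needs, ignoring anything else in the file.
    return {key: config[key] for key in REQUIRED_KEYS}


def make_reddit_client(path="support/config.json"):
    """Build a read-only PRAW client from the loaded credentials."""
    import praw  # imported lazily so the loader works without PRAW installed

    return praw.Reddit(**load_praw_config(path))
```

Validating the file up front gives a clear error message at startup instead of an authentication failure on the first request.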
Posts in r/india can cover many topics. Each post is tagged for filtering purposes; these tags are called flairs on Reddit. r/india has flairs such as Politics, AskIndia, and Science/Technology. The web application lets the user enter the URL of an r/india post and displays the predicted flair for that post.
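Before a prediction can be made, the submitted URL has to be checked and mapped to a post. The sketch below shows one way this could be done with the standard library alone; the function name and the error messages are hypothetical, and the app itself may instead rely on PRAW, whose `reddit.submission(url=...)` accepts a post URL directly.

```python
from urllib.parse import urlparse


def extract_submission_id(url, subreddit="india"):
    """Return the post ID from a Reddit comments URL, or raise ValueError."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host != "reddit.com" and not host.endswith(".reddit.com"):
        raise ValueError("not a reddit.com URL")
    # Expected path shape: /r/<subreddit>/comments/<post_id>/<optional_slug>/
    parts = [part for part in parsed.path.split("/") if part]
    if len(parts) < 4 or parts[0] != "r" or parts[2] != "comments":
        raise ValueError("not a link to a Reddit post")
    if parts[1].lower() != subreddit.lower():
        raise ValueError(f"post is not from r/{subreddit}")
    return parts[3]
```

Rejecting URLs from other subreddits early keeps the classifier from being asked about flairs it was never trained on.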
To run on a local server:
- Clone the repository.
git clone https://github.com/gaurav104/Reddit-Flair-Identification-Flask.git
- Create a virtual environment.
python3 -m venv flair_detector
source flair_detector/bin/activate
cd Reddit-Flair-Identification-Flask/
- Install the project dependencies.
pip3 install -r requirements.txt
- To run the server locally, execute the following command.
python3 app.py
In the notebooks folder:
- Data Scraping.ipynb: Depicts the data scraping process using Pushshift.
- Text Preprocessing.ipynb: Describes the data cleaning and preprocessing, which include steps such as punctuation removal, stopword removal, lemmatization, tokenization, etc.
- EDA.ipynb: Performs an exploratory data analysis on the cleaned data: average post lengths and word counts, topic modelling using LDA (Latent Dirichlet Allocation) and NMF (Non-negative Matrix Factorization), etc.
- Classification.ipynb: Performs classification on the preprocessed data, evaluates the model's performance, and analyses the predicted and actual labels.
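The cleaning steps named for the preprocessing notebook can be sketched in a few lines. This is only an illustration under stated assumptions, not the notebook's code: it uses the standard library, a tiny hand-written stopword set in place of NLTK's English stopwords, and it leaves out lemmatization, which the notebook would presumably handle with NLTK.

```python
import string

# Tiny illustrative stopword set (an assumption); NLTK's English stopword
# list would be the natural choice in the notebook itself.
STOPWORDS = {"a", "an", "and", "the", "is", "are", "in", "of", "to", "for", "on"}

# Map every ASCII punctuation character to a space so that words joined by
# punctuation (e.g. "Science/Technology") are split into separate tokens.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text):
    """Lowercase, remove punctuation, tokenize on whitespace, drop stopwords."""
    cleaned = text.lower().translate(_PUNCT_TABLE)
    tokens = cleaned.split()
    return [token for token in tokens if token not in STOPWORDS]
```

For example, `preprocess("The Budget of India is announced!")` returns `["budget", "india", "announced"]`.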
- Improving predictions by automatically updating the model parameters through continued training on new posts from r/india.
- Incorporating deep learning models such as LSTMs/GRUs and BERT.