- Introduction
- How to Use the App
- Tools Used
- Directory Structure
- Steps To Run Red-Flair
- Dependicies
- Project Approach
- Analysis of Data
- Sources
Red-Flair is a flair detector for posts belonging to r/india subreddit on Reddit. Its web app is deployed using Heroku which can be accessed here. It scrapes post's using the URL and then uses an SVC model to predict the flair of that post.
-
Open the app at Red-Flair.
-
Open the reddit post belongin to r/india subreddit. Copy the full URL of the post.
-
This will redirect you to Flairs Detected Page. Here it will display your submitted URL on the top, and the flair that has been predected. By clicking on the link which is highlighted in blue, it would re-direct to the URL which was submitted earlier. You might need to re-fresh the pages one or twice to load the content from Heroku Servers.
- Python3
- Sci-kit Learn
- NLTK
- FLASK
- Flask-WTF
- D3.JS
- Xgboost
- Praw
- Jupyter
- Mongodb Compass
- HTML/CSS
- Pandas
- Numpy
- Heroku
RedFlair ├── data │ ├── csv_files │ └── mongodb ├── images ├── others ├── static └── templates
- / : It contains all the files and directories.
- app.py : The Main Python file for Flask Based Web App.
- forms.py : A Python file for taking input from the user using Flask-WTF library.
- model : A Pickled-version of ML model which is used to make predictions.
- model.py : A Python file for predicting Flair.
- nltk.txt : A txt file for Heroku for downloading "stopwords","wordnet","punkt" which are used for data cleaning.
- Procfile : A file which contains the command which are required to run the flask based app on Heroku.
- requirements.txt : A txt file for Heroku to satisfy the requirements to run the web app.
- tfidf : A Pickled-version of TF-IDF VECTORIZER for converting words into vectors.
- Data : It contains all the data used in the project.
- csv_files : It contains the data in .csv files.
- mongodb : It contains the data in .json and .bson (mongodb instance) form.
- images : It contains all the images used to create this README.md
- others : It contains the .ipynb files which are used for scraping data, cleaning data, combining data, training various models and analysing data.
- static : It contains the .css files for the web-app.
- templates : It contains the .html files for the web-app.
- Clone The Repository.
git clone https://github.com/parasmehan123/RedFlair.git
- change directory to RedFlair.
cd RedFlair
- Follow this tutorial to get client id,etc. of Reddit.
- Open model.py and enter the information like client id,etc. in get_flair().
- Run the app.py file.
python3 app.py
- Clone The Repository.
git clone https://github.com/parasmehan123/RedFlair.git
- change directory to RedFlair.
cd RedFlair
- Copy and Paste the .ipynb files from ./others/ to ./data/csv_files
cp ./others/<file>.ipynb ./data/csv_files
- Open Jupyter
jupyter notebook
- Go the the .ipynb file in /data/csv_files and select Cell > "Run All".
- Copy paste the "model" and "tfidf" files back to /RedFlair for using it on the web app.
- Flask==1.0.2
- Flask-WTF==0.14.2
- jupyter==1.0.0
- nltk==3.4
- numpy==1.16.2
- pandas==0.24.2
- praw==6.3.1
- prawcore==1.0.1
- scikit-image==0.14.2
- scikit-learn==0.20.3
- WTForms==2.2.1
- xgboost==0.90
For data acquisition I've used 2 Methods.
Using Python Praw library, I was able to scrape around 1800 posts for eleven flairs, using a simple API calls. The code for data scraping is saved in data_scraping.ipynb and the data is in data.csv.
I've found a database containing Reddit posts on google big-query. Using the below SQL query I've been able to scrape data of 16000 posts belonging to 8 flair categories from January, February, and March 2019. I searched way back till 2017, but couldn't find posts belonging to Political, AMA, Finance/Business flair types, in the database.
SELECT *
FROM [fh-bigquery:reddit_posts.2019_01],[fh-bigquery:reddit_posts.2019_02],[fh-bigquery:reddit_posts.2019_03]
WHERE subreddit == "india" && link_flair_text == "flair"
LIMIT 20000
I've followed the following steps to ceate a model to predict flairs.
Since the data I've got from Google Big-Query didn't have the posts of few flairs types, hence I've combined this data(8 flair types) with the data (remaining three flairs) obtained by using praw library. Then I've divided data of each flair into testing and training with the division being 20:80. As the data from Big-Query was massive (6000 posts in Non-Political type) in certain flairs category hence I've put the condition of a maximum of 500 flairs in training data and 125 in training data. The code of data-combining can be found here and the training data, testing data which is used to build the model can be found here and here.
- I've followed the following steps to clean the text of the posts.
- Convert words to Lowercase.
- Tokenize the word using word_tokenize.
- Removed the stop-words.
- Removed the punctuation.
- Lematization using WordNetLemmatizer.
- I have used the same procedure to clean the post of new test cases on the web-app.
- I've used Tf-Idf Vectorizer from Sklearn.
- I've also exported the vectorizer to be used in the web app.
- I've trained models using different classfiers and features.
- SVC
- DecisionTreeClassifier
- LogisticRegression
- RandomForestClassifier
- XGBClassifier
- Features : Title
Classfier No. | Accuracy on 5 Fold Cross-Validation on Training Set | Accuracy on Testing Set |
---|---|---|
1 | 0.8863863863863863 | 0.47695390781563124 |
2 | 0.9629629629629629 | 0.3937875751503006 |
3 | 0.7114614614614615 | 0.46893787575150303 |
4 | 0.9629629629629629 | 0.46392785571142287 |
5 | 0.6566566566566566 | 0.4248496993987976 |
- Features : URL
Classfier No. | Accuracy on 5 Fold Cross-Validation on Training Set | Accuracy on Testing Set |
---|---|---|
1 | 0.4469469469469469 | 0.3897795591182365 |
2 | 0.4727227227227227 | 0.3657314629258517 |
3 | 0.43343343343343343 | 0.3817635270541082 |
4 | 0.4727227227227227 | 0.3787575150300601 |
5 | 0.4291791791791792 | 0.36472945891783565 |
- Features : Body
Classfier No. | Accuracy on 5 Fold Cross-Validation on Training Set | Accuracy on Testing Set |
---|---|---|
1 | 0.3228228228228228 | 0.23346693386773548 |
2 | 0.3425925925925926 | 0.2154308617234469 |
3 | 0.2702702702702703 | 0.23647294589178355 |
4 | 0.3425925925925926 | 0.23647294589178355 |
5 | 0.33208208208208206 | 0.251503006012024 |
- Features : Title+Body
Classfiers No. | Accuracy on 5 Fold Cross-Validation on Training Set | Accuracy on Testing Set |
---|---|---|
1 | 0.918918918918919 | 0.503006012024048 |
2 | 0.974974974974975 | 0.3807615230460922 |
3 | 0.7457457457457457 | 0.4939879759519038 |
4 | 0.974974974974975 | 0.46893787575150303 |
5 | 0.705955955955956 | 0.47194388777555113 |
- Features : Body+URL
Classfier No. | Accuracy on 5 Fold Cross-Validation on Training Set | Accuracy on Testing Set |
---|---|---|
1 | 0.5690690690690691 | 0.4198396793587174 |
2 | 0.6146146146146146 | 0.37374749498997994 |
3 | 0.49424424424424424 | 0.4248496993987976 |
4 | 0.6146146146146146 | 0.40480961923847697 |
5 | 0.5402902902902903 | 0.41783567134268534 |
- Features : Title+URL
Classfier No. | Accuracy on 5 Fold Cross-Validation on Training Set | Accuracy on Testing Set |
---|---|---|
1 | 0.8893893893893894 | 0.5200400801603207 |
2 | 0.977977977977978 | 0.4308617234468938 |
3 | 0.7424924924924925 | 0.5350701402805611 |
4 | 0.977977977977978 | 0.5260521042084169 |
5 | 0.6896896896896897 | 0.5190380761523046 |
- Features : Title+Body+URL
Classfier No. | Accuracy on 5 Fold Cross-Validation on Training Set | Accuracy on Testing Set |
---|---|---|
1 | 0.9184184184184184 | 0.5531062124248497 |
2 | 0.9854854854854855 | 0.42685370741482964 |
3 | 0.7482482482482482 | 0.5480961923847696 |
4 | 0.9854854854854855 | 0.5320641282565131 |
5 | 0.6984484484484484 | 0.5370741482965932 |
- As I've got the best accuracy by using Title+Body+URL as a feature and using SVC as the classfier hence I'l be using SVC model and Title+Body+URL as features to detect flairs on the web app.
- The code for data cleaning, model training and the results can be found here.
- The web-app is made using flask-library.
- It uses Flask-WTF forms to input the URL from user.
- It passes the URL to model.py
- model.py uses the praw library to scrape the post from reddit and uses the cleaning, vectorizer and model from training processes to predict the flair of posts.
- Templates of the web app can be found here.
- Table of number of posts in each flair category in the data that is analysed.
Flair | No.of Posts |
---|---|
Political | 246 |
Non-Political | 6026 |
[R]eddiquette | 1176 |
AskIndia | 5399 |
Science/Technology | 1169 |
Policy/Economy | 1205 |
Finance/Business | 171 |
Sports | 430 |
Food | 305 |
Photography | 507 |
AMA | 210 |
Flair Wise Analysis Link
-
Multi-Bar Chart for analysing average/median, scores/comments of posts belonging to different flairs.
-
Bar charts for analysing hour of the day at which post of different flairs are posted.