SC4021 Information Retrieval-Group 24

Group 24 Members

Name                        Matriculation Number
Ong Zhi Ying, Adrian        U2121883A
Takesawa Saori              U2023120E
Cheong Yong Wen             U2021159L
Kwok Zong Heng              U2021027E
Mandfred Leow Hong Jie      U2122023G
Mao Yiyun                   U2022609J

Project Overview

By 2030, Singapore aims to have a significant portion of its vehicle population composed of electric vehicles (EVs) as part of its commitment to combating climate change. Given the incentives to switch to EVs, members of the public will soon need to decide which EV brand and model to purchase. To assist with this decision, this project designs and develops an information retrieval system that searches and displays public user comments about EV brands and models from various social platforms. Additionally, the system derives deeper insights using Natural Language Processing techniques such as sentiment analysis, subjectivity classification, and sarcasm detection.

Technical Overview

The project is divided into 4 main components:

  1. Web crawling
  2. Data Indexing (Backend)
  3. Frontend UI
  4. Classification

Prerequisites to run the code (exact versions are recommended but not required)

  1. Python 3.8.5
  2. Curl 8.4.0
  3. Apache Solr 9.5.0 (place it inside this repository's folder)
  4. Java 1.8.0_401

Instructions to run the code

  • First, make sure the correct venv or conda environment is activated
  • Change into the base directory with cd SC4021-Project and install all required libraries with pip install -r combined_requirements.txt

Web crawling

  1. Run reddit-data-extraction.ipynb -> Contains step-by-step code for extracting/crawling data from Reddit using predefined subreddits
  2. After crawling the data, run data-processing-for-solr.ipynb -> Performs basic data pre-processing before ingesting the data into Solr
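The exact cleaning steps live in data-processing-for-solr.ipynb; as a rough illustration only, pre-processing of Reddit comments often looks like the following (the clean_text helper and its specific rules are assumptions, not the notebook's actual code):

```python
import re

def clean_text(text):
    """Minimal Reddit-comment cleaning: strip URLs, markdown noise, extra whitespace."""
    text = re.sub(r"http\S+", "", text)       # remove URLs
    text = re.sub(r"[*>`~]", "", text)        # strip common markdown characters
    text = re.sub(r"\s+", " ", text).strip()  # normalize whitespace
    return text

print(clean_text("Check  this out: https://example.com *great* EV!"))
# -> "Check this out: great EV!"
```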

Structure of crawled data

  • {subreddit_name}-posts.csv -> Contains the top 100 posts from the subreddit
  • {subreddit_name}-comments.csv -> Contains all the comments associated with the top 100 posts of the subreddit
  • all-post.csv -> Contains the combined posts from all subreddits
  • all-comments.csv -> Contains the combined comments from all subreddits
  • cleaned_combined_data.csv -> Contains both the posts and comments (Normalized) from all subreddits after cleaning

Data Indexing (Backend)

  1. Make sure the environment variable $JAVA_HOME points to the correct Java JDK
  2. Make sure the environment variable $PATH includes the Apache Solr bin directory
  3. Open a terminal and run solr start
  4. You can navigate to localhost:8983/solr to access the Solr admin GUI; however, this project uses Curl to communicate with Solr
  5. Open the Jupyter notebook add_solr_schema.ipynb and run the cells to index the data into Apache Solr
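The schema and core are set up inside add_solr_schema.ipynb; as a hedged sketch of the kind of Curl calls involved (the core name ev_comments and the field definition here are assumptions, not the notebook's actual values):

```shell
# Assumed core name "ev_comments"; the real core and fields are defined in add_solr_schema.ipynb.

# Create a core (run once)
solr create -c ev_comments

# Add a field via the Schema API
curl -X POST -H 'Content-Type: application/json' \
  http://localhost:8983/solr/ev_comments/schema \
  -d '{"add-field": {"name": "text", "type": "text_general", "stored": true}}'

# Index the cleaned CSV and commit
curl 'http://localhost:8983/solr/ev_comments/update?commit=true' \
  --data-binary @cleaned_combined_data.csv \
  -H 'Content-Type: application/csv'
```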

Frontend UI

  1. Navigate to the frontend directory by running cd search_engine
  2. Run streamlit run app.py to start the Streamlit app

Classification

Different classification innovations are implemented in various notebooks. These can be found under classification_final/models. The notebooks are as follows:

  1. Polarity_and_subjectivity_Detection.ipynb -> Detects the polarity and subjectivity of the comments
  2. inter_annotation_agreement.ipynb -> Calculates the inter-annotator agreement
  3. Roberta_mnli_classification_majorityvoting.ipynb -> Evaluates dataset selection with two RoBERTa models and experiments with a voting ensemble
  4. Classification_Bert.ipynb -> Uses a pretrained BERT model to predict the sentiment of comments
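To make the polarity/subjectivity idea concrete, here is a toy lexicon-based scorer: polarity is the balance of positive vs. negative words, subjectivity is the fraction of opinionated words. This is an illustration of the concept only; the word lists and scoring rules are assumptions, not what Polarity_and_subjectivity_Detection.ipynb actually uses.

```python
# Toy lexicons -- illustrative assumptions, not the notebook's method.
POSITIVE = {"great", "love", "reliable", "smooth"}
NEGATIVE = {"terrible", "hate", "unreliable", "laggy"}

def polarity_subjectivity(comment):
    """Return (polarity in [-1, 1], subjectivity in [0, 1]) for a comment."""
    words = comment.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    opinionated = pos + neg
    polarity = 0.0 if opinionated == 0 else (pos - neg) / opinionated
    subjectivity = opinionated / max(len(words), 1)
    return polarity, subjectivity

print(polarity_subjectivity("I love my Bolt, the ride is smooth"))
# -> (1.0, 0.25)
```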

Innovation

Different innovations are implemented in various notebooks. These can be found under Innovation/models. The notebooks are as follows:

  1. Roberta_mnli_classification_majorityvoting.ipynb -> Evaluates dataset selection with two RoBERTa models and experiments with a voting ensemble.
  2. sarcasm_detection -> Evaluates classification of sarcastic text
  3. innovation_bert_and_stack_ensemble -> Splits the annotated data into train/test sets with a 75/25 ratio, fine-tunes BERT, and compares the results against a stacking ensemble of BERT, RandomForest, and LogisticRegression.
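The stacking idea in innovation_bert_and_stack_ensemble can be sketched with scikit-learn's StackingClassifier. In this sketch, TF-IDF features stand in for BERT embeddings, and the tiny toy dataset and its labels are invented for illustration; the notebook's actual features and data differ.

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy sentiment data (1 = positive, 0 = negative) -- invented for illustration.
texts = ["love this ev", "hate the range", "great charging", "terrible battery",
         "smooth ride", "awful service", "reliable car", "poor build"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

stack = make_pipeline(
    TfidfVectorizer(),
    StackingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
            ("lr", LogisticRegression()),
        ],
        final_estimator=LogisticRegression(),  # meta-learner over base predictions
        cv=2,  # small CV split because the toy dataset is tiny
    ),
)
stack.fit(texts, labels)
print(stack.predict(["love the smooth ride", "terrible range"]))
```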

Labelled data

  • popular_comment_Bolt_YWAnnotate.csv -> Contains the labelled data for the Bolt EV labelled by 1 annotator
  • popular_comment_Bolt_zh_annotate.csv -> Contains the labelled data for the Bolt EV labelled by 1 annotator
  • popular_comment_Bolt_annotate_Merged.csv -> Contains the labelled data for the Bolt EV labelled by 2 annotators
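With two annotators labelling the same Bolt comments, agreement (computed in inter_annotation_agreement.ipynb) is typically measured with a statistic such as Cohen's kappa. A minimal pure-Python version, assuming two equal-length label lists (the example labels below are invented):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' equal-length label sequences."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n          # raw agreement
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)  # chance agreement
    return (observed - expected) / (1 - expected)

ann1 = ["pos", "neg", "pos", "pos", "neg", "neg"]
ann2 = ["pos", "neg", "neg", "pos", "neg", "pos"]
print(round(cohens_kappa(ann1, ann2), 3))
# -> 0.333
```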
