SC4021 Information Retrieval-Group 24

Group 24 Members

Name                        Matriculation Number
Ong Zhi Ying, Adrian        U2121883A
Takesawa Saori              U2023120E
Cheong Yong Wen             U2021159L
Kwok Zong Heng              U2021027E
Mandfred Leow Hong Jie      U2122023G
Mao Yiyun                   U2022609J

Project Overview

By 2030, Singapore aims to have a significant portion of its vehicle population composed of electric vehicles (EVs) as part of its commitment to combating climate change. Given the incentives to switch to EVs, members of the public will soon need to decide which EV brand and model to purchase. To assist with this decision, this project designs and develops an information retrieval system that searches and displays public user comments about EV brands and models from various social platforms. Additionally, the system derives deeper insights using Natural Language Processing techniques such as sentiment analysis, subjectivity classification, and sarcasm detection.

Technical Overview

The project is divided into 4 main components:

  1. Web crawling
  2. Data Indexing (Backend)
  3. Frontend UI
  4. Classification

Prerequisites to run the code (exact versions are recommended but not required)

  1. Python 3.8.5
  2. Curl 8.4.0
  3. Apache Solr 9.5.0 (place it inside this repository's folder)
  4. Java 1.8.0_401

Instructions to run the code

  • First, make sure the correct venv or conda environment is activated
  • Change into the base directory with cd SC4021-Project and install all required libraries with pip install -r combined_requirements.txt

Web crawling

  1. Run reddit-data-extraction.ipynb -> Contains step-by-step code for extracting/crawling data from Reddit using predefined subreddits
  2. After crawling the data, run data-processing-for-solr.ipynb -> Performs basic data pre-processing before ingesting the data into Solr
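The exact cleaning steps live in data-processing-for-solr.ipynb; as a rough illustration only, pre-processing of Reddit comments often looks like the following (the clean_text helper and its specific rules are assumptions, not the notebook's actual code):

```python
import re

def clean_text(text):
    """Minimal Reddit-comment cleaning: strip URLs, markdown noise, extra whitespace."""
    text = re.sub(r"http\S+", "", text)       # remove URLs
    text = re.sub(r"[*>`~]", "", text)        # strip common markdown characters
    text = re.sub(r"\s+", " ", text).strip()  # normalize whitespace
    return text

print(clean_text("Check  this out: https://example.com *great* EV!"))
# -> "Check this out: great EV!"
```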

Structure of crawled data

  • {subreddit_name}-posts.csv -> Contains the top 100 posts from the subreddit
  • {subreddit_name}-comments.csv -> Contains all the comments associated with the top 100 posts of the subreddit
  • all-post.csv -> Contains the combined posts from all subreddits
  • all-comments.csv -> Contains the combined comments from all subreddits
  • cleaned_combined_data.csv -> Contains both the posts and comments (Normalized) from all subreddits after cleaning

Data Indexing (Backend)

  1. Make sure the environment variable $JAVA_HOME points to the correct Java JDK
  2. Make sure the environment variable $PATH includes the Apache Solr bin directory
  3. Open a terminal and run solr start
  4. You can navigate to localhost:8983/solr to access the Solr admin GUI; however, this project uses Curl to communicate with Solr
  5. Open the Jupyter notebook add_solr_schema.ipynb and run the cells to index the data into Apache Solr
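The schema and core are set up inside add_solr_schema.ipynb; as a hedged sketch of the kind of Curl calls involved (the core name ev_comments and the field definition here are assumptions, not the notebook's actual values):

```shell
# Assumed core name "ev_comments"; the real core and fields are defined in add_solr_schema.ipynb.

# Create a core (run once)
solr create -c ev_comments

# Add a field via the Schema API
curl -X POST -H 'Content-Type: application/json' \
  http://localhost:8983/solr/ev_comments/schema \
  -d '{"add-field": {"name": "text", "type": "text_general", "stored": true}}'

# Index the cleaned CSV and commit
curl 'http://localhost:8983/solr/ev_comments/update?commit=true' \
  --data-binary @cleaned_combined_data.csv \
  -H 'Content-Type: application/csv'
```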

Frontend UI

  1. Navigate to the frontend directory by running cd search_engine
  2. Run streamlit run app.py to start the Streamlit app

Classification

Different classification innovations are implemented in various notebooks. These can be found under classification_final/models. The notebooks are as follows:

  1. Polarity_and_subjectivity_Detection.ipynb -> Detects the polarity and subjectivity of the comments
  2. inter_annotation_agreement.ipynb -> Calculates the inter-annotator agreement
  3. Roberta_mnli_classification_majorityvoting.ipynb -> Evaluates dataset selection with two RoBERTa models and experiments with a voting ensemble
  4. Classification_Bert.ipynb -> Uses a pretrained BERT model to predict the sentiment of comments
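To make the polarity/subjectivity idea concrete, here is a toy lexicon-based scorer: polarity is the balance of positive vs. negative words, subjectivity is the fraction of opinionated words. This is an illustration of the concept only; the word lists and scoring rules are assumptions, not what Polarity_and_subjectivity_Detection.ipynb actually uses.

```python
# Toy lexicons -- illustrative assumptions, not the notebook's method.
POSITIVE = {"great", "love", "reliable", "smooth"}
NEGATIVE = {"terrible", "hate", "unreliable", "laggy"}

def polarity_subjectivity(comment):
    """Return (polarity in [-1, 1], subjectivity in [0, 1]) for a comment."""
    words = comment.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    opinionated = pos + neg
    polarity = 0.0 if opinionated == 0 else (pos - neg) / opinionated
    subjectivity = opinionated / max(len(words), 1)
    return polarity, subjectivity

print(polarity_subjectivity("I love my Bolt, the ride is smooth"))
# -> (1.0, 0.25)
```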

Innovation

Different innovations are implemented in various notebooks. These can be found under Innovation/models. The notebooks are as follows:

  1. Roberta_mnli_classification_majorityvoting.ipynb -> Evaluates dataset selection with two RoBERTa models and experiments with a voting ensemble.
  2. sarcasm_detection -> Evaluates classification of sarcastic text
  3. innovation_bert_and_stack_ensemble -> Splits the annotated data into train/test sets with a 75/25 ratio, fine-tunes BERT, and compares the results against a stacking ensemble of BERT, RandomForest, and LogisticRegression.
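The stacking idea in innovation_bert_and_stack_ensemble can be sketched with scikit-learn's StackingClassifier. In this sketch, TF-IDF features stand in for BERT embeddings, and the tiny toy dataset and its labels are invented for illustration; the notebook's actual features and data differ.

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy sentiment data (1 = positive, 0 = negative) -- invented for illustration.
texts = ["love this ev", "hate the range", "great charging", "terrible battery",
         "smooth ride", "awful service", "reliable car", "poor build"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

stack = make_pipeline(
    TfidfVectorizer(),
    StackingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
            ("lr", LogisticRegression()),
        ],
        final_estimator=LogisticRegression(),  # meta-learner over base predictions
        cv=2,  # small CV split because the toy dataset is tiny
    ),
)
stack.fit(texts, labels)
print(stack.predict(["love the smooth ride", "terrible range"]))
```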

Labelled data

  • popular_comment_Bolt_YWAnnotate.csv -> Contains the labelled data for the Bolt EV labelled by 1 annotator
  • popular_comment_Bolt_zh_annotate.csv -> Contains the labelled data for the Bolt EV labelled by 1 annotator
  • popular_comment_Bolt_annotate_Merged.csv -> Contains the labelled data for the Bolt EV labelled by 2 annotators
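With two annotators labelling the same Bolt comments, agreement (computed in inter_annotation_agreement.ipynb) is typically measured with a statistic such as Cohen's kappa. A minimal pure-Python version, assuming two equal-length label lists (the example labels below are invented):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' equal-length label sequences."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n          # raw agreement
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)  # chance agreement
    return (observed - expected) / (1 - expected)

ann1 = ["pos", "neg", "pos", "pos", "neg", "neg"]
ann2 = ["pos", "neg", "neg", "pos", "neg", "pos"]
print(round(cohens_kappa(ann1, ann2), 3))
# -> 0.333
```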
