| Name | Matriculation Number |
|---|---|
| Ong Zhi Ying, Adrian | U2121883A |
| Takesawa Saori | U2023120E |
| Cheong Yong Wen | U2021159L |
| Kwok Zong Heng | U2021027E |
| Mandfred Leow Hong Jie | U2122023G |
| Mao Yiyun | U2022609J |
By 2030, Singapore aims for electric vehicles (EVs) to make up a significant portion of its vehicle population as part of its commitment to combat climate change. Given the incentives to switch to EVs, members of the public will soon need to decide which brand and model of EV to purchase. To assist with this decision, this project designs and develops an information retrieval system that can search and display public user comments about EV brands and models from various social platforms. The system also derives deeper insights using Natural Language Processing techniques such as sentiment analysis, subjectivity detection, and sarcasm classification.
The project is divided into 4 main components:
- Web crawling
- Data Indexing (Backend)
- Frontend UI
- Classification
- Python 3.8.5
- Curl 8.4.0
- Apache Solr 9.5.0 (Place it under this repo's folder)
- Java 1.8.0_401
- Firstly, change directory into the base directory using `cd SC4021-Project` and install all required libraries using `pip install -r combined_requirements.txt`
- Make sure you have the correct venv or conda environment activated
- Run `reddit-data-extraction.ipynb` -> This notebook contains step-by-step code for crawling data from Reddit using predefined subreddits
- After crawling the data, run `data-processing-for-solr.ipynb` -> Executes basic data pre-processing before ingesting data into Solr
- {subreddit_name}-posts.csv -> Contains the top 100 posts from the subreddit
- {subreddit_name}-comments.csv -> Contains all the comments associated with the top 100 posts of the subreddit
- all-post.csv -> Contains the combined posts from all subreddits
- all-comments.csv -> Contains the combined comments from all subreddits
- cleaned_combined_data.csv -> Contains both the posts and comments (Normalized) from all subreddits after cleaning
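
The step from the per-type CSVs to a single normalized file can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the column names (`id`, `title`, `body`, `subreddit`) and the output fields are assumptions and may not match the real CSVs.

```python
import csv
import io

# Hypothetical output schema; the real cleaned_combined_data.csv may differ.
OUTPUT_FIELDS = ["id", "type", "subreddit", "text"]


def normalize_rows(post_rows, comment_rows):
    """Merge post and comment records into one list with a shared schema."""
    combined = []
    for row in post_rows:
        # Assumption: a post's searchable text is its title plus its body.
        parts = (row.get("title", ""), row.get("body", ""))
        text = " ".join(part for part in parts if part).strip()
        combined.append({"id": row["id"], "type": "post",
                         "subreddit": row["subreddit"], "text": text})
    for row in comment_rows:
        combined.append({"id": row["id"], "type": "comment",
                         "subreddit": row["subreddit"],
                         "text": row.get("body", "").strip()})
    # Drop records that have no text left after trimming whitespace.
    return [record for record in combined if record["text"]]


def to_csv(rows):
    """Serialize normalized records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=OUTPUT_FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Tagging each record with a `type` keeps posts and comments distinguishable after they share one file.
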
- Make sure environment variable `$JAVA_HOME` is set to the correct Java JDK
- Make sure environment variable `$PATH` is set to the correct Apache Solr directory
- Open up CMD and run `solr start`
- You can navigate to `localhost:8983/solr` to access the GUI for Apache Solr; however, we will be using Curl to communicate with Solr
- Navigate to the Jupyter notebook `add_solr_schema.ipynb` and run the cells to index the data into Apache Solr
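
For readers who prefer Python over Curl, the same kind of HTTP calls can be sketched with the standard library. This is an illustration under assumptions, not the contents of the notebook: the core name `ev_comments` and the field name `text` are hypothetical, while the `/schema` and `/update` endpoints are Solr's standard Schema API and update handler.

```python
import json
import urllib.request

SOLR_BASE = "http://localhost:8983/solr"
CORE = "ev_comments"  # hypothetical core name; replace with the real one


def add_field_request(name, field_type, core=CORE):
    """Build a Schema API request that adds one field to the core."""
    payload = {"add-field": {"name": name, "type": field_type,
                             "stored": True, "indexed": True}}
    return urllib.request.Request(
        f"{SOLR_BASE}/{core}/schema",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_docs_request(docs, core=CORE):
    """Build an update request that indexes documents and commits them."""
    return urllib.request.Request(
        f"{SOLR_BASE}/{core}/update?commit=true",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send a prepared request; needs a running Solr with the core created."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With Solr running, `send(add_field_request("text", "text_general"))` followed by `send(index_docs_request(docs))` would define a field and then index a list of dictionaries.
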
- Navigate to the frontend directory by running `cd search_engine`
- Run `streamlit run app.py` to start the Streamlit app
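
A frontend such as this one typically turns the user's search text into a request against Solr's `/select` handler. The sketch below shows one way that could look; it is not taken from `app.py`, and the core name `ev_comments` and default field `text` are assumptions.

```python
from urllib.parse import urlencode


def build_select_url(query, rows=10, core="ev_comments",
                     base="http://localhost:8983/solr"):
    """Build a Solr /select URL for a free-text query.

    The core name and the default search field (df) are hypothetical.
    """
    params = {"q": query, "df": "text", "rows": rows, "wt": "json"}
    return f"{base}/{core}/select?{urlencode(params)}"


def extract_docs(solr_response):
    """Return the matching documents from a parsed Solr JSON response."""
    return solr_response.get("response", {}).get("docs", [])
```

For example, `build_select_url("chevy bolt", rows=5)` yields `http://localhost:8983/solr/ev_comments/select?q=chevy+bolt&df=text&rows=5&wt=json`.
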
Different classification innovations are implemented in various notebooks. These can be found under classification_final\models.
The notebooks are as follows:
- `Polarity_and_subjectivity_Detection.ipynb` -> This notebook contains the code for detecting the polarity and subjectivity of the comments
- `inter_annotation_agreement.ipynb` -> This notebook contains the code for calculating the inter-annotator agreement
- `Roberta_mnli_classification_majorityvoting.ipynb` -> This notebook contains the code for evaluating dataset selection with 2 RoBERTa models and experimenting with a voting ensemble
- `Classification_Bert.ipynb` -> Uses a pretrained BERT model to predict the sentiment of comments
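
Inter-annotator agreement for two annotators is commonly measured with Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is a plain-Python illustration of that metric; it is an assumption that the notebook uses kappa, and it may rely on a different metric or on a library implementation.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)
```

A value of 1 means perfect agreement, while 0 means agreement no better than chance.
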
Different innovations are implemented in various notebooks. These can be found under `Innovation\models`.
The notebooks are as follows:
- `Roberta_mnli_classification_majorityvoting.ipynb` -> This notebook contains the code for evaluating dataset selection with 2 RoBERTa models and experimenting with a voting ensemble.
- `sarcasm_detection` -> This notebook contains the code for evaluating text that is sarcastic
- `innovation_bert_and_stack_ensemble` -> This notebook splits the annotated data into train/test sets with a 75/25 ratio. It also fine-tunes BERT and compares the results to a stacked ensemble of BERT, RandomForest and LogisticRegression.
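
The idea behind a voting ensemble can be illustrated without any model code: each model predicts a label per comment, and the most frequent label wins. This is a minimal sketch of hard majority voting, not the notebook's implementation; the tie-breaking rule (prefer the earliest model in the list) is an assumption.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one list by hard majority voting.

    predictions_per_model holds one list of labels per model, all of the
    same length and in the same comment order.
    """
    voted = []
    for sample_predictions in zip(*predictions_per_model):
        counts = Counter(sample_predictions)
        top_count = max(counts.values())
        # Assumed tie-break: take the label from the earliest model
        # whose prediction is among the most frequent labels.
        winner = next(label for label in sample_predictions
                      if counts[label] == top_count)
        voted.append(winner)
    return voted
```

With only two models every disagreement is a tie, so the tie-breaking rule decides the outcome in those cases.
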
- `popular_comment_Bolt_YWAnnotate.csv` -> Contains the labelled data for the Bolt EV labelled by 1 annotator
- `popular_comment_Bolt_zh_annotate.csv` -> Contains the labelled data for the Bolt EV labelled by 1 annotator
- `popular_comment_Bolt_annotate_Merged.csv` -> Contains the labelled data for the Bolt EV labelled by 2 annotators
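
Annotated records like these are what a 75/25 train/test split operates on. The sketch below shows a reproducible split in plain Python; the function name and the fixed seed are illustrative assumptions, and the notebook may use a library helper instead.

```python
import random


def split_train_test(records, train_ratio=0.75, seed=42):
    """Shuffle records reproducibly and split them into train and test lists."""
    shuffled = list(records)
    # A seeded generator keeps the split identical across runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

Fixing the seed makes results comparable between the fine-tuned model and the ensemble, since both see the same test set.
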