Unveiling Sentiments and Topics in COVID-19 Vaccine Comments on YouTube Over Time: from the First Vaccine Approval to the Post-Pandemic Era
This project was part of the Text and Multimedia Mining course in the first semester of the Artificial Intelligence master's program at Radboud University, Nijmegen, Netherlands.
While extensive research has examined emotional responses, public sentiment, and topics related to COVID-19 vaccines, most studies have relied on data from the initial coronavirus outbreak or the vaccine rollout, ending in 2021 or 2022. Deeper insight into public opinion can be gained by examining the discourse on YouTube videos about coronavirus vaccines from the first vaccine approval through the post-pandemic era. This study examines a dataset collected specifically to record how public sentiment and discourse evolved across the pandemic's phases. The dataset consists of 907,380 English comments from 1,195 videos and 592,780 distinct users, retrieved through the YouTube API with the search query "Covid-19 vaccines". The analysis focuses on comments published between August 11, 2020, when Russia granted the world's first COVID-19 vaccine approval, and December 2023, seven months after the official declaration (May 2023) that COVID-19 is no longer considered a global threat. Using BERT language models, this work presents a monthly sentiment analysis and topic modeling, offering insight into how discussions evolved as vaccine development advanced and the pandemic progressed, while also employing novel research techniques. The pre-trained BERTsent and BERTopic models were used to analyze the sentiment and topics of the YouTube comments, and these state-of-the-art models accurately captured the emerging attitudes and concerns of the public regarding COVID-19 vaccines. To support the external validity of the findings, a sample subset was manually labeled for the sentiment analysis and compared with the BERT model's predictions, resulting in a high level of agreement (~85.5% F1-score).
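The agreement check between manual labels and model predictions can be illustrated with a small, dependency-free sketch. The abstract does not say which F1 averaging was used, so macro averaging is an assumption here, as are the function name `macro_f1`, the three label names, and the toy data.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Macro-averaged F1 over every label seen in y_true or y_pred.

    Assumption: macro averaging; the source only reports "~85.5% F1-score".
    """
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        raise ValueError("label lists must not be empty")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed

    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, not taken from the study's data.
manual = ["negative", "negative", "neutral", "positive", "negative", "positive"]
predicted = ["negative", "neutral", "neutral", "positive", "negative", "negative"]
score = macro_f1(manual, predicted)
print(f"macro F1: {score:.3f}")
```

If scikit-learn is available, `sklearn.metrics.f1_score(manual, predicted, average="macro")` computes the same quantity.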
Related research studies and relevant events were used to verify the output of both the sentiment analysis and the topic modeling, demonstrating the effectiveness of BERT models in capturing complex sentiments and topics even though the models were not explicitly trained on these data. The sentiment analysis highlights persistent negative sentiment punctuated by spikes corresponding to key vaccine-related events. Topic modeling reveals the efficacy, effectiveness, and safety of vaccines as a dominant theme, with significant fluctuations in the discourse reflecting major events. Sentiment trends within the major topics show persistent negative sentiment towards vaccine efficacy and safety, while also capturing moments of positive sentiment that reflect support for healthcare efforts and vaccine developments. The extended timeframe of the study allows nuanced insight into public discourse across the various stages of the pandemic, while also highlighting the critical role of social media platforms in influencing public opinion during global health emergencies.
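The monthly view of sentiment described above can be sketched as follows. This is a minimal illustration and not the notebook's actual code: the function name, the `(date, label)` input shape, the label names, and the exact end day of the window are assumptions, and the sample data is made up.

```python
from collections import defaultdict
from datetime import date

STUDY_START = date(2020, 8, 11)  # first vaccine approval, per the abstract
STUDY_END = date(2023, 12, 31)   # "December 2023"; the exact end day is an assumption


def monthly_sentiment_share(comments, label="negative"):
    """Return {(year, month): share of comments carrying `label`}.

    `comments` is an iterable of (published_date, sentiment_label) pairs.
    Comments outside the study window are ignored.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for published, sentiment in comments:
        if not (STUDY_START <= published <= STUDY_END):
            continue
        key = (published.year, published.month)
        totals[key] += 1
        if sentiment == label:
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in sorted(totals)}


# Hypothetical (date, label) pairs, not taken from the study's data.
sample = [
    (date(2020, 8, 10), "negative"),  # before the window, dropped
    (date(2020, 8, 11), "negative"),
    (date(2020, 8, 20), "positive"),
    (date(2021, 1, 5), "negative"),
    (date(2021, 1, 9), "negative"),
    (date(2021, 1, 30), "neutral"),
    (date(2024, 1, 1), "negative"),   # after the window, dropped
]
shares = monthly_sentiment_share(sample)
for (year, month), share in shares.items():
    print(f"{year}-{month:02d}: {share:.2f} negative")
```

Plotting these monthly shares against a timeline of vaccine-related events is one way to look for the kind of spikes mentioned above.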
- The main implementation of the project is in the main.ipynb notebook, where Kaggle GPU resources were used for the sentiment analysis and topic modeling.
- The code used to extract and filter the dataset is in the data folder.
- The charts.html file contains all the plots generated for the project.
- The requirements.txt file lists the modules needed to extract the dataset.