To run the script locally, install Python and the required libraries in a virtual environment.
- Create a virtual environment:
python -m venv /path/to/new/virtual/environment
- Activate the virtual environment:
- Linux (bash)
source /path/to/new/virtual/environment/bin/activate
- Windows (run from inside the virtual environment directory)
.\Scripts\activate.bat
- Install the required libraries from the requirements.txt file:
pip install -r requirements.txt
Use the following command to run main.py:
python main.py
Each of the implemented project tasks can be run by uncommenting its function call in the main.py file.
Most of the functions have the following format:
task_function_name(start_index, end_index)
Parameters:
- start_index: the index number of the first article document to process, along with all of its comment documents
- end_index: the index number of the last article document to process, along with all of its comment documents
- Identify a topic in climate change which has been extensively commented on in the media, and assign it to the keyword variable.
keyword = 'carbon emissions'
- Use a News API (The Guardian API) to retrieve 20 search outcomes related to the Climate Change topic specified in Task 1
results = retrieve_search_results(keyword, nr_of_outcomes)
- Parameters:
  - keyword: the topic phrase that will be used in the search query
  - nr_of_outcomes: the maximum number of articles to retrieve (in this case, nr_of_outcomes = 40)
- Note: not all articles found in the Opinion section of The Guardian have comments, so we retrieve a larger number of results and check if they contain comments before saving them as text documents.
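As a rough illustration of the search step, the sketch below builds a request URL for The Guardian's content search endpoint. The helper name build_search_url and the "YOUR_API_KEY" placeholder are hypothetical, and the choice of query fields is an assumption; the project's retrieve_search_results may construct its request differently.

```python
from urllib.parse import urlencode

# Public search endpoint of The Guardian Open Platform.
GUARDIAN_SEARCH_URL = "https://content.guardianapis.com/search"


def build_search_url(keyword, nr_of_outcomes, api_key):
    """Build a search URL for Opinion-section articles about `keyword`.

    Hypothetical helper, not taken from the project code.
    """
    params = {
        "q": keyword,
        "section": "commentisfree",   # section id of The Guardian's Opinion section
        "page-size": nr_of_outcomes,  # over-fetch, e.g. 40, then keep articles with comments
        "show-fields": "shortUrl,bodyText",
        "api-key": api_key,
    }
    return f"{GUARDIAN_SEARCH_URL}?{urlencode(params)}"


url = build_search_url("carbon emissions", 40, "YOUR_API_KEY")
print(url)
```

Requesting shortUrl alongside the body text keeps everything needed for the later comment lookup in a single response.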
- Extract the text from the article
- Use the article's shortUrl to retrieve the top comments of the article (about 2-5 comments per article)
news_articles_w_comments = parse_news_articles(results)
get_comments_from_articles(news_articles_w_comments)
- Pre-processing: stopword removal, stemming, lemmatization & tokenization
- Get the 20 most frequent terms of each document and draw a histogram
- Output the Jaccard Index between the most frequent terms found in the article and those found in each of its comments
view_most_frequent_terms(1, 21)
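The term comparison above can be sketched as follows. This is a minimal illustration, assuming each document has already been pre-processed into a list of tokens; the function names most_frequent_terms and jaccard_index are hypothetical and are not taken from the project code.

```python
from collections import Counter


def most_frequent_terms(tokens, n=20):
    """Return the set of the n most frequent tokens in a document."""
    return {term for term, _ in Counter(tokens).most_common(n)}


def jaccard_index(terms_a, terms_b):
    """Size of the intersection divided by size of the union.

    Defined here as 0.0 when both sets are empty.
    """
    union = terms_a | terms_b
    if not union:
        return 0.0
    return len(terms_a & terms_b) / len(union)


# Toy token lists standing in for a pre-processed article and comment.
article_tokens = ["carbon", "emission", "tax", "carbon", "policy"]
comment_tokens = ["carbon", "tax", "unfair", "tax"]

article_terms = most_frequent_terms(article_tokens)
comment_terms = most_frequent_terms(comment_tokens)
print(jaccard_index(article_terms, comment_terms))  # 2 shared terms / 5 distinct terms = 0.4
```

A value near 1 means a comment reuses most of the article's dominant vocabulary; a value near 0 means the two share few frequent terms.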
- Get the positive & negative sentiment vector for each News Article
- Get the positive & negative sentiment vector for each Comment
- Calculate the Pearson correlation of the sentiment vector values between each News Article and its Comments
view_sentiments(1, 21)
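For reference, the Pearson correlation itself can be computed as in the sketch below. How the project aligns article and comment sentiment values is not shown here, so the sketch simply assumes two equal-length lists of scores; the function name pearson and the sample values are illustrative only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined when one list is constant.
        return float("nan")
    return cov / sqrt(var_x * var_y)


# Made-up sentiment scores, for illustration only.
article_scores = [0.1, 0.4, 0.2, 0.7]
comment_scores = [0.2, 0.5, 0.1, 0.6]
r = pearson(article_scores, comment_scores)
print(round(r, 3))
```

Note that with only two values per list the coefficient is always 1, -1, or undefined, so the correlation is only informative over longer score lists.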
View the Histogram of the negative entities found in each document, and their influence on their sentences
- Get a list of Negative Emotion Wordings from the Empath corpus
- Identify the entities that the Negative Sentiment Words are associated with (if any)
- Generate a histogram of these entities (across all the Comments of a News Article)
view_user_disagreement_extent(1, 21)
View the Histogram of the positive entities found in each document, and their influence on their sentences
- Get a list of Positive Emotion Wordings from the Empath corpus
- Identify the entities that the Positive Sentiment Words are associated with (if any)
- Generate a histogram of these entities (across all the Comments of a News Article)
view_user_agreement_extent(1, 21)
- Generate a list of agreement words
- Generate a list of disagreement words
- Count the occurrences of all agreement-related words in all the Comment documents of each News Article
- Count the occurrences of all disagreement-related words in all the Comment documents of each News Article
- Generate a histogram
view_commenter_behavior(1, 21)
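The counting step above could look roughly like the sketch below. The two word lists are short hypothetical stand-ins (the project generates its own lists), and count_stance_words is an illustrative name, not a function from the project.

```python
# Hypothetical word lists; the project's generated lists are not shown here.
AGREEMENT_WORDS = {"agree", "exactly", "true", "right"}
DISAGREEMENT_WORDS = {"disagree", "wrong", "nonsense", "false"}


def count_stance_words(comments):
    """Count agreement and disagreement words over all comments of one article."""
    counts = {"agreement": 0, "disagreement": 0}
    for comment in comments:
        for token in comment.lower().split():
            # Strip surrounding punctuation so "right." matches "right".
            word = token.strip(".,!?;:\"'()")
            if word in AGREEMENT_WORDS:
                counts["agreement"] += 1
            elif word in DISAGREEMENT_WORDS:
                counts["disagreement"] += 1
    return counts


comments = [
    "I agree, this is exactly right.",
    "Nonsense! The figures are wrong.",
]
counts = count_stance_words(comments)
print(counts["agreement"], counts["disagreement"])  # 3 2
```

The two totals per article are the values a histogram would then plot side by side.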