Analyzing the sentiment development of news articles on the topic "migration" over time. This project was carried out as part of the Text Analytics lecture at Heidelberg University.
- pipeline.py: All main components of the project are included in the pipeline. The pipeline is controlled by the config.ini file.
- article_selection: Selects articles that are relevant for the analysis via a keyword search
- sentiment_analysis: Contains the different approaches to analyze the sentiment
  - bert.py: Evaluation of sentiment with the BERT model
  - sentiment_dictionary.py: Evaluation of sentiment through the SentiWS dictionary (an illustrative sketch of the dictionary idea follows this list)
  - negation_handling.py: Improvement of the dictionary approach by handling the negation of words
  - word2vec_sentiment.py: Word2Vec model to get synonyms of the search words for qualitative analysis
  - inference.py: Functions for applying and evaluating sentiment analysis methods on large batches of data
- training: Training code to fine-tune the BERT model (the resulting model is published at https://huggingface.co/mdraw/german-news-sentiment-bert)
- visualization:
  - dash_plot.py: Dash application to show sentiment timelines
  - wordcloud.py: Generates word clouds of results from the word2vec model
- annotation: Tool to annotate articles to generate training and test data
- scraping: Generating the articles
- pipeline_test.py: Code tests
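To give a rough idea of the dictionary-based approach used in sentiment_dictionary.py, here is a minimal illustrative sketch (not the project's actual implementation): it sums SentiWS polarity scores over the tokens of a sentence. The file names and line format assume the standard SentiWS v2.0 release.

```python
# Illustrative sketch only -- the real logic lives in sentiment_analysis/sentiment_dictionary.py.
# Assumes the SentiWS v2.0 files (SentiWS_v2.0_Positive.txt / SentiWS_v2.0_Negative.txt),
# whose lines look like:  Wort|POS<TAB>polarity<TAB>inflected,forms
from pathlib import Path


def load_sentiws(path: str) -> dict[str, float]:
    """Map every base form and inflection to its polarity score in [-1, 1]."""
    scores: dict[str, float] = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        parts = line.split("\t")
        word = parts[0].split("|")[0]
        polarity = float(parts[1])
        scores[word.lower()] = polarity
        if len(parts) > 2:  # optional comma-separated inflections
            for form in parts[2].split(","):
                scores[form.lower()] = polarity
    return scores


def sentence_score(sentence: str, scores: dict[str, float]) -> float:
    """Naive sentence score: sum the polarities of all known tokens."""
    return sum(scores.get(tok.lower(), 0.0) for tok in sentence.split())


if __name__ == "__main__":
    lexicon = load_sentiws("SentiWS_v2.0_Positive.txt")
    lexicon.update(load_sentiws("SentiWS_v2.0_Negative.txt"))
    print(sentence_score("Die Lage verbessert sich deutlich", lexicon))
```

This naive token sum is exactly what negation_handling.py improves on in the real pipeline, e.g. by flipping or damping scores of negated words.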
- Clone this repository
- Create a new virtual environment and activate it:
  virtualenv env
  source env/bin/activate
- Install the dependencies from frozen-requirements.txt and then install the German language package for spaCy:
  pip install -r frozen-requirements.txt
  python -m spacy download de
Either ask us for the scraped articles or use scraping/collect_articles.py to build the dataset yourself (this can take a few days). For detailed instructions, refer to the docstring of collect_articles.py.
The expected article source files (*-sources.txt) can be obtained from https://wortschatz.uni-leipzig.de/en/download/German. They are located inside the .tar.gz files listed there. For this project we used the following archives (a small extraction sketch follows the list):
deu_news_2007_100k.tar.gz
deu_news_2008_100k.tar.gz
deu_news_2009_100k.tar.gz
deu_news_2010_100k.tar.gz
deu_news_2011_100k.tar.gz
deu_news_2012_100k.tar.gz
deu_news_2013_100k.tar.gz
deu_news_2014_100k.tar.gz
deu_news_2015_100k.tar.gz
deu_newscrawl_2017_100k.tar.gz
deu_newscrawl_2018_100k.tar.gz
deu_newscrawl-public_2019_100k.tar.gz
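The following helper is a hypothetical sketch (not part of the repository) for pulling the *-sources.txt files out of archives that you have already downloaded into a local directory; the directory names are assumptions, adjust them to your setup.

```python
# Hypothetical helper (not part of this repository): extract the *-sources.txt
# files from already-downloaded Leipzig Corpora archives into ./sources/.
import tarfile
from pathlib import Path

ARCHIVE_DIR = Path("downloads")  # assumed location of the deu_news*_100k.tar.gz files
OUT_DIR = Path("sources")
OUT_DIR.mkdir(exist_ok=True)

for archive in sorted(ARCHIVE_DIR.glob("deu_news*_100k.tar.gz")):
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar.getmembers():
            if member.name.endswith("-sources.txt"):
                member.name = Path(member.name).name  # drop the internal folder prefix
                tar.extract(member, OUT_DIR)
                print(f"Extracted {member.name} from {archive.name}")
```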
The pipeline is controlled by the config.ini file. Configure it as you wish, then run:
  pipeline.py config.ini
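For orientation, config.ini is a standard INI file that pipeline.py parses with Python's configparser. The section and option names below are made up for illustration only; see the config.ini shipped with the repository for the settings that are actually understood.

```python
# Illustration of how an INI configuration is typically read with configparser.
# The section and option names here are hypothetical -- check the repository's
# config.ini for the real settings used by pipeline.py.
import configparser
import sys

config = configparser.ConfigParser()
config.read(sys.argv[1])  # e.g. "config.ini"

# Hypothetical options, for illustration only:
data_dir = config.get("paths", "data_dir", fallback="data/")
run_bert = config.getboolean("sentiment", "use_bert", fallback=True)
print(data_dir, run_bert)
```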
We cannot upload our article data publicly for copyright reasons. If you are interested in our version of the dataset and/or in the intermediate results, please email us so we can help you. Our fine-tuned BERT model can be found at https://huggingface.co/mdraw/german-news-sentiment-bert
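If you just want to try the published model, a minimal sketch with the Hugging Face transformers library looks roughly like this; the exact output labels are defined by the model configuration, so check the model card linked above.

```python
# Minimal sketch for trying the published fine-tuned model with Hugging Face
# transformers (pip install transformers torch). The output label names depend
# on the model's configuration -- see the model card for details.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="mdraw/german-news-sentiment-bert",
)

print(classifier("Die Stimmung in der Bevölkerung hat sich verschlechtert."))
# -> e.g. [{'label': '...', 'score': 0.9...}]
```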
We share a redacted version of our final project report here. Please refer to this document for more details on the background, methods, and results of the project for which this code was written.
- Simon Lüdke (simon.luedke at gmx.de)
- Josephine Grau (josephine.grau at web.de)
- Martin Drawitsch (martin.drawitsch at gmail.com)