This repository contains scripts and test data used for the development of a topic modeling pipeline in the context of the MiMoText project.
The pipeline is based on the following set of scripts by Christof Schöch: https://github.com/dh-trier/topicmodeling/. It is constantly being revised and developed.
- Extracting metadata
- Splitting texts
- Preprocessing: lemmatizing, POS-tagging, filtering by POS, stopword list and minimum word length
- Modeling with Mallet (using the Mallet wrapper in the gensim Python library)
- Postprocessing: statistics (different lists and matrices)
- Visualizing via pyLDAvis (a minimal sketch follows this list)
- Generating heatmaps
- Generating wordclouds
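To give a flavor of the visualization step, here is a minimal, self-contained pyLDAvis sketch; the toy texts and the small gensim model are invented stand-ins for the pipeline's actual preprocessed data and trained model:

```python
# Sketch: train a toy gensim LDA model and export an interactive pyLDAvis page.
from gensim.corpora import Dictionary
from gensim.models import LdaModel
import pyLDAvis
import pyLDAvis.gensim  # module name in the pyLDAvis releases that match gensim 3.x

# Invented mini-corpus standing in for the preprocessed lemma lists.
texts = [["roman", "amour", "lettre"], ["voyage", "mer", "navire"],
         ["amour", "coeur", "lettre"], ["mer", "voyage", "tempete"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)
vis = pyLDAvis.gensim.prepare(model, corpus, dictionary)
pyLDAvis.save_html(vis, "visualization.html")
```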
Please install the following:
- Python 3
- Some additional libraries (with their respective dependencies):
  - "numpy", see: https://www.numpy.org/
  - "pandas", see: https://pandas.pydata.org/
  - "treetaggerwrapper", see: https://pypi.org/project/treetaggerwrapper/
  - "gensim", version 3.8.3, see: https://radimrehurek.com/gensim/install.html
    - Important note: gensim 3.8.3 is the last release to include the LDA Mallet wrapper, which is essential for the pipeline, so exactly this gensim version is needed (see the version check after this list).
  - "pyLDAvis", see: https://github.com/bmabey/pyLDAvis
  - "sklearn", see: https://pypi.org/project/scikit-learn/
  - "seaborn", see: https://seaborn.pydata.org/
  - "wordcloud", see: https://pypi.org/project/wordcloud/ (Note: installing wordcloud on Windows often causes difficulties; it may help to install and run the library with Python 3.7)
- TreeTagger, see: https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
  - Please note: follow the installation instructions given there and mind the differences between operating systems. It is not necessary to download any language parameter files; they are already included in this folder. A minimal tagging sketch follows this list.
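Since the Mallet wrapper was removed in gensim 4.x, a quick sanity check of your installation can save debugging time later. This is a plain Python snippet, independent of the pipeline code:

```python
# Check that the installed gensim still ships the LDA Mallet wrapper.
import gensim

print(gensim.__version__)  # should print 3.8.3

# This import fails on gensim 4.x, where gensim.models.wrappers was removed.
from gensim.models.wrappers import LdaMallet
print("Mallet wrapper available:", LdaMallet is not None)
```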
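And once TreeTagger itself is installed, a minimal lemmatizing/POS-tagging round trip with treetaggerwrapper might look like the sketch below. The TAGDIR value is an assumption: point it at your local TreeTagger directory (it can be omitted if the environment is configured as described in the treetaggerwrapper documentation):

```python
# Minimal sketch: tag a French sentence and print word, POS tag and lemma.
import treetaggerwrapper

# TAGDIR is a placeholder path; adjust it to your TreeTagger installation.
tagger = treetaggerwrapper.TreeTagger(TAGLANG="fr", TAGDIR="/path/to/treetagger")
tags = treetaggerwrapper.make_tags(tagger.tag_text("Le roman raconte une histoire."))
for tag in tags:
    print(tag.word, tag.pos, tag.lemma)
```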
For the modeling, you first have to install Mallet itself:
- Mallet: http://mallet.cs.umass.edu/; download it here: http://mallet.cs.umass.edu/topics.php (a helpful installation guide can be found here: https://programminghistorian.org/en/lessons/topic-modeling-and-mallet#installing-mallet)
- Important: in order to run the scripts, you need to specify the path where you stored the Mallet binary on your computer (see "mallet_path" in roman18_run_pipeline.py).
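To illustrate what this looks like in code, here is a sketch of gensim's LdaMallet wrapper over a toy corpus; mallet_path is a placeholder to adapt, and the texts merely stand in for the pipeline's real preprocessed data:

```python
# Sketch: train a tiny topic model through the Mallet wrapper (gensim 3.8.3).
from gensim.corpora import Dictionary
from gensim.models.wrappers import LdaMallet

# Toy stand-in for the preprocessed lemma lists produced by the pipeline.
texts = [["roman", "amour", "lettre"], ["voyage", "mer", "navire"],
         ["amour", "coeur", "lettre"], ["mer", "voyage", "tempete"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

mallet_path = "/path/to/mallet-2.0.8/bin/mallet"  # placeholder: adjust to your installation
model = LdaMallet(mallet_path, corpus=corpus, num_topics=2,
                  id2word=dictionary, optimize_interval=100)
print(model.show_topics(num_topics=2, num_words=5))
```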
- Please make sure you have installed Python 3, TreeTagger, Mallet and the desired libraries.
- Download and save this repository.
- Save your text files (TXT) in datasets/[name-of-your-dataset]/full.
- Now you can run the scripts.
- Set your parameters in roman18_run_pipeline.py.
- Run roman18_run_pipeline.py.
  - It calls all required scripts in the correct order.
  - You can change the following parameters (an illustrative settings block follows this list):
    - chunksize: size of the text chunks (number of tokens) into which the novels are split
    - lang: language parameter to choose the model for POS-tagging; choose "fr" for modern French and "presto" for 16th/17th-century French
    - numtopics: number of topics created by the modeling
    - passes: number of iterations
    - modeling: specify whether you want to perform the modeling with gensim or Mallet
    - optimize_interval (only if Mallet is chosen): optimization of the topic model every [chosen value] iterations
    - cats: category for which the most distinctive topics are visualized in the heatmap
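For orientation, this is what such a settings block could look like. The variable names follow the list above, but check roman18_run_pipeline.py for the exact spelling and defaults; the concrete values (including the category "author") are only illustrative:

```python
# Illustrative parameter settings; verify names and defaults in roman18_run_pipeline.py.
chunksize = 1000            # tokens per text chunk
lang = "fr"                 # "fr" = modern French, "presto" = 16th/17th-century French
numtopics = 30              # number of topics to create
passes = 1000               # number of iterations
modeling = "mallet"         # "gensim" or "mallet"
optimize_interval = 100     # Mallet only: optimize the model every 100 iterations
cats = "author"             # hypothetical example category for the heatmap
```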
- The split texts are saved in datasets/[name of dataset]/txt.
- The preprocessed texts are saved as lists of lemmas in results/[name of dataset]/pickles.
- The gensim model is saved in results/[name of dataset]/model.
- In results/[name of dataset]/ you will also find statistical files, a file "visualization.html" and the heatmap visualizations.
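As an illustration of the heatmap step, here is a minimal seaborn sketch over an invented topics-by-category matrix; in the pipeline these scores are of course computed from the trained model, and both the category names and the numbers below are made up:

```python
# Hypothetical sketch: render a small topics-by-category matrix as a heatmap.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Invented average topic weights per category.
data = pd.DataFrame(
    {"sentimental": [0.30, 0.10, 0.05], "adventure": [0.08, 0.25, 0.12]},
    index=["topic_00", "topic_01", "topic_02"],
)
sns.heatmap(data, annot=True, cmap="viridis")
plt.savefig("heatmap.png", bbox_inches="tight")
```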
Files and script for preparing topic statements to feed into Wikibase.