Data Science Master's Programme, University of Helsinki
- Description
- Links
- Installation
- Usage
- Theory
- Credits and Licence
- Backlog
This app allows for the exploration of a corpus of historical letters using data science methods. The app has two sections.
The first is the part of speech, or POS tag, visualisation section. Here there are two tabs, containing bar and line graphs, to give the user a general overview of the dataset. This section contains various options to filter and restrict the dataset to allow the user more freedom in their exploration.
The second part of the app is the topic model section. It allows the user to generate a chosen number of “topics” from the dataset using the latent Dirichlet allocation (LDA) algorithm. When properly filtered and parameterized, this lets the user see which topics dominated the discussion in the letters. The app offers a wide array of options so that the user can adjust the model to their own questions of interest.
- App: http://193.166.25.206:8050/app/overview
- CLAWS7 part-of-speech tagset: http://ucrel.lancs.ac.uk/claws7tags.html
- Clone the repository
- Create a virtual environment with `python3 -m venv venv`
- Activate the virtual environment with `source venv/bin/activate`
- Run `pip install -r requirements.txt`
- Add the data folder `TCEECE` to the local project root; this folder is ignored by Git to avoid spreading the data (see the `.gitignore` file)
- Start the app with `python index.py`
- Visit `http://127.0.0.1:8050/app/overview`
- Shows the percentage of the chosen POS categories over time
- User can select:
- Year range
- Period length (10 years, 20 years, ...)
- User can choose up to three lines to compare and options for each line are:
- Sender Sex (M, F)
- Pre-made class groupings
- Fine grained - Royalty (R), Nobility (N), Gentry Upper (GU), Gentry Lower (GL, G), Clergy Upper (CU), Clergy Lower (CL), Professional (P), Merchant (M), Other (O)
- Regular - Royalty (R), Nobility (N), Gentry (GU, GL, G), Clergy (CU, CL), Professional (P), Merchant (M), Other (O)
- Tripartite - Upper (R, N, GU, GL, G, CU), Middle (CL, P, M), Lower (O)
- Bipartite - Gentry (R, N, GU, GL, G, CU), Non-Gentry (CL, P, M, O)
- Relationship (between sender and recipient)
- Grouped: Family, Friends, Other relationships
- Fine grained: Nuclear family, Other family, Family servant, Close friend, Other acquaintance
- POS-tags
- User can set a custom name for the graph and for each line
- Shows the number of words, letters, or senders in the data selected in the line graph view
- The differently-coloured bars correspond to the lines selected in the line graph view
- Bars can be divided by:
- Sender's sex
- Sender's rank
- Sender's relationship with recipient
- Number of topics to be generated by the LDA model.
- Maximum number of iterations through the corpus when inferring the topic distribution.
- Parameter (alpha) which determines the prior distribution over topic weights in documents.
- Auto option: Learns an asymmetric prior from the corpus
- Parameter (eta) which determines the prior distribution over word weights in each topic.
- Auto option: Learns an asymmetric prior from the corpus
- Option to choose a starting point for the generation of pseudorandom numbers to be used in the algorithm.
- Option to instruct the algorithm to only consider words of the chosen word type.
- Option to instruct the algorithm to ignore words given in the list.
- Option to instruct the algorithm to ignore words appearing in fewer than the selected number of documents.
- Option to instruct the algorithm to ignore words appearing in more than the selected proportion of documents. Input is a decimal between 0.01 and 1. (These vocabulary-filtering options are illustrated in the sketch after this list.)
- Option to instruct the algorithm to only consider letters from senders of the chosen sex.
- Option to instruct the algorithm to only consider letters from senders of the chosen rank.
- Option to instruct the algorithm to only consider letters between senders and receivers of a chosen relationship status.
- Option to instruct the algorithm to only consider letters sent between the chosen years.
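
As a rough illustration of the stopword and document-frequency options above, the following minimal sketch applies them with Gensim's `Dictionary` (the Theory section notes that the app is built on Gensim). The toy corpus, the stopword list and the thresholds are placeholders, not the app's actual values.

```python
# Hedged sketch: applying the stopword and document-frequency filters with a
# Gensim Dictionary before LDA training. Corpus and thresholds are invented.
from gensim.corpora import Dictionary

docs = [
    ["king", "court", "parliament"],
    ["ship", "trade", "merchant"],
    ["king", "parliament", "trade"],
]
stopwords = {"king"}  # "ignore words given in the list"

dictionary = Dictionary(docs)

# Drop the user-supplied stopwords from the vocabulary.
dictionary.filter_tokens(bad_ids=[dictionary.token2id[w]
                                  for w in stopwords if w in dictionary.token2id])

# no_below: ignore words appearing in fewer than N documents.
# no_above: ignore words appearing in more than the given proportion of documents.
dictionary.filter_extremes(no_below=2, no_above=0.9)

# Bag-of-words corpus ready for model training.
corpus = [dictionary.doc2bow(doc) for doc in docs]
print(corpus)
```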
Latent Dirichlet allocation, or LDA for short, is an algorithm used for topic modelling in natural language processing. Topics are groups of items, in this case tokens, which belong together due to their usage and prominence in the texts. Topics can be used to explore a corpus by unearthing and classifying the underlying themes present in it.
Preprocessing of the data plays a large part in obtaining meaningful results with this method, as the majority of words do not contribute any information about the topics themselves but serve other purposes, such as conveying the subjects in question or linking together parts of a sentence. In our implementation, preprocessing includes transforming words into lowercase form, tokenization, lemmatization and filtering out of tokens that consist of only one character. The user can additionally select stopwords to be filtered out of the final data used for model training, as sketched below.
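
As a rough illustration only, a preprocessing pipeline along these lines could look like the sketch below. It assumes Gensim's `simple_preprocess` for lowercasing and tokenization and NLTK's WordNet lemmatizer; the app's actual implementation may use different tools, and the example letters are invented.

```python
# Hedged sketch of the preprocessing steps described above: lowercasing,
# tokenization, lemmatization, dropping one-character tokens and user-chosen
# stopwords. Library choices here are assumptions, not the app's exact stack.
from gensim.utils import simple_preprocess
from nltk.stem import WordNetLemmatizer  # requires nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()

def preprocess(text, stopwords=frozenset()):
    # simple_preprocess lowercases, tokenizes, and drops tokens shorter than min_len.
    tokens = simple_preprocess(text, min_len=2)
    lemmas = (lemmatizer.lemmatize(tok) for tok in tokens)
    return [lemma for lemma in lemmas if lemma not in stopwords]

raw_letters = [
    "My dearest brother, I write to thee concerning the estate.",
    "The Parliament sat yesterday and much was spoken of trade.",
]
docs = [preprocess(letter, stopwords={"thee", "thou"}) for letter in raw_letters]
print(docs)
```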
In brief, the algorithm works by iterating over the documents. First, each word w in a document is assigned to one of k topics at random. Then, conditional probabilities are calculated to represent the likelihood that w belongs to each topic, the topic assignments are updated based on these probabilities, and the process repeats.
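
To make this loop concrete, here is a self-contained toy sketch of the assign-then-update idea, in the style of a collapsed Gibbs sampler. It is purely illustrative: the Gensim backend described below uses online variational Bayes rather than this sampling scheme, and the corpus and constants are invented.

```python
# Toy illustration of the assign-then-update loop described above (a collapsed
# Gibbs-style sampler). Purely illustrative; not the Gensim algorithm the app uses.
import random
from collections import defaultdict

def toy_lda(docs, k, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [defaultdict(int) for _ in docs]        # topic counts per document
    topic_word = [defaultdict(int) for _ in range(k)]   # word counts per topic
    topic_total = [0] * k
    assignments = []

    # Step 1: assign every word to one of the k topics at random.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(k)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    # Step 2: repeatedly resample each word's topic from its conditional
    # probability, given the current assignments of all other words.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z_old = assignments[d][i]
                doc_topic[d][z_old] -= 1
                topic_word[z_old][w] -= 1
                topic_total[z_old] -= 1
                # p(topic t | rest) is proportional to
                # (how much document d uses t) * (how much t uses word w)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(k)
                ]
                z_new = rng.choices(range(k), weights=weights)[0]
                assignments[d][i] = z_new
                doc_topic[d][z_new] += 1
                topic_word[z_new][w] += 1
                topic_total[z_new] += 1

    return topic_word  # per-topic word counts after sampling

topics = toy_lda([["king", "court", "crown", "king"],
                  ["trade", "ship", "port", "trade"]], k=2)
print(topics)
```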
The application uses a parallelized version of the LDA algorithm provided by the Gensim library for Python. More information on the Gensim implementation of the algorithm can be found in the Gensim documentation. The article Hoffman, Blei, Bach: Online Learning for Latent Dirichlet Allocation provides more insight into the theoretical basis of the algorithm.
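
A minimal training sketch is shown below, under the assumption that Gensim's `LdaMulticore` is the parallelized model referred to above; the toy corpus and all parameter values are placeholders that echo the options listed in the Usage section.

```python
# Hedged sketch of training Gensim's parallelized LDA model (LdaMulticore) with
# the kinds of options listed in the Usage section. Values are placeholders.
from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

if __name__ == "__main__":  # guard needed because LdaMulticore spawns worker processes
    # Toy stand-in for the preprocessed letters (see the preprocessing sketch above).
    docs = [
        ["king", "court", "parliament", "crown"],
        ["ship", "trade", "merchant", "port"],
        ["king", "parliament", "tax", "trade"],
    ]
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    lda = LdaMulticore(
        corpus=corpus,
        id2word=dictionary,
        num_topics=2,      # number of topics to be generated
        iterations=50,     # maximum iterations when inferring topic distributions
        alpha=0.1,         # prior over topic weights in documents
        eta="auto",        # prior over word weights in topics, learned from the corpus
        random_state=42,   # seed for the pseudorandom numbers
        workers=2,
    )

    for topic_id, words in lda.print_topics(num_words=4):
        print(topic_id, words)
```

Note that the "Auto option" for the alpha prior corresponds to `alpha="auto"`; in Gensim that setting is, to our knowledge, only available in the single-process `LdaModel`, not in `LdaMulticore`.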
Controls how many times the algorithm repeats a certain process, called the E-step, on each document. The E-step is a process during which the optimal values of the “variational parameters” are found for a document. The variational parameters are used to compute a lower bound for the log likelihood of the data, and when optimised will produce the tightest possible lower bound. Then inferences can be made about the log likelihood of the entire data, which is necessary for predicting which words belong to which topics.
Low alpha means each document is likely to consist of a few, or even one dominant topic. High alpha means each document is likely to consist of a mix of most of the topics.
Low eta means each topic is likely to be composed of only a few dominant words. High eta means each topic is likely to consist of a mixture of many words.
Ideally, we would like our documents to consist of only a few topics, and the words within those topics to belong to only one or a few of those topics. As such, alpha and eta can be adjusted to suit these purposes.