Data Science Master's Programme, University of Helsinki
- Description
- Links
- Installation
- Usage
- Theory
- Credits and Licence
- Backlog
This app allows for the exploration of a corpus of historical letters using data science methods. The app has two sections.
The first is the part of speech, or POS tag, visualisation section. Here there are two tabs, containing bar and line graphs, to give the user a general overview of the dataset. This section contains various options to filter and restrict the dataset to allow the user more freedom in their exploration.
The second part of the app is the topic model section. It allows the user to generate a chosen number of “topics” from the dataset using the latent Dirichlet allocation (LDA) algorithm. When properly filtered and parameterized, this lets the user see which topics dominated the discussion in the letters. The app offers a wide array of options so that the user can adjust the model to their own questions of interest.
- App: http://193.166.25.206:8050/app/overview
- CLAWS7 part-of-speech tagset: http://ucrel.lancs.ac.uk/claws7tags.html
- Clone the repository
- Create a virtual environment with `python3 -m venv venv`
- Activate the virtual environment with `source venv/bin/activate`
- Run `pip install -r requirements.txt`
- Add the data folder `TCEECE` to the local project root; this folder is ignored by Git to avoid spreading the data (see the `.gitignore` file)
- Start the app with `python index.py`
- Visit `http://127.0.0.1:8050/app/overview`
- Shows the percentage of the chosen POS categories over time
- User can select:
- Year range
- Period length (10 years, 20 years, ...)
- User can choose up to three lines to compare and options for each line are:
- Sender Sex (M, F)
- Pre-made class groupings
- Fine grained - Royalty (R), Nobility (N), Gentry Upper (GU), Gentry Lower (GL, G), Clergy Upper (CU), Clergy Lower (CL), Professional (P), Merchant (M), Other (O)
- Regular - Royalty (R), Nobility (N), Gentry (GU, GL, G), Clergy (CU, CL), Professional (P), Merchant (M), Other (O)
- Tripartite - Upper (R, N, GU, GL, G, CU), Middle (CL, P, M), Lower (O)
- Bipartite - Gentry (R, N, GU, GL, G, CU), Non-Gentry (CL, P, M, O)
- Relationship (between sender and recipient)
- Grouped: Family, Friends, Other relationships
- Fine grained: Nuclear family, Other family, Family servant, Close friend, Other acquaintance
- POS-tags
- User can set a custom name for the graph and for each line
- Shows the number of words, letters, or senders in the data selected in the line graph view
- The differently-coloured bars correspond to the lines selected in the line graph view
- Bars can be divided by:
- Sender's sex
- Sender's rank
- Sender's relationship with recipient
- Number of topics to be generated by the LDA model.
- Maximum number of iterations through the corpus when inferring the topic distribution.
- Parameter (alpha) which determines the prior distribution over topic weights in documents.
- Auto option: Learns an asymmetric prior from the corpus
- Parameter (eta) which determines the prior distribution over word weights in each topic.
- Auto option: Learns an asymmetric prior from the corpus
- Option to choose a starting point for the generation of pseudorandom numbers to be used in the algorithm.
- Option to instruct the algorithm to only consider words of the chosen word type.
- Option to instruct the algorithm to ignore words given in the list.
- Option to instruct the algorithm to ignore words appearing in fewer than the selected number of documents.
- Option to instruct the algorithm to ignore words appearing in more than the selected proportion of documents. Input is a decimal between 0.01 and 1. (These vocabulary-filtering options are illustrated in the sketch after this list.)
- Option to instruct the algorithm to only consider letters from senders of the chosen sex.
- Option to instruct the algorithm to only consider letters from senders of the chosen rank.
- Option to instruct the algorithm to only consider letters between senders and receivers of a chosen relationship status.
- Option to instruct the algorithm to only consider letters sent between the chosen years.
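
As a rough illustration of the stopword and document-frequency options above, the following minimal sketch applies them with Gensim's `Dictionary` (the Theory section notes that the app is built on Gensim). The toy corpus, the stopword list and the thresholds are placeholders, not the app's actual values.

```python
# Hedged sketch: applying the stopword and document-frequency filters with a
# Gensim Dictionary before LDA training. Corpus and thresholds are invented.
from gensim.corpora import Dictionary

docs = [
    ["king", "court", "parliament"],
    ["ship", "trade", "merchant"],
    ["king", "parliament", "trade"],
]
stopwords = {"king"}  # "ignore words given in the list"

dictionary = Dictionary(docs)

# Drop the user-supplied stopwords from the vocabulary.
dictionary.filter_tokens(bad_ids=[dictionary.token2id[w]
                                  for w in stopwords if w in dictionary.token2id])

# no_below: ignore words appearing in fewer than N documents.
# no_above: ignore words appearing in more than the given proportion of documents.
dictionary.filter_extremes(no_below=2, no_above=0.9)

# Bag-of-words corpus ready for model training.
corpus = [dictionary.doc2bow(doc) for doc in docs]
print(corpus)
```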
Latent Dirichlet allocation, or LDA for short, is an algorithm used for topic modelling in natural language processing. Topics are groups of items, in this case tokens, which belong together due to their usage and prominence in the texts. Topics can be used to explore a corpus by unearthing and classifying the underlying themes present in it.
Preprocessing of the data plays a large part in obtaining meaningful results with this method, as the majority of words do not contribute any information about the topics themselves but serve other purposes, such as conveying the subjects in question or linking together parts of a sentence. In our implementation, preprocessing includes transforming words into lowercase form, tokenization, lemmatization and filtering out of tokens that consist of only one character. The user can additionally select stopwords to be filtered out of the final data used for model training, as sketched below.
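
As a rough illustration only, a preprocessing pipeline along these lines could look like the sketch below. It assumes Gensim's `simple_preprocess` for lowercasing and tokenization and NLTK's WordNet lemmatizer; the app's actual implementation may use different tools, and the example letters are invented.

```python
# Hedged sketch of the preprocessing steps described above: lowercasing,
# tokenization, lemmatization, dropping one-character tokens and user-chosen
# stopwords. Library choices here are assumptions, not the app's exact stack.
from gensim.utils import simple_preprocess
from nltk.stem import WordNetLemmatizer  # requires nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()

def preprocess(text, stopwords=frozenset()):
    # simple_preprocess lowercases, tokenizes, and drops tokens shorter than min_len.
    tokens = simple_preprocess(text, min_len=2)
    lemmas = (lemmatizer.lemmatize(tok) for tok in tokens)
    return [lemma for lemma in lemmas if lemma not in stopwords]

raw_letters = [
    "My dearest brother, I write to thee concerning the estate.",
    "The Parliament sat yesterday and much was spoken of trade.",
]
docs = [preprocess(letter, stopwords={"thee", "thou"}) for letter in raw_letters]
print(docs)
```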
In brief, the algorithm works by iterating over the documents. First, each word w in a document is assigned to one of k topics at random. Then, conditional probabilities are calculated to represent the likelihood that w belongs to each topic, the topic assignments are updated based on these probabilities, and the process repeats.
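
To make this loop concrete, here is a self-contained toy sketch of the assign-then-update idea, in the style of a collapsed Gibbs sampler. It is purely illustrative: the Gensim backend described below uses online variational Bayes rather than this sampling scheme, and the corpus and constants are invented.

```python
# Toy illustration of the assign-then-update loop described above (a collapsed
# Gibbs-style sampler). Purely illustrative; not the Gensim algorithm the app uses.
import random
from collections import defaultdict

def toy_lda(docs, k, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [defaultdict(int) for _ in docs]        # topic counts per document
    topic_word = [defaultdict(int) for _ in range(k)]   # word counts per topic
    topic_total = [0] * k
    assignments = []

    # Step 1: assign every word to one of the k topics at random.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(k)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    # Step 2: repeatedly resample each word's topic from its conditional
    # probability, given the current assignments of all other words.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z_old = assignments[d][i]
                doc_topic[d][z_old] -= 1
                topic_word[z_old][w] -= 1
                topic_total[z_old] -= 1
                # p(topic t | rest) is proportional to
                # (how much document d uses t) * (how much t uses word w)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(k)
                ]
                z_new = rng.choices(range(k), weights=weights)[0]
                assignments[d][i] = z_new
                doc_topic[d][z_new] += 1
                topic_word[z_new][w] += 1
                topic_total[z_new] += 1

    return topic_word  # per-topic word counts after sampling

topics = toy_lda([["king", "court", "crown", "king"],
                  ["trade", "ship", "port", "trade"]], k=2)
print(topics)
```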
The application uses a parallelized version of the LDA algorithm provided by the Gensim library for Python. More information on the Gensim implementation of the algorithm can be found in the Gensim documentation. The article Hoffman, Blei, Bach: Online Learning for Latent Dirichlet Allocation provides more insight into the theoretical basis of the algorithm.
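
A minimal training sketch is shown below, under the assumption that Gensim's `LdaMulticore` is the parallelized model referred to above; the toy corpus and all parameter values are placeholders that echo the options listed in the Usage section.

```python
# Hedged sketch of training Gensim's parallelized LDA model (LdaMulticore) with
# the kinds of options listed in the Usage section. Values are placeholders.
from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

if __name__ == "__main__":  # guard needed because LdaMulticore spawns worker processes
    # Toy stand-in for the preprocessed letters (see the preprocessing sketch above).
    docs = [
        ["king", "court", "parliament", "crown"],
        ["ship", "trade", "merchant", "port"],
        ["king", "parliament", "tax", "trade"],
    ]
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    lda = LdaMulticore(
        corpus=corpus,
        id2word=dictionary,
        num_topics=2,      # number of topics to be generated
        iterations=50,     # maximum iterations when inferring topic distributions
        alpha=0.1,         # prior over topic weights in documents
        eta="auto",        # prior over word weights in topics, learned from the corpus
        random_state=42,   # seed for the pseudorandom numbers
        workers=2,
    )

    for topic_id, words in lda.print_topics(num_words=4):
        print(topic_id, words)
```

Note that the "Auto option" for the alpha prior corresponds to `alpha="auto"`; in Gensim that setting is, to our knowledge, only available in the single-process `LdaModel`, not in `LdaMulticore`.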
Controls how many times the algorithm repeats a certain process, called the E-step, on each document. The E-step is a process during which the optimal values of the “variational parameters” are found for a document. The variational parameters are used to compute a lower bound for the log likelihood of the data, and when optimised will produce the tightest possible lower bound. Then inferences can be made about the log likelihood of the entire data, which is necessary for predicting which words belong to which topics.
Low alpha means each document is likely to consist of a few, or even one dominant topic. High alpha means each document is likely to consist of a mix of most of the topics.
Low eta means each topic is likely to be composed of only a few dominant words. High eta means each topic is likely to consist of a mixture of many words.
Ideally, we would like our documents to consist of only a few topics, and the words within those topics to belong to only one or a few of those topics. As such, alpha and eta can be adjusted to suit these purposes.