
# Week 4: Natural language, complexity, and similarity

This week we delve more deeply into how language is used in text. In previous weeks, we tried out two main techniques, both of which rely, in different ways, on counting words. This week, we turn to more sophisticated techniques for identifying and measuring language use, and for comparing texts to one another. The article by @gomaa_survey_2013 provides an overview of different approaches to text similarity. We will cover these technical dimensions in the lecture.
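To make the idea of comparing texts concrete, here is a minimal sketch in base R of two common word-based similarity measures: Jaccard similarity (overlap in the sets of words used) and cosine similarity (the angle between word-count vectors). The three toy sentences and the helper names `tokenize()`, `jaccard()`, and `cosine()` are invented for illustration; they do not come from the readings, and the whitespace tokenizer is a deliberate simplification.

```{r}
# Toy documents (invented for illustration; not taken from the readings)
doc_a <- "the cat sat on the mat"
doc_b <- "the cat lay on the rug"
doc_c <- "stocks fell sharply on monday"

# Hypothetical helper: lowercase a text and split it on whitespace
tokenize <- function(text) {
  strsplit(tolower(text), "\\s+")[[1]]
}

# Jaccard similarity: shared word types divided by all word types
jaccard <- function(x, y) {
  length(intersect(x, y)) / length(union(x, y))
}

# Cosine similarity: compare word-count vectors over a shared vocabulary
cosine <- function(x, y) {
  vocab <- union(x, y)
  cx <- as.numeric(table(factor(x, levels = vocab)))
  cy <- as.numeric(table(factor(y, levels = vocab)))
  sum(cx * cy) / (sqrt(sum(cx^2)) * sqrt(sum(cy^2)))
}

tok_a <- tokenize(doc_a)
tok_b <- tokenize(doc_b)
tok_c <- tokenize(doc_c)

jaccard(tok_a, tok_b)  # 3 shared types out of 7 in total
cosine(tok_a, tok_b)   # 0.75
jaccard(tok_a, tok_c)  # only "on" is shared
cosine(tok_a, tok_c)
```

The two measures respond to different things: Jaccard ignores how often a word appears, whereas cosine weights repeated words (here, "the") more heavily. In practice you would usually remove stopwords or reweight counts before comparing texts.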

The article by @urman_matter_2021 investigates a key question in contemporary communications research, namely what information we are exposed to online, and shows how we might compare web search results using similarity measures. The @schoonvelde_liberals_2019 article, on the other hand, looks at the "complexity" of texts, and compares how politicians of different ideological stripes communicate.
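One widely used way of turning "complexity" into a number is a readability score. The sketch below computes the Flesch-Kincaid grade level, which combines average sentence length with average syllables per word. It is offered only as an illustration of the general idea, not as a reproduction of the measure used in any of the readings. The helper names and example sentences are invented, and the syllable counter is a crude assumption (it counts runs of vowels), so treat the scores as approximate.

```{r}
# Hypothetical helper: approximate syllables as runs of consecutive vowels
count_syllables <- function(word) {
  groups <- gregexpr("[aeiouy]+", tolower(word))[[1]]
  n <- sum(groups > 0)  # gregexpr returns -1 when there is no match
  max(n, 1)             # every word counts as at least one syllable
}

# Flesch-Kincaid grade level:
# 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
fk_grade <- function(text) {
  # Split into sentences on ., ! or ? and drop empty fragments
  sentences <- trimws(strsplit(text, "[.!?]+")[[1]])
  sentences <- sentences[nzchar(sentences)]

  # Extract words as runs of letters and apostrophes
  words <- regmatches(text, gregexpr("[A-Za-z']+", text))[[1]]
  syllables <- sum(vapply(words, count_syllables, numeric(1)))

  0.39 * (length(words) / length(sentences)) +
    11.8 * (syllables / length(words)) - 15.59
}

# Invented example sentences
simple_text  <- "The cat sat. The dog ran."
complex_text <- "Macroeconomic stabilisation necessitates comprehensive institutional coordination."

fk_grade(simple_text)
fk_grade(complex_text)
```

Higher scores indicate text that is nominally harder to read. Because the score depends only on sentence and word length, it is worth asking what it misses, which connects to the question below about biases in measuring sophistication.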

Questions:

1. How do we measure linguistic complexity/sophistication?
2. What biases might be involved in measuring sophistication?
3. What other applications might there be for similarity measures?

**Required reading**:

- @urman_matter_2021
- @schoonvelde_liberals_2019
- @gomaa_survey_2013

**Further reading**:

- @voigt_language_2017
- @peng_quantitative_2002
- @lowe_understanding_2008
- @bail_fringe_2012
- @ziblatt_wealth_2020
- @benoit_measuring_2019

**Slides**:

- Week 4 [Slides](https://docs.google.com/presentation/d/1SpEZVfejaul9dQyeaIvQlSIyLVyRJ8w5yMzr7LoVnc4/edit?usp=sharing)