DHOxSS - Text to Tech

Materials for the Text to Tech workshop at the Digital Humanities Oxford Summer School by Kaspar von Beelen, Mariona Coll Ardanuy and Federico Nanni.

Google Colab

The workshop will mostly rely on Google Colab for the hands-on activities.

Day 1

Welcome slides
Intro to Python (a)
Intro to Python (b)
Intro to Python (c)
Functions

Day 2

Opening Files
Basic Text Processing
Regular Expressions
Lists, sets and tuples
Dictionaries and JSON
Text Processing Exercises
Data Structures Exercises

Day 3

Libraries
Working with tabular data
Exercises on Information Retrieval

Day 4

Introduction to Machine Learning for NLP: slides
Intro to NLP (1)
Intro to NLP (2)
Intro to NLP (3)
Intro to NLP (4)
Introduction to Language Modelling: slides
Word embeddings (1)
Word embeddings (2)

Day 5

Introduction to Foundation Models and Transfer Learning: slides
Transformers for NLP
Introduction to Generative AI: slides
Poking LLMs with HuggingFace
Using local LLMs

Local installation

Our entire course will run on Google Colab. If you want to set up the notebooks locally on your machine, follow the instructions below. However, bear in mind that some of the tools might not work well on certain ~~old~~ laptops (especially from Day 4 onwards).

Install Anaconda
Download the content of this repository and unzip
Open Anaconda Navigator
From Anaconda, create an environment named py39
Install JupyterLab in the environment
Launch JupyterLab
Open a terminal in JupyterLab
Run the following in the terminal, step by step:
- Activate the environment: conda activate py39
- Update pip: pip install --upgrade pip
- Change directory using the cd command in the terminal until you are in the course folder. There you should run: pip install -r requirements.txt
- Add the environment to Jupyter, either by following the instructions from here or by running: ipython kernel install --user --name=py39
- You can then start using the notebooks: select py39 as the kernel (restart JupyterLab if the correct kernel does not show).

You can find more detailed instructions here.
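
Once the steps above are done, a quick sanity check can confirm that the notebook is running on the py39 kernel and that the packages from requirements.txt were installed. The snippet below is only a minimal sketch, not part of the course materials: the helper names parse_requirement_names and missing_packages are hypothetical, the parser is deliberately rough (it ignores pip options and does not handle URL-based requirements), and it assumes the notebook is opened from the course folder so that requirements.txt is in the current directory.

```python
import sys
from importlib import metadata
from pathlib import Path


def parse_requirement_names(text):
    """Extract bare package names from requirements.txt-style text.

    Rough parser (an assumption for illustration): it drops comments,
    blank lines and pip options such as "-r other.txt", then cuts each
    line at the first version specifier, extras bracket or marker.
    """
    names = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue
        for sep in ("==", ">=", "<=", "~=", "!=", ">", "<", "[", ";", " "):
            line = line.split(sep, 1)[0]
        names.append(line.strip())
    return names


def missing_packages(names):
    """Return the names that are not installed in the current environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # The interpreter path should point inside the py39 environment.
    print("Python version:", sys.version.split()[0])
    print("Interpreter:", sys.executable)

    requirements = Path("requirements.txt")
    if requirements.exists():
        names = parse_requirement_names(requirements.read_text())
        print("Missing packages:", missing_packages(names) or "none")
    else:
        print("requirements.txt not found in", Path.cwd())
```

If the interpreter path does not point to the py39 environment, the notebook is probably using a different kernel; selecting py39 from the kernel menu (or restarting JupyterLab) should resolve it.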

Data

Datasets used:

The Living Machines atypical animacy dataset, freely available here.
MuSe: The Musical Sentiment Dataset Muse
A historical dataset on popular baby names in the United States from 1880 onwards. Available here.
A sample of British Library 19th Century Books collected from here.
A sample of British Newspapers articles, digitized by Heritage Made Digital.

Background reading (optional):

Walsh, Melanie. Introduction to Cultural Analytics & Python, https://melaniewalsh.github.io/Intro-Cultural-Analytics/welcome
Karsdorp, Folgert. Python Programming for Humanists. http://www.karsdorp.io/python-course/.
Montfort, Nick. Exploratory Programming for the Arts and Humanities. Cambridge, Massachusetts: The MIT Press, 2016. https://mitpress.mit.edu/books/exploratory-programming-arts-and-humanities.
Sinclair, Stéfan, and Geoffrey Rockwell. The Art of Literary Text Analysis. Melissa Mony., 2016. https://github.com/sgsinclair/alta/blob/77b256f7c3ff3ceb6643d53da401096c8cdcc468/ipynb/ArtOfLiteraryTextAnalysis.ipynb.
Graham, Shawn, Ian Milligan, Scott Weingart. The Historian's Macroscope. Under contract with Imperial College Press. Open Draft Version, Autumn 2013, http://themacroscope.org
Downey, Allen, Peter Wentworth, Jeffrey Elkner, and Chris Meyers. “How To Think Like A Computer Scientist: Learning with Python 3.” (2016).
Karsdorp, Folgert, Mike Kestemont and Allen Riddell, Humanities Data Analysis: Case Studies with Python, https://www.humanitiesdataanalysis.org

Advanced reading list (optional):

Jurafsky, Daniel, and J. H. Martin. "Vector semantics and embeddings." Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (2019): 94-122. https://web.stanford.edu/~jurafsky/slp3/6.pdf
Smith, Noah A. "Contextual word representations: A contextual introduction." arXiv preprint arXiv:1902.06006 (2019). https://arxiv.org/pdf/1902.06006.pdf
Boleda, Gemma. "Distributional semantics and linguistic theory." Annual Review of Linguistics 6 (2020): 213-234. https://arxiv.org/pdf/1905.01896.pdf
Rogers, Anna. "Changing the World by Changing the Data." arXiv preprint arXiv:2105.13947 (2021). https://arxiv.org/pdf/2105.13947.pdf
Wevers, Melvin, and Marijn Koolen. "Digital begriffsgeschichte: Tracing semantic change using word embeddings." Historical Methods: A Journal of Quantitative and Interdisciplinary History 53, no. 4 (2020): 226-243. https://www.tandfonline.com/doi/pdf/10.1080/01615440.2020.1760157
Bender, Emily M., and Alexander Koller. "Climbing towards NLU: On meaning, form, and understanding in the age of data." In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5185-5198. 2020. https://www.aclweb.org/anthology/2020.acl-main.463.pdf
Grimmer, Justin, Margaret E. Roberts, and Brandon M. Stewart. "Text as data: A new framework for machine learning and the social sciences." Princeton University Press, 2022. https://press.princeton.edu/books/paperback/9780691207551/text-as-data

Other Resources

This course is based upon many previous resources. Apart from the ones above:

Nilo Pedrazzini's introduction notebook to Word2Vec.
Materials from previous editions of this course, written by Barbara McGillivray and Gard Jenset
The Turing's Research Software Engineering and Research Data Science Courses
The Turing Way
The Turing Digital Humanities & Research Software Engineering Summer School
Fede's Computational Text Analysis Course

Resources mentioned during the workshop: slides
