This repository contains teaching materials for Saarland University's Working With Corpora (WwC) program. It has been adapted from the teaching materials and additional resources used by Research Platforms Services at the University of Melbourne to teach Python, IPython, Jupyter and the Natural Language Toolkit (NLTK).
Essentially, the idea of both programs is to run free training in reproducible research methods and tools via a cloud platform, so that nobody has to worry about installation, operating-system or hardware problems. All code is written and executed within Jupyter Notebooks, allowing easy access to earlier input and output, as well as rich display of text and images.
Learn more about WwC Python sessions at this URL. Want to join? Register by filling in this form. Subscribe to the mailing list and check the calendar to keep up-to-date with WwC activities.
All the materials used in the workshops are in this repository. In fact, cloning this repository will be our first activity together as a group. To do that, just open your terminal and type/paste:
git clone https://github.com/interrogator/wwc.git
Though we'll be working with blank notebooks in our training sessions, everything we cover lives as a complete notebook in the `resources/completed-notebooks` directory. These notebooks are useful for remembering or extending what you learned during training. Alternatively, they may be useful for those who cannot attend our sessions face-to-face.
Below is a basic overview of the four-session lesson plan. You can click the headings to view complete versions of the Jupyter Notebooks we'll be using in each session. The materials are always evolving, and pull requests are always welcome.
In this session, you will learn how to use the Jupyter Notebook, as well as how to complete basic tasks with Python/NLTK.
- Getting up and running
- What exactly are Python, Jupyter and NLTK?
- Introductions to Jupyter
- Overview of basic Python concepts: significant whitespace, input/output types, commands and arguments, etc.
- Introduction to NLTK
- Quickstart: US Inaugural Addresses Corpus
- Plot key terms in the inaugural addresses longitudinally (a short sketch follows this list)
- Discussion: Why might we want to use NLTK? What are its limitations?
- Working with variables
- Writing functions
- Creating frequency distributions
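For a taste of where this first session ends up, here is a minimal sketch, closely following the inaugural-corpus example in the NLTK book, that plots how often words beginning with "america" and "citizen" occur in each address. It assumes NLTK and matplotlib are installed:

```python
import nltk
from nltk.corpus import inaugural

# The corpus is a separate data package; download it once
nltk.download('inaugural')

# For each address (named e.g. '1789-Washington.txt'), count tokens that
# start with each key term, keyed by the year taken from the filename
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for word in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if word.lower().startswith(target))

cfd.plot()
```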
In this session, we put our existing skills to work in order to investigate the corpora that come bundled with NLTK. The major kinds of processes we cover are listed below, followed by a short sketch of a few of them:
- Sentence splitting
- Tokenisation
- Keywords
- n-grams
- Collocates
- Concordancing
- POS tagging
- Lemmatisation
- Exploring annotated data
- Writing a concordancer
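A minimal sketch of a few of these processes on a toy sentence, assuming the relevant NLTK data packages are available:

```python
import nltk

# The tokenisers and tagger each rely on a downloadable data package
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

text = "NLTK splits sentences, tokenises them, and tags each token."

sentences = nltk.sent_tokenize(text)   # sentence splitting
tokens = nltk.word_tokenize(text)      # tokenisation
tagged = nltk.pos_tag(tokens)          # POS tagging
bigrams = list(nltk.bigrams(tokens))   # n-grams (here, n=2)

print(sentences)
print(tagged)
print(bigrams[:3])
```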
This lesson focusses on regular expressions, as implemented in Python.
- Introduction to the syntax
- Regular expressions in Python: `compile()`, `match()`, `findall()`, `search()` (sketched after this list)
- Regular expressions in the shell terminal: `sed`, `grep`
- Resources around the web
- Checkers
- Cheatsheets
- Crosswords
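Here is a minimal sketch of those four Python functions; the pattern and the example sentence are only illustrative:

```python
import re

# Compile a pattern once, then reuse it: here, capitalised words
pattern = re.compile(r'\b[A-Z][a-z]+\b')

text = "Python and Jupyter are taught in the Working With Corpora sessions."

print(pattern.findall(text))   # every match in the string
print(pattern.match(text))     # a match only if the string starts with the pattern
print(pattern.search(text))    # the first match anywhere in the string
```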
This session introduces HTML and XML markup, and how they can be manipulated with Python using `beautifulsoup` and `lxml`.
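A minimal sketch of parsing a made-up snippet of HTML, assuming beautifulsoup4 and lxml are installed:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>Working With Corpora</h1>
  <p class="intro">Markup can be parsed and searched with Python.</p>
  <p>Another paragraph.</p>
</body></html>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.h1.text)                            # the heading text
print(soup.find('p', class_='intro').text)     # the first paragraph, found by class
print([p.text for p in soup.find_all('p')])    # all paragraph texts
```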
Python can be used to work with local and remote files. This session centres on using the `os`, `glob` and `fnmatch` modules to find, create, copy, move and delete files and directories.
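A minimal sketch of the three modules; the file patterns and directory names here are only illustrative:

```python
import fnmatch
import glob
import os

# Find Markdown files in the current directory
print(glob.glob('*.md'))

# Walk a directory tree, matching notebooks by pattern
for root, dirs, files in os.walk('.'):
    for name in fnmatch.filter(files, '*.ipynb'):
        print(os.path.join(root, name))

# Create a directory (no error if it already exists)
os.makedirs('output', exist_ok=True)
```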
Our final session involves:
- Answering important issues raised by students in earlier sessions
- Discussing local workflows:
- How do we get everything set up on our own devices?
- How do we get help when things go wrong?
- How should I store/share my code?
- Brainstorming ideas for the future
You're more than welcome to submit a pull request with changes to our course materials.
The `.ipynb` files used by both students and instructors are automatically generated using notedown. Accordingly, the best way to modify our course materials is to update the `.md` file, rather than the `.ipynb` file. These also live in `resources/completed-notebooks`.
Longer functions/solutions to challenges may be archived in `resources/scripts.py`, so that they can be imported by instructors/helpers in students' notebooks if need be.
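For instance, from a notebook at the root of the repository, something along these lines should work; `make_concordance` is a hypothetical name, standing in for whatever actually lives in `resources/scripts.py`:

```python
import sys

# Make the resources directory importable from a notebook at the repo root
sys.path.append('resources')

# 'make_concordance' is a hypothetical example name; import whatever
# function actually lives in resources/scripts.py
from scripts import make_concordance
```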