Research Workshop on Computational Tools for Digital Data Collection

If you have comments, questions, or suggestions, please create an issue.

Description

This workshop focuses on digital data collection using R (in most cases), Python, and UNIX command-line tools. Three lecture-style sessions will introduce graduate students to advanced techniques in web-scraping, PDF-scraping, and social media scraping. Three seminar-style sessions will give graduate students the opportunity to receive feedback on their data collection strategies.

The objective of this workshop is practical: graduate students will develop and execute data collection strategies in each of the three thematic modules, with the final deliverable being three complete and clean datasets. As such, we will expect graduate students involved in the workshop to identify resources (e.g., administrative databases, archival documents, social media accounts) that they wish to scrape.

The emphasis of this course is on data collection, rather than data analysis. However, as the goal of data collection is typically analytical, we will assume a familiarity with conventional approaches to statistical inference in the social sciences.

Logistics

Co-instructors

Jae Yeon Kim

jaeyeonkim@berkeley.edu

Nicholas Kuipers

nkuipers@berkeley.edu

Time and Location

Date: TBD

Location: Zoom

All course materials will be posted on Github at https://github.com/jaeyk/digital_data_collection_workshop, including class notes, code demonstrations, sample data, and assignments.

Accessibility

This class is committed to creating an environment in which everyone can participate, regardless of background, discipline, or disability. If you have a particular concern, please contact the instructors as soon as possible so that we can make special arrangements.

Books and Other Resources

There are no official textbooks for this class. Please see the references (to be updated throughout the semester) for additional resources, and the style guides for efficient programming and project management.

Computer Requirements

The software needed for the course is as follows:

Access to the UNIX command line (e.g., a Mac laptop, a Bash wrapper on Windows)
Git
R and RStudio (latest versions)
Anaconda and Python 3 (latest versions)

This requires a computer that can handle all this software. Almost any Mac will do the job. Most Windows machines are fine too if they have enough space and memory.

You must have all the software downloaded and installed PRIOR to the first day of class.

See this guideline for more information on installation.

Curriculum Outline / Schedule

The schedule is subject to change based on the class's rate of progress.

To view the course contents interactively, please .
To view the HTML rendered course contents, please click [Notebook].

Techniques in automating data collection workflow [Notebook]

September 16, 2020: Automating data collection workflow
- Instructor: Kim
- Style: Lecture
- Description: introduction to the tidyverse; discussion of efficient and reproducible ways to collect and wrangle data
- R Packages: dplyr, purrr
- References:
  - Kim, How to Automate Repeated Things in R (GitHub)
  - Kim, Advanced Wrangling Workshop in R (GitHub)

Techniques in social media scraping [Notebook 1] [Notebook 2] [Notebook 3] [Online book chapter]

September 30, 2020: Introduction to tweet parsing
- Lead Instructor: Kim
- Style: Lecture
- Description: Introduction to techniques of collecting and parsing social media data with emphasis on Twitter
- Command-line tool: twarc
  - Installation guideline
- R packages:
  - RESTful API: tweetscores, twitteR, rtweet
  - Streaming API: streamR
  - Parsing: tidyjson, tidytweetjson
- References:
  - Pablo Barberá, LSE Social Media Workshop
  - Steinert-Threlkeld, 2020 APSA Short Course Generating Event Data From Social Media
  - Kim, Large-scale Twitter Analysis on COVID-19 and Anti-Asian Climate (GitHub)
October 7, 2020: Tweet parsing workshop
- Instructor: Kim + Kuipers
- Style: Seminar
- Description: Graduate students provide/receive feedback on tweet parsing data collection strategies
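
As a starting point for the tweet parsing module, here is a minimal Python sketch that flattens collected tweets into records. It assumes the tweets are stored as line-delimited JSON, one tweet per line (the format twarc writes), and that the field names (`id_str`, `created_at`, `user.screen_name`, `full_text` or `text`) follow the Twitter v1.1 payload. The function name `parse_tweets` and the sample records are hypothetical and only there for illustration.

```python
import json


def parse_tweets(lines):
    """Turn line-delimited JSON (one tweet per line) into flat records."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        tweet = json.loads(line)
        records.append({
            "id": tweet.get("id_str"),
            "created_at": tweet.get("created_at"),
            "screen_name": tweet.get("user", {}).get("screen_name"),
            # Assumption: extended payloads use "full_text", others use "text".
            "text": tweet.get("full_text", tweet.get("text")),
        })
    return records


# Hypothetical sample standing in for a file of collected tweets.
sample_lines = [
    '{"id_str": "1", "created_at": "Wed Sep 30 17:00:00 +0000 2020", '
    '"user": {"screen_name": "example_user"}, "full_text": "hello"}',
    "",
    '{"id_str": "2", "created_at": "Wed Sep 30 17:05:00 +0000 2020", '
    '"user": {"screen_name": "another_user"}, "text": "short form"}',
]

tweets = parse_tweets(sample_lines)
for tweet in tweets:
    print(tweet["screen_name"], tweet["text"])
```

With a real collection, the same function could be fed an open file handle instead of a list, since it only iterates over lines.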

Techniques in pdf-scraping

October 14, 2020: No workshop -- Indigenous Peoples’ Day
October 21, 2020: PDF-parsing
- Lead instructor: Kuipers
- Style: Lecture
- Description: introduction to techniques of pdf-scraping; where to look for documents; how to know what to pre-process by hand; identifying recurring patterns in text to exploit for data wrangling; parallel processing
- R Packages: tesseract, magick, zoo, parallel, pdftools
- References:
  - Mock, Beer and pdftools
  - Vaughan, Tidying Excel cash flow spreadsheets using R
October 28, 2020: PDF-parsing workshop
- Instructor: Kuipers + Kim
- Style: Seminar
- Description: Graduate students provide/receive feedback on PDF-parsing data collection strategies
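
The idea of identifying recurring patterns in text can be sketched in Python with a regular expression. This assumes the text has already been extracted from the PDF (for example with pdftools in R) and that each data row follows the same layout: a name, dot leaders, a count, more dot leaders, and a percentage. The page text, the `ROW` pattern, and the `extract_rows` helper are all hypothetical; a real document will need its own pattern.

```python
import re

# Hypothetical page text standing in for text already extracted from a PDF.
page_text = """District report 2019
North District ... 1,204 ........ 3.5%
South District ... 987 .......... 2.1%
Page 4 of 12
"""

# One data row: name, dot leaders, count (optional commas), dot leaders, percentage.
ROW = re.compile(
    r"^(?P<name>[A-Za-z ]+?)\s*\.{2,}\s*(?P<count>[\d,]+)\s*\.{2,}\s*(?P<share>[\d.]+)%$"
)


def extract_rows(text):
    """Keep only lines that match the row pattern; convert numbers on the way."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue  # titles, page footers, and other non-data lines
        rows.append({
            "name": match.group("name").strip(),
            "count": int(match.group("count").replace(",", "")),
            "share": float(match.group("share")),
        })
    return rows


rows = extract_rows(page_text)
for row in rows:
    print(row)
```

Lines that do not match are skipped silently here; when building a real dataset it is usually worth logging them, since they show what still needs pre-processing by hand.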

Techniques in web-scraping

November 4, 2020: Web-scraping
- Instructor: Kuipers
- Style: Lecture
- Description: introduction to techniques of web-scraping; identifying and exploiting underlying database structures; knowing when to quit
- R Packages: rvest, jsonlite, zoo, xml, ralger
- Chrome plugin: SelectorGadget
- References:
  - Terman, 3I: Web Scraping and Data Management in R (GitHub)
  - Vaughan, Which RStudio blog posts “pleased” Hadley? A tidytext + web scraping analysis
November 11, 2020: Web-scraping workshop
- Instructor: Kuipers + Kim
- Style: Seminar
- Description: Graduate students provide/receive feedback on web-scraping data collection strategies
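
For the web-scraping module, the step of turning an HTML table into records can be sketched in Python using only the standard library. The workshop itself lists rvest for this, so this is an illustrative alternative rather than the course's method. The `TableParser` class and the inline HTML are hypothetical; a real scrape would first download the page and would usually rely on a dedicated parsing library.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page fragment standing in for a downloaded web page.
html = """
<table>
  <tr><th>Office</th><th>Year</th></tr>
  <tr><td>Mayor</td><td>2018</td></tr>
  <tr><td>Council</td><td>2020</td></tr>
</table>
"""

parser = TableParser()
parser.feed(html)

# Treat the first row as the header and pair it with each remaining row.
header, *body = parser.rows
records = [dict(zip(header, row)) for row in body]
print(records)
```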
