Hello! This git repository contains the supplementary materials for John McLevey (2021) Doing Computational Social Science, Sage, UK. A few small things have changed in this repo since the book was published. If you are using this repo alongside the book, please take a moment to read this file carefully. It will make it easier to access and use the resources you are looking for. If you run into any problems, please file an issue.
The book refers to a DCSS virtual environment that can be created from the YAML file `environment.yml`, which you can find in the root directory of this repo. It has all the packages and other dependencies needed to execute the code in the book. You can create it on your own system using Conda, as described in the book. From this directory:
```shell
conda env create -f environment.yml
```
Conda can be a little slow, so be patient. Once it has downloaded and installed the packages, you can activate and use the virtual environment as expected:
```shell
conda activate dcss
```
But... what if things don't go as expected?
As far as virtual environments go, this is a pretty big one. If you are on an older system, or one with limited memory, you might run into some installation issues. In that case, we recommend that you use Mamba instead of Conda to install the environment. Mamba is a much faster and more efficient cross-platform package manager than Conda, and can easily handle installing the DCSS environment. You may want to use it even if you aren't using a system with limited memory!
You can find the installation instructions for Mamba here. In most cases, you should be able to just run the commands below, which download the Mamba install script (with `curl`) and then run the installer.
```shell
curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh"
bash Mambaforge-$(uname)-$(uname -m).sh
```
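The installer filename is assembled from your platform at download time: `uname` supplies the kernel name and `uname -m` the machine architecture. If you're curious which installer the command above will fetch on your machine, you can preview the filename before downloading anything:

```shell
# Print the installer filename the curl command will request.
# $(uname) expands to the kernel name (e.g. Linux, Darwin);
# $(uname -m) expands to the architecture (e.g. x86_64, arm64).
fname="Mambaforge-$(uname)-$(uname -m).sh"
echo "$fname"
```

On an x86-64 Linux machine, for example, this prints `Mambaforge-Linux-x86_64.sh`.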
Once you have Mamba installed, you can use it in place of Conda:

```shell
mamba env create -f environment.yml
```

and to activate:

```shell
mamba activate dcss
```
The DCSS environment comes with Tiago Peixoto's `graph-tool`, which is challenging to install on Windows systems. If you are running into this issue and don't have access to a system running Linux or macOS, you can install a version of the DCSS environment that does not include `graph-tool`. That environment is `environment-windows-no-gt.yml`, which you can also find in the root directory.
```shell
mamba env create -f environment-windows-no-gt.yml
mamba activate dcss
```
Installing the DCSS environment is only helpful if you can make use of it in your development environment of choice. Throughout the book, we assume that you're using Jupyter (Lab or Notebook) to follow along with the code examples. To run code in the DCSS environment from within Jupyter, perform the following steps:
- Activate the DCSS environment using `conda activate dcss`
- Register the DCSS environment as a Jupyter kernel using `python -m ipykernel install --user --name=dcss`
If, when attempting this, you receive an error about `ipykernel` not being installed, you can install it using `conda install ipykernel` while the DCSS environment is active.
If successful, you should be able to select a `dcss` kernel from the 'Change Kernel' menu in Jupyter (typically found in the 'Kernel' dropdown at the top of the screen).
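If you'd like the registration step to fail gracefully when `ipykernel` is missing, a small wrapper along these lines can help. This is a convenience sketch, not something from the book; it assumes the dcss environment is already active so that `python` resolves to the environment's interpreter:

```shell
# Register the dcss kernel only if ipykernel is importable;
# otherwise report what needs to be installed first.
if python -c "import ipykernel" >/dev/null 2>&1; then
    python -m ipykernel install --user --name=dcss
    status="registered"
else
    status="missing-ipykernel"
    echo "ipykernel not found; run: conda install ipykernel"
fi
echo "$status"
```

Either way the script exits cleanly, so it is safe to include in a larger setup script.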
You can install the `dcss` package using pip:

```shell
pip install dcss
```
The package source code is also hosted in this repo. If you like, you can browse it in PATH TO PACKAGE SOURCE CODE.
The `data` directory contains all of the datasets that I use in the book, or in the accompanying problem sets (see below). However, some of these datasets are very large, are updated frequently, or are maintained by other people. In most cases, what you will find here are (1) large samples drawn from the original source datasets and (2) instructions on how to access the full datasets from their original sources.
In `data/`, you will find:
- A filtered and subsetted version of the Version 11 Varieties of Democracy dataset, released in 2021.
- A large sample from the Canadian Hansard.
- A large sample from the British Hansard.
- A variety of social network datasets collected by the SocioPatterns Team.
- A variety of social network datasets collected by the Copenhagen Networks Study Team.
- A "Freedom on the Net" dataset published by Freedom House in 2020.
- A collection of small datasets related to campaign spending in the 2020 American General Election, as well as the Cook Partisan Voting Index.
- Instructions on where to find the European Values Survey data, which is freely available but requires registration.
- A sample of the Russian Trolls dataset distributed by fivethirtyeight, available in full here.
Other datasets may be added over time, depending on what I am teaching in my own classes and using in the problem sets (below), and what others generously make available.
NOTE: More information coming soon.
`questions_and_problem_sets` contains all of the questions and problems associated with each chapter in the book. These materials were developed collaboratively, with much of the work done by my PhD students Pierson Browne and Tyler Crick in the context of TAing the graduate and undergraduate versions of my Computational Social Science class at the University of Waterloo in Winter and Spring 2021. These materials will grow and evolve over time (typically coinciding with semesters when I am teaching Computational Social Science).
The book contains several chapters on contextual embeddings models, including how to train a variety of different types of embedding models over long time periods (e.g., over 100 years of large-scale text data).
Not everyone has access to the computational resources needed to train models like these, and there are very good reasons (e.g., limiting energy use) to avoid re-training them needlessly. As such, my students and I have made all of the contextual embedding models we've trained for this book and for a few related projects available here. These models were trained in my lab using our own servers, and will be updated over time as new data is released.
The `pretrained_models` directory contains the models we trained for the Canadian Hansard and 120 years of academic scholarship on democracy and autocracy (see McLevey, Crick, Browne, and Durant 2022, "A new method for computational cultural cartography: From neural word embeddings to transformers and Bayesian mixture models").
Access information coming soon.
The `figures` directory contains every figure from the book as vector graphics (PDF), and in a few cases (e.g., screenshots of websites) high-resolution PNGs.
`supplementary_content` contains a number of notebooks that go beyond what is covered in the book. As of right now, it contains notebooks on collecting data from social media APIs (Twitter and Reddit) and on web scraping with Selenium. It also contains additional content on analytical Bayesian inference, intended to clarify the basic logic of Bayesian inference. If Bayesian inference is new to you, I suggest working through this example after working on Chapters 23 and 24.
`advice` is a collection of little bits of wisdom and practical advice from a number of computational social scientists and data scientists on a wide variety of topics. Many more will be added over time.
For the book:

```bibtex
@book{mclevey2022computational,
  title     = {Doing Computational Social Science},
  author    = {McLevey, John},
  year      = {2022},
  location  = {London, UK},
  publisher = {Sage}
}
```
For this online supplement:

```bibtex
@misc{dcss,
  title        = {Doing Computational Social Science Online Supplement},
  author       = {McLevey, John and Browne, Pierson and Crick, Tyler and Graham, Sasha},
  year         = {2021},
  howpublished = {\url{https://github.com/UWNETLAB/doing_computational_social_science}}
}
```