4CAT Architecture
4CAT is a tool built around a set of three objects that capture, process and describe the data. These are high-level concepts that are core to 4CAT's architecture, but also map quite precisely to specific data structures within 4CAT's code. Links to such code are given below.
A data source is a collection of files that define how to retrieve data from a given source in a form that 4CAT understands. Each data source defines a search worker: a special type of processor that takes no dataset as input, but produces a dataset from an external source based on the given parameters. For example, the Reddit data source uses the given parameters to query the Pushshift API and saves the data Pushshift returns as a CSV file in a standardised format with the following columns:
- id: A unique identifier for the item.
- thread_id: A unique identifier for a collection of items. In practice, this will often be a 'thread'; e.g. a Reddit thread (with all posts in it as items); an Instagram post (with all comments to it as items); a Telegram channel (with all messages in it).
- timestamp: The timestamp at which the item was created.
- author: Author of the item. Optionally, this will be pseudonymised.
- body
These columns are provided by all data sources (for some they may be empty) so that the data can subsequently be processed in a standardised way. Some data sources may provide additional columns (e.g. subreddit for Reddit datasets). Next to the code that produces this result, data sources also contain a definition for a user-facing form that is used within the web interface to allow users to create new datasets.
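As a rough illustration of the standardised format described above, the following sketch writes items to a CSV file with the five common columns first and any source-specific columns after them. The function name `write_items`, the example item and the timestamp format are assumptions made for this sketch; they are not taken from 4CAT's code.

```python
import csv
import io

# The five columns every data source is described as providing.
STANDARD_COLUMNS = ["id", "thread_id", "timestamp", "author", "body"]


def write_items(items, handle, extra_columns=()):
    """Write items as CSV: standard columns first, then source-specific ones.

    Missing values are written as empty strings, mirroring the note that
    some columns may be empty for some data sources.
    """
    fieldnames = STANDARD_COLUMNS + list(extra_columns)
    writer = csv.DictWriter(
        handle, fieldnames=fieldnames, restval="", extrasaction="ignore"
    )
    writer.writeheader()
    for item in items:
        writer.writerow(item)
    return fieldnames


# Hypothetical item for illustration only (not real data).
example_items = [
    {
        "id": "abc1",
        "thread_id": "t1",
        "timestamp": "2020-01-01 12:00:00",  # format is an assumption
        "author": "user",
        "body": "Hello",
        "subreddit": "example",
    }
]

buffer = io.StringIO()
columns = write_items(example_items, buffer, extra_columns=["subreddit"])
header_line = buffer.getvalue().splitlines()[0]
```

Because the standard columns always come first and are always present, later processing steps can rely on them regardless of which data source produced the file.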
4CAT is, when it comes down to it, a machine for turning one dataset into another. An initial dataset is created from a data source. The dataset in practice consists of a file (the data) and a database record (the metadata). The file is often a CSV file, but some processors also produce other files such as SVGs or zip archives. The metadata provides the 'provenance' of the dataset. Notably, it contains the following information:
- software_version: The git commit ID of the commit that was checked out when the dataset was created.
- software_file: The Python file responsible for creating the dataset.
- parameters: Input parameters provided by the user that created the dataset.
Together with a number of miscellaneous other bits of metadata, this in principle makes the dataset reproducible: it becomes possible to identify the exact version of 4CAT and the exact Python code that generated the data file. In the 4CAT interface, a link to the relevant file on GitHub is included, so that this code can easily be looked up.
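The provenance metadata described above can be pictured as a small record. The sketch below is illustrative only: the class `DatasetRecord`, its method and the example values are hypothetical, and 4CAT stores this information in a database record rather than in a class like this one.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical stand-in for the metadata stored alongside a data file."""

    software_version: str  # git commit ID checked out when the dataset was made
    software_file: str     # Python file responsible for creating the dataset
    parameters: dict = field(default_factory=dict)  # user-provided input

    def source_reference(self):
        """Return 'file@commit', enough to look up the exact code that ran."""
        return f"{self.software_file}@{self.software_version}"


# Example values are made up for illustration.
record = DatasetRecord(
    software_version="0123abc",
    software_file="processors/example.py",
    parameters={"query": "cats"},
)
reference = record.source_reference()
```

Keeping the commit ID and the file path together is what allows a link to the exact code version to be generated.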
Processors are the code that turns one dataset (or, in the case of data sources, some input parameters) into another. They are Python files that use a 4CAT Python API: each file defines a set of class properties and required methods, is detected by the 4CAT backend, and is then made available in the interface. Importantly, processor files indicate what type of dataset they can be used on. The type of a dataset is determined by the processor that produces it. So a 'hashtag co-tag network' processor can only be used on the output of the 'instagram search' processor; the 'create RankFlow diagram' processor works on the output of both the 'top usernames per month' and 'most-occurring words per month' processors, and so on. Alternatively (this is work in progress), processors can indicate their compatibility on a schema level, i.e. they are made available for CSV files that contain a particular set of columns.
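The two compatibility mechanisms described above (by dataset type, or by the columns a file contains) could be sketched as follows. All class, attribute and method names here are hypothetical and chosen for illustration; they do not reproduce the actual 4CAT processor API.

```python
class Dataset:
    """Minimal stand-in: a dataset has a type and a set of columns."""

    def __init__(self, type, columns):
        self.type = type  # determined by the processor that produced it
        self.columns = set(columns)


class Processor:
    """Hypothetical base class; subclasses declare what they can run on."""

    type = "example-processor"  # type given to datasets this processor makes
    accepts_types = ()          # dataset types this processor can be used on
    requires_columns = ()       # alternative: schema-level compatibility

    @classmethod
    def can_process(cls, dataset):
        # Type-level check: was the dataset made by an accepted processor?
        if dataset.type in cls.accepts_types:
            return True
        # Schema-level check: does the dataset contain all required columns?
        required = set(cls.requires_columns)
        return bool(required) and required <= dataset.columns


class HashtagNetwork(Processor):
    type = "hashtag-network"
    accepts_types = ("instagram-search",)


class ThreadCounter(Processor):
    type = "thread-counter"
    requires_columns = ("thread_id", "timestamp")


instagram = Dataset("instagram-search", ["id", "thread_id", "timestamp", "body"])
reddit = Dataset("reddit-search", ["id", "thread_id", "timestamp", "body"])
```

In this sketch `HashtagNetwork.can_process(reddit)` is false because the type does not match, while `ThreadCounter.can_process(reddit)` is true because the required columns are present.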
There are three 'special' types of processors.
- The first - search processors - belong to data sources, and do not require an input dataset; these produce the initial 'root' dataset. Example.
- The second - filter processors - have the distinction of producing new 'root' datasets as output. This has the advantage of making them available via a distinct page in the web interface, which makes sense in certain cases (one application is filtering a dataset for only items containing a particular word, giving a new, filtered dataset). Example.
- The third - presets - define a 'chain' of processors to be run in sequence. This type of processor does not do anything itself, but instead provides a 'recipe'; for example, first break each item in the dataset into separate tokens; then count how often each individual token occurs per month; then visualise the frequencies as a histogram. Each of these steps is done by a separate processor. The preset makes it easy to provide established research protocols to users without obscuring what happens within that protocol; the outcome of each intermediate step is available in full detail. Example.
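The preset idea can be sketched as a list of steps that are run in order while every intermediate result is kept. The example below covers the first two steps of the recipe mentioned above (tokenising, then counting tokens per month) and leaves out the visualisation. The function names, the whitespace tokeniser and the ISO-style timestamps are assumptions made for this sketch, not 4CAT's implementation.

```python
from collections import Counter


def tokenise(items):
    """Break each item's body into lowercase tokens, keeping its timestamp."""
    return [
        {"timestamp": item["timestamp"], "token": token}
        for item in items
        for token in item["body"].lower().split()
    ]


def count_per_month(tokens):
    """Count how often each token occurs per month (YYYY-MM prefix assumed)."""
    counts = Counter((t["timestamp"][:7], t["token"]) for t in tokens)
    return [
        {"month": month, "token": token, "count": n}
        for (month, token), n in sorted(counts.items())
    ]


def run_preset(steps, dataset):
    """Run steps in sequence; return every intermediate result, not just the last."""
    results = []
    for step in steps:
        dataset = step(dataset)
        results.append(dataset)
    return results


# Made-up items for illustration.
items = [
    {"timestamp": "2020-01-05", "body": "cats like cats"},
    {"timestamp": "2020-02-01", "body": "Cats"},
]
results = run_preset([tokenise, count_per_month], items)
```

Here `results[0]` holds the tokens and `results[1]` the monthly counts, which mirrors the point that the outcome of each intermediate step stays available.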
A data source defines a special type of processor that is used to create an initial dataset. The resulting dataset can be processed further, potentially again and again; which processors are available at each step depends on the type of the data. This way each initial dataset branches out into a potentially infinite tree of derivative datasets.
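The branching described here amounts to a tree in which every dataset knows its parent. A minimal sketch, assuming a hypothetical `DatasetNode` class that is not part of 4CAT:

```python
class DatasetNode:
    """Hypothetical node in the tree of datasets derived from one root."""

    def __init__(self, type, parent=None):
        self.type = type
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def lineage(self):
        """Return the chain of dataset types from the root down to this node."""
        node, chain = self, []
        while node is not None:
            chain.append(node.type)
            node = node.parent
        return list(reversed(chain))


# Illustrative type names only.
root = DatasetNode("search")
tokens = DatasetNode("tokenise", parent=root)
counts = DatasetNode("count-per-month", parent=tokens)
network = DatasetNode("hashtag-network", parent=root)
path = counts.lineage()
```

The root has two children here, so the same initial dataset feeds two independent branches of analysis.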
Importantly, data sources and processors are made to be as modular as possible. Apart from the columns that are mandatory for each initial dataset, both data sources and processors are in principle free to decide what data to include in the datasets they produce. As such they can be tailored to the research object or interest at hand; some processors may only make sense for a given type of data, e.g. those dealing with hashtags, which do not appear on all platforms.
🐈🐈🐈🐈