Available processors

Dale Wahl edited this page Aug 9, 2021 · 12 revisions

Below is a list of available processors. Depending on when this page was last updated, it may not reflect the current state of 4CAT.

Some of these processors are tried-and-tested, others are more experimental. Always check the code and data if the results seem fishy!

You will only find short descriptions here. Inspect the code of the processors if you want to know more; we try our best to add as many comments as possible.

Let us know if you have any processors to add to this list.

| Name | Description | Accepts |
| --- | --- | --- |
| Attribute frequencies | Count frequencies for a given post attribute and aggregate the results, sorted by most-occurring value. Optionally results may be counted per period. | All top-level datasets |
| Bipartite Author-tag Network | Produces a bipartite graph based on co-occurrence of (hash)tags and people. If someone wrote a post with a certain tag, there will be a link between that person and the tag. The more often they appear together, the stronger the link. Tag nodes are weighted by how often they occur. User nodes are weighted by how many posts they've made. | All top-level csv datasets with a column for hashtags (tags, hashtags, groups) |
| Chart diachronic nearest neighbours | Visualise nearest neighbours of a given query across all models and show the closest neighbours per model in one combined graph. Based on the 'HistWords' algorithm by Hamilton et al. | Generate Word Embedding Models processor results |
| Co-column network | Create a Gephi-compatible network comprised of co-occurring values of two columns of the source file. For all items in the dataset, an edge is created between the values of the two columns, if they are not empty. Nodes and edges are weighted by frequency. | All csv and ndjson datasets |
| Co-tag network | Create a Gephi-compatible network comprised of all tags appearing in the dataset, with edges between all tags used together on an item. Edges are weighted by the number of co-tag occurrences; nodes are weighted by the frequency of the tag. | All top-level csv datasets with a column for hashtags (tags, hashtags, groups) |
| Co-word network | Create a Gephi-compatible network comprised of co-words, with edges between words that appear close to each other. Edges and nodes are weighted by the number of co-word occurrences. | Word collocations processor results |
| Convert to CSV (Google Vision) | Convert Google Vision API output to a simplified CSV file. | Google Vision API Analysis processor results |
| Convert to Excel-compatible CSV | Converts a csv file to a Microsoft Excel-compatible csv file. | All csv datasets |
| Convert to JSON | Converts a csv file to a JSON file. | All csv datasets |
| Convert to TCAT JSON | Convert an NDJSON Twitter file to TCAT JSON format. Can be imported with TCAT's import-jsondump.php script. | Top-level Twitter datasets |
| Count posts | Counts how many posts are in the dataset overall or per timeframe. | All top-level datasets |
| Debate metrics | Returns a csv with meta-metrics on 'debate', like the number of replies and long messages, per thread. | All top-level 4chan, 8chan, and 8kun datasets |
| Download images | Download top images and compress as a zip file. May take a while to complete as images are sourced externally. Note that not all images can always be retrieved. For imgur galleries, only the first image is saved. For imgur gifv files, only the first frame is saved. Use the "Add download status" option to see which downloads succeeded. | Top images processor results |
| Download YouTube thumbnails | Download YouTube video thumbnails. | YouTube URL metadata processor results |
| Expand shortened URLs | Expand shortened URLs. Replaces any URL in the dataset that is recognised as a shortened URL with the URL it redirects to. URLs are resolved recursively up to a depth of 5 links. This can take a long time for large datasets, and it is not recommended to run this processor on datasets larger than 10,000 items. | All top-level csv datasets |
| Extract named entities | Get the prediction of various named entities from a text, ranked by frequency. Be sure to have selected "Named Entity Recognition" in the Linguistic Features processor. Currently only available for datasets with fewer than 25,000 items. | Linguistic features processor results |
| Extract nouns | Get the prediction of nouns from your text corpus, as annotated by spaCy's part-of-speech tagging. Make sure to have selected "Part of Speech" in the Linguistic Features processor, as well as "Dependency parsing" if you want to extract compound nouns. The output is a csv with the most-used nouns ranked. | Linguistic features processor results |
| Filter by column | Copies a dataset, retaining only posts where the chosen 'column' (attribute) matches in the configured way. This creates a new, separate dataset you can run analyses on. | All top-level datasets |
| Filter by lexicon | Copies a dataset, retaining only posts that match any selected lexicon of words or phrases. This creates a new, separate dataset you can run analyses on. | All top-level datasets |
| Filter for unique posts | Retain only posts with unique post bodies. Only keeps the first occurrence of a text. Useful for filtering spam. This creates a new, separate dataset you can run analyses on. | All top-level datasets |
| Generate topic models | Creates topic models per token set using Latent Dirichlet Allocation (LDA). For a given number of topics, tokens are assigned a relevance weight per topic, which can be used to find clusters of related words. | Tokenise processor results |
| Generate Word Embedding Models | Generates Word2Vec or FastText word embedding models for the sentences, per chosen time interval. These can then be used to analyse semantic word associations within the corpus. Note that good models require large(r) datasets. | Tokenise processor results |
| Google Vision API analysis | Use the Google Vision API to annotate images with tags and labels identified via machine learning. One request will be made per image per annotation type. Note that this is NOT a free service and requests will be billed by Google to the owner of the API token you provide! | Download images processor results |
| Google Vision API Label network | Create a Gephi-compatible network comprised of all annotations returned for a set of images by the Google Vision API. Labels returned by the API are nodes; labels occurring on the same image form edges, weighted by the number of co-tag occurrences. | Google Vision analysis processor results |
| Hatebase analysis | Analyse all posts' content with Hatebase, assigning a score for 'offensiveness' and a probability that the post contains hate speech. | All top-level datasets |
| Histogram | Generates a histogram (bar graph) from a previous frequency analysis. | All rankable datasets |
| Image wall | Put all images in an archive into a single combined image, optionally sorting and resizing them. | Download images processor results |
| Interactive Flowchart | Create a flow chart of elements over time. | All rankable datasets |
| Linguistic features | Annotate your text with a variety of linguistic features, including part-of-speech tagging, dependency parsing, and named entity recognition. Subsequent modules can add identified tags and nouns to the original data file. Uses the spaCy library and the en_core_web_sm model. Currently only available for datasets with fewer than 100,000 items. | All top-level datasets |
| Merge post texts | Merges the body column of a dataset into one plain text string. The result can be used for word clouds, word trees, etc. | All top-level datasets |
| Over-time offensiveness trend | Shows activity, engagement (e.g. views or score) and offensiveness trends over time. Offensiveness is measured as the number of words listed on Hatebase that occur in the dataset. | All top-level Telegram, Instagram and Reddit datasets |
| Over-time vocabulary prevalence | Determines the presence over time of a particular vocabulary in the dataset. Counts how many posts match at least one word in the provided vocabularies. | All top-level datasets |
| RankFlow diagram | Create a diagram showing changes in prevalence over time for sequential ranked lists (following Bernhard Rieder's RankFlow grapher). | All rankable datasets |
| Remove author information | Anonymises a dataset by hashing/encoding content of any column starting with 'author' (e.g., every occurrence of "John Smith" becomes "bf2fe1ba8fd013d2ca03eba5449d4bec"). | All top-level csv datasets |
| Semantic frames | Extract semantic frames from text. This connects to the VUB's PENELOPE API to extract causal frames from the text using the framework developed by the Evolutionary and Hybrid AI (EHAI) group. | Sentence split processor results |
| Sentence split | Split a body of posts into discrete sentences. Output file has one row per sentence, containing the sentence and post ID. | All top-level datasets |
| Side-by-side graphs | Generate area graphs showing prevalence per item over time and project these side-by-side on an isometric plane for easy comparison. | All rankable datasets |
| Sigma js network | Visualise a network in the browser with sigma js. | All gdf datasets |
| Similar words | Uses a Word2Vec model to find words used in a similar context. | Generate Word Embedding Models processor results |
| Sort by most quoted | Sort posts by how often they were quoted by other posts in the data set. Post IDs may be correlated and triangulated with the full results set. | All top-level 4chan, 8chan, and 8kun datasets |
| Split by thread | Splits an output on the basis of different thread IDs. Threads are stored in different csv files and zipped in one archive. | All csv datasets with a thread_id column |
| Tf-idf | Get the tf-idf values of tokenised text. Works better with more documents (e.g. day-separated). Can use Gensim's or sklearn's tf-idf modules. | Tokenise processor results |
| Thread metadata | Create an overview of the threads present in the dataset, containing thread IDs, subjects and post counts. | All top-level datasets |
| Tokenise | Tokenises post bodies, producing corpus data that may be used for further processing by e.g. NLP. The output is a serialized list of lists, each list representing either all tokens in a post or all tokens in a sentence in a post. | All top-level datasets |
| Top hateful phrases | Count frequencies for hateful words and phrases found in the dataset and aggregate the results, sorted by most-occurring value. Optionally results may be counted per period. | Hatebase analysis processor results |
| Top images | Collect all images used in the data set (either appearing as URLs or in an image column), and sort by most used. Contains URLs through which the images may be downloaded. | All top-level datasets |
| Top vectors | Ranks most used tokens per token set. Reveals most-used words and/or most-used vernacular per time period. Limited to the 100 most-used tokens. | Vectorise tokens processor results |
| Top words per topic | Creates a CSV file with the top tokens (words) per topic in the generated topic model, and their associated weights. | Generate topic models processor results |
| Update Reddit post scores | Updates the scores for Reddit posts and comments to more accurately reflect the real score, since the scores from Pushshift can be inaccurate. Can only be used on datasets with fewer than 5,000 posts due to the heavy usage of the API this requires. | All top-level Reddit datasets |
| URL co-link network | Create a Gephi-compatible network comprised of all URLs appearing in a post with at least one other URL. Appearing in the same post constitutes an edge between these nodes. Edges are weighted by the number of co-links. | All top-level datasets |
| Word collocations | Extracts word combinations from a set of tokens. | Tokenise processor results |
| Word tree | Generates a word tree for a given query, a "graphical version of the traditional 'keyword-in-context' method" (Wattenberg & Viégas, 2008). | All top-level datasets |
| YouTube thumbnails image wall | Make an image wall from YouTube video thumbnails. | Download YouTube thumbnail processor results |
| YouTube URL metadata | Extract information from YouTube links to videos and channels mentioned in the dataset. | All top-level datasets |
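
To make the weighting in the "Co-tag network" description more concrete, here is a minimal sketch of how such weights could be computed. It is not 4CAT's actual code: the function name `co_tag_network`, the input shape (one list of tags per item) and the normalisation step (lowercasing, counting a tag once per item) are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def co_tag_network(items):
    """Build node and edge weights from an iterable of per-item tag lists.

    Hypothetical helper for illustration, not part of 4CAT.
    """
    node_weights = Counter()  # tag -> number of items the tag appears on
    edge_weights = Counter()  # (tag_a, tag_b) -> number of items carrying both

    for tags in items:
        # Assumption: normalise and deduplicate, so a tag repeated on one
        # item is counted only once for that item.
        unique_tags = sorted({tag.strip().lower() for tag in tags if tag.strip()})
        node_weights.update(unique_tags)
        # Sorted input gives every undirected pair one canonical orientation.
        edge_weights.update(combinations(unique_tags, 2))

    return node_weights, edge_weights


# Made-up example data: three items with their tags.
posts = [
    ["cats", "memes"],
    ["Cats", "memes", "news"],
    ["news"],
]
nodes, edges = co_tag_network(posts)
print(nodes["cats"], edges[("cats", "memes")])
```

The two counters map directly onto a node list and an edge list, which is the kind of data a Gephi-compatible file needs; writing the file itself is left out here.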
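
The idea behind "Remove author information" (replace the value of every column whose name starts with 'author' by a hash, so the same author always maps to the same opaque string) can be sketched as follows. This is an illustration under stated assumptions, not the processor's implementation: the choice of SHA-256, the optional `salt` parameter and the function name `pseudonymise_authors` are hypothetical, and the output will not match the example hash given in the table.

```python
import hashlib


def pseudonymise_authors(rows, salt=""):
    """Return copies of the rows with every 'author*' column hashed.

    Hypothetical helper; SHA-256 and the salt are assumptions.
    """
    cleaned = []
    for row in rows:
        new_row = {}
        for column, value in row.items():
            if column.startswith("author") and value:
                # Identical inputs yield identical digests, so posts by the
                # same author stay linkable after hashing.
                digest = hashlib.sha256((salt + str(value)).encode("utf-8"))
                new_row[column] = digest.hexdigest()
            else:
                new_row[column] = value
        cleaned.append(new_row)
    return cleaned


# Made-up example rows.
rows = [
    {"author": "John Smith", "author_id": "42", "body": "first post"},
    {"author": "John Smith", "author_id": "42", "body": "second post"},
]
hashed = pseudonymise_authors(rows, salt="example-salt")
print(hashed[0]["author"] == hashed[1]["author"])
```

Note that an unsalted hash of a known username can be reversed by simply hashing candidate names, which is why the sketch accepts a salt.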