Available processors
Sal Hagen edited this page Aug 6, 2021 · 12 revisions
Below is a list of available processors. Depending on when this page was last updated, it may not reflect the current state of 4CAT.
Some of these processors are tried-and-tested, others are more experimental. Always check the code and data if the results seem fishy!
This page only gives short descriptions. Make sure to inspect the code of the processors if you want to know more - we try our best to add as many comments as possible.
Let us know if you have any processors to add to this list.
Name | Description | Accepts |
---|---|---|
Attribute frequencies | Count frequencies for a given post attribute and aggregate the results, sorted by most-occurring value. Optionally results may be counted per period. | All top-level datasets |
Bipartite Author-tag Network | Produces a bipartite graph based on co-occurrence of (hash)tags and people. If someone wrote a post with a certain tag, there will be a link between that person and the tag. The more often they appear together, the stronger the link. Tag nodes are weighted by how often they occur. User nodes are weighted by how many posts they've made. | All top-level csv datasets with a column for hashtags (tags, hashtags, groups) |
Chart diachronic nearest neighbours | Visualise nearest neighbours of a given query across all models and show the closest neighbours per model in one combined graph. Based on the 'HistWords' algorithm by Hamilton et al. | Generate Word Embedding Models processor results |
Co-column network | Create a Gephi-compatible network comprised of co-occurring values of two columns of the source file. For all items in the dataset, an edge is created between the values of the two columns, if they are not empty. Nodes and edges are weighted by frequency. | All csv and ndjson datasets |
Co-tag network | Create a Gephi-compatible network comprised of all tags appearing in the dataset, with edges between all tags used together on an item. Edges are weighted by the amount of co-tag occurrences; nodes are weighted by the frequency of the tag. | All top-level csv datasets with a column for hashtags (tags, hashtags, groups) |
Co-word network | Create a Gephi-compatible network comprised of co-words, with edges between words that appear close to each other. Edges and nodes are weighted by the amount of co-word occurrences. | Word collocations processor results |
Convert to CSV (Google Vision) | Convert Google Vision API output to a simplified CSV file. | Google Vision API Analysis processor results |
Convert to Excel-compatible CSV | Converts a csv file to a Microsoft Excel-compatible csv file. | All csv datasets |
Convert to JSON | Converts a csv file to a JSON file. | All csv datasets |
Convert to TCAT JSON | Convert a NDJSON Twitter file to TCAT JSON format. Can be imported with TCAT's import-jsondump.php script. | Top-level Twitter datasets |
Count posts | Counts how many posts are in the dataset overall or per timeframe. | All top-level datasets |
Debate metrics | Returns a csv with meta-metrics on 'debate', like the amount of replies and long messages, per thread. | All top-level 4chan, 8chan, and 8kun datasets. |
Download images | Download top images and compress as a zip file. May take a while to complete as images are sourced externally. Note that not all images can always be retrieved. For imgur galleries, only the first image is saved. For imgur gifv files, only the first frame is saved. Use the "Add download status" option to see which downloads succeeded. | Top images processor results |
Download YouTube thumbnails | Download YouTube video thumbnails. | YouTube URL metadata processor results |
Expand shortened URLs | Expand shortened URLs. Replaces any URL in the dataset that is recognised as a shortened URL with the URL it redirects to. URLs are resolved recursively up to a depth of 5 links. This can take a long time for large datasets, and it is not recommended to run this processor on datasets larger than 10,000. | All top-level csv datasets |
Extract named entities | Get the prediction of various named entities from a text, ranked by frequency. Be sure to have selected "Named Entity Recognition" in the Linguistic Features processor. Currently only available for datasets with fewer than 25,000 items. | Linguistic features processor results |
Extract nouns | Get the prediction of nouns from your text corpus, as annotated by SpaCy's part-of-speech tagging. Make sure to have selected "Part of Speech" in the Linguistic Features processor, as well as "Dependency parsing" if you want to extract compound nouns. The output is a csv with the most-used nouns ranked. | Linguistic features processor results |
Filter by column | Copies a dataset, retaining only posts where the chosen 'column' (attribute) matches in the configured way. This creates a new, separate dataset you can run analyses on. | All top-level datasets |
Filter by lexicon | Copies a dataset, retaining only posts that match any selected lexicon of words or phrases. This creates a new, separate dataset you can run analyses on. | All top-level datasets |
Filter for unique posts | Retain only posts with unique post bodies. Only keeps the first encounter of a text. Useful for filtering spam. This creates a new, separate dataset you can run analyses on. | All top-level datasets |
Generate topic models | Creates topic models per token set using Latent Dirichlet Allocation (LDA). For a given number of topics, tokens are assigned a relevance weight per topic, which can be used to find clusters of related words. | Tokenise processor results |
Generate Word Embedding Models | Generates Word2Vec or FastText word embedding models for the sentences, per chosen time interval. These can then be used to analyse semantic word associations within the corpus. Note that good models require large(r) datasets. | Tokenise processor results |
Google Vision API analysis | Use the Google Vision API to annotate images with tags and labels identified via machine learning. One request will be made per image per annotation type. Note that this is NOT a free service and requests will be credited by Google to the owner of the API token you provide! | Download images processor results |
Google Vision API Label network | Create a Gephi-compatible network comprised of all annotations returned for a set of images by the Google Vision API. Labels returned by the API are nodes; labels occurring on the same image form edges, weighted by the amount of co-tag occurrences. | Google Vision analysis processor results |
Hatebase analysis | Analyse all posts' content with Hatebase, assigning a score for 'offensiveness' and a probability that the post contains hate speech. | All top-level datasets |
Histogram | Generates a histogram (bar graph) from a previous frequency analysis. | All rankable datasets |
Image wall | Put all images in an archive into a single combined image, optionally sorting and resizing them. | Download images processor results |
Interactive Flowchart | Create a flow chart of elements over time. | All rankable datasets |
Linguistic features | Annotate your text with a variety of linguistic features, including part-of-speech tagging, dependency parsing, and named entity recognition. Subsequent modules can add identified tags and nouns to the original data file. Uses the SpaCy library and the en_core_web_sm model. Currently only available for datasets with fewer than 100,000 items. | All top-level datasets |
Merge post texts | Merges the body column of a dataset into one plain text string. The result can be used for word clouds, word trees, etc. | All top-level datasets |
Over-time offensiveness trend | Shows activity, engagement (e.g. views or score) and offensiveness trends over time. Offensiveness is measured as the number of words listed on Hatebase that occur in the dataset. | All top-level Telegram, Instagram and Reddit datasets. |
Over-time vocabulary prevalence | Determines the presence over time of a particular vocabulary in the dataset. Counts how many posts match at least one word in the provided vocabularies. | All top-level datasets |
RankFlow diagram | Create a diagram showing changes in prevalence over time for sequential ranked lists (following Bernhard Rieder's RankFlow grapher). | All rankable datasets |
Remove author information | Anonymises a dataset by removing content of any column starting with 'author'. | All top-level csv datasets |
Semantic frames | Extract semantic frames from text. This connects to the VUB's PENELOPE API to extract causal frames from the text using the framework developed by the Evolutionary and Hybrid AI (EHAI) group. | Sentence split processor results |
Sentence split | Split a body of posts into discrete sentences. The output file has one row per sentence, containing the sentence and the post ID. | All top-level datasets |
Side-by-side graphs | Generate area graphs showing prevalence per item over time and project these side-by-side on an isometric plane for easy comparison. | All rankable datasets |
Sigma js network | Visualise a network in the browser with sigma js. | All gdf datasets |
Similar words | Uses a Word2Vec model to find words used in a similar context. | Generate Word Embedding Models processor results |
Sort by most quoted | Sort posts by how often they were quoted by other posts in the data set. Post IDs may be correlated and triangulated with the full results set. | All top-level 4chan, 8chan, and 8kun datasets. |
Split by thread | Splits an output on the basis of different thread IDs. Threads are stored in different csv files and zipped in one archive. | All csv datasets with a thread_id column |
Tf-idf | Get the tf-idf values of tokenised text. Works better with more documents (e.g. day-separated). Can use Gensim's or sklearn's tf-idf modules. | Tokenise processor results |
Thread metadata | Create an overview of the threads present in the dataset, containing thread IDs, subjects and post counts. | All top-level datasets |
Tokenise | Tokenises post bodies, producing corpus data that may be used for further processing by e.g. NLP. The output is a serialized list of lists, each list representing either all tokens in a post or all tokens in a sentence in a post. | All top-level datasets |
Top hateful phrases | Count frequencies for hateful words and phrases found in the dataset and aggregate the results, sorted by most-occurring value. Optionally results may be counted per period. | Hatebase analysis processor results |
Top images | Collect all images used in the data set (either appearing as URLs or in an image column), and sort by most used. Contains URLs through which the images may be downloaded. | All top-level datasets |
Top vectors | Ranks most used tokens per token set. Reveals most-used words and/or most-used vernacular per time period. Limited to 100 most-used tokens. | Vectorise tokens processor results |
Top words per topic | Creates a CSV file with the top tokens (words) per topic in the generated topic model, and their associated weights. | Generate topic models processor results |
Update Reddit post scores | Updates the scores for Reddit posts and comments to more accurately reflect the real score - the scores from Pushshift can be inaccurate. Can only be used on datasets with < 5,000 posts due to the heavy usage of the API this requires. | All top-level Reddit datasets |
URL co-link network | Create a Gephi-compatible network comprised of all URLs appearing in a post with at least one other URL. Appearing in the same post constitutes an edge between these nodes. Edges are weighted by amount of co-links. | All top-level datasets |
Word collocations | Extracts word combinations from a set of tokens. | Tokenise processor results |
Word tree | Generates a word tree for a given query, a "graphical version of the traditional 'keyword-in-context' method" (Wattenberg & Viégas, 2008). | All top-level datasets |
YouTube thumbnails image wall | Make an image wall from YouTube video thumbnails. | Download YouTube thumbnails processor results |
YouTube URL metadata | Extract information from YouTube links to videos and channels mentioned in the dataset. | All top-level datasets |
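
To make the weighting described for the Co-tag network processor more concrete, here is a minimal Python sketch of the idea. It is not 4CAT's actual implementation: the function name `co_tag_network`, the sample data, and the choices to lower-case tags and count each tag once per item are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def co_tag_network(items):
    """Build node and edge weights from an iterable of per-item tag lists.

    Hypothetical helper for illustration; not part of 4CAT.
    Node weight = number of items a tag appears on.
    Edge weight = number of items on which two tags appear together.
    """
    nodes = Counter()
    edges = Counter()
    for tags in items:
        # Assumption: normalise case and count each tag once per item.
        unique = sorted({tag.strip().lower() for tag in tags if tag.strip()})
        nodes.update(unique)
        # Sorting first means each undirected pair always gets the same key.
        edges.update(combinations(unique, 2))
    return nodes, edges


# Made-up sample data: one list of tags per post.
posts = [
    ["cats", "memes"],
    ["Cats", "memes", "dogs"],
    ["dogs"],
]

nodes, edges = co_tag_network(posts)
for (source, target), weight in sorted(edges.items()):
    print(source, target, weight)
```

The resulting counters can be written out in whatever network format is needed; the Co-column and URL co-link processors describe the same pattern with different sources for the node values.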
🐈🐈🐈🐈