Available processors
Stijn Peeters edited this page Nov 10, 2022
·
12 revisions
Below is a list of available processors. Depending on when this page was last updated, it may not reflect the current state of 4CAT.
Some of these processors are tried-and-tested, others are more experimental. Always check the code and data if the results seem fishy!
Only short descriptions are given here. Inspect the code of the processors if you want to know more; we try our best to add as many comments as possible.
Let us know if you have any processors to add to this list.
Name | Description | Usage |
---|---|---|
Annotate images with Google Vision API | Use the Google Vision API to extract labels detected in the most-linked images from the dataset. Note that this is a paid service and will count towards your API credit. | Requires entering an API key |
Monthly histogram | Generates a histogram with the number of posts per month. | |
Extract neologisms | Retrieve uncommon terms by deleting all known words. Assumes English-language data. Uses stopwords-iso as its stopword filter. | |
Find similar words | Uses Word2Vec models (Mikolov et al.) to find words used in a similar context as the queried word(s). Note that this will usually not give useful results for small (<100,000 items) datasets. | |
Upload to DMI-TCAT | Convert the dataset to a TCAT-compatible format and upload it to an available TCAT server. | Available TCAT servers to be configured by instance admin |
Name | Description | Usage |
---|---|---|
Convert to JSON | Change a CSV file to a JSON file | |
Convert to Excel-compatible CSV | Change a CSV file so it works with Microsoft Excel. | |
Convert NDJSON file to CSV | Change a NDJSON file to a CSV file. | |
Convert to TCAT JSON | Convert a Twitter dataset to a TCAT-compatible format. This file can then be uploaded to TCAT. | |
Convert Vision results to CSV | Convert the Vision API output to a simplified CSV file. | |
Merge datasets | Merge this dataset with another dataset of the same type. A new, third dataset is created containing items from both original datasets. | |
Split by thread | Split the dataset per thread. The result is a zip archive containing separate CSV files. | |
Merge texts | Merges the data from the body column into a single text file. The result can be used for word clouds, word trees, etc. | |
Upload to DMI-TCAT | Send a TCAT-ready JSON file to a particular DMI-TCAT server. | Available TCAT servers to be configured by instance admin |
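
As an illustration of what a conversion such as "Convert NDJSON file to CSV" involves, the sketch below turns NDJSON text (one JSON object per line) into CSV text using only the Python standard library. It is a simplified example, not the processor's actual code: the function name `ndjson_to_csv` is hypothetical, and nested objects are not flattened.

```python
import csv
import io
import json


def ndjson_to_csv(ndjson_text):
    """Convert NDJSON text (one JSON object per line) to CSV text."""
    rows = [json.loads(line) for line in ndjson_text.splitlines() if line.strip()]

    # Collect column names in order of first appearance across all rows
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    output = io.StringIO()
    # restval fills in an empty cell when a row lacks one of the columns
    writer = csv.DictWriter(output, fieldnames=fieldnames, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return output.getvalue()
```

Because items in an NDJSON file need not share the same keys, the column list is built from all rows before anything is written.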
Name | Description | Usage |
---|---|---|
Download YouTube thumbnails | Downloads the thumbnails of YouTube videos and stores them in a zip archive. | |
Name | Description | Usage |
---|---|---|
Replace or transliterate accented and non-Latin characters | Replaces non-Latin characters with the closest ASCII equivalent, converting e.g. 'á' to 'a', 'ç' to 'c', et cetera. Creates a new dataset. | |
Remove author information | Anonymises a dataset by removing content of any column starting with 'author' | |
Filter by value | A generic filter that checks whether a value in a selected column matches a custom requirement. This will create a new dataset. | |
Filter by date | Retains posts between given dates. This will create a new dataset. | |
Expand shortened URLs | Replaces any URL in the dataset's 'body' field that is recognised as a shortened URL with the URL it redirects to. URLs are followed up to a depth of 5 links. This can take a long time for large datasets, and it is not recommended to run this processor on datasets larger than 10,000 items. This creates a new dataset with expanded URLs in place of redirects. | |
Update Reddit scores | Updates the scores for each post and comment to more accurately reflect the real score. Can only be used on datasets with < 5,000 posts due to the heavy usage of the Reddit API. | Requires server admin to provide a Reddit API key |
Filter by words or phrases | Retains posts that contain selected words or phrases, including preset word lists. This creates a new dataset. | |
Random sample | Retain a pseudorandom set of posts. This creates a new dataset. | |
Filter for unique posts | Retain posts with a unique body text. Only keeps the first encounter of a text. Useful for filtering spam. This creates a new dataset. | |
Filter by wildcard | Retains only posts that contain certain words or phrases. Input may contain a wildcard *, which matches all text in between. This creates a new dataset. | |
Write annotations | Writes annotations from the Explorer to the dataset. Each input field will get a column. This creates a new dataset. | Cannot be called directly; called via the Explorer feature |
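
The wildcard matching described for "Filter by wildcard" above can be approximated with a regular expression. The following is a minimal sketch under stated assumptions, not the processor's implementation: the function names are hypothetical, matching is case-insensitive, and posts are plain dictionaries with a `body` key.

```python
import re


def wildcard_to_regex(pattern):
    """Compile a phrase with '*' wildcards into a case-insensitive regex."""
    # Escape the literal parts so characters like '.' or '?' are matched as-is,
    # then let each '*' match any text in between
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(parts), re.IGNORECASE | re.DOTALL)


def filter_by_wildcard(posts, pattern, column="body"):
    """Return the posts whose text in `column` matches the wildcard pattern."""
    regex = wildcard_to_regex(pattern)
    return [post for post in posts if regex.search(post.get(column) or "")]
```

For example, the pattern `climate*real` keeps a post reading "Climate change is real" but drops one reading "change the climate". As with the filters in the table, the input list is left untouched and a new list is returned.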
Name | Description | Usage |
---|---|---|
Custom network | Create a GEXF network file comprised of linked values between a custom set of columns (e.g. 'author' and 'subreddit'). Nodes and edges are weighted by frequency. | |
Bipartite Author-tag Network | Produces a bipartite graph based on co-occurrence of (hash)tags and people. If someone wrote a post with a certain tag, there will be a link between that person and the tag. The more often they appear together, the stronger the link. Tag nodes are weighted by how often they occur. User nodes are weighted by how many posts they've made. | |
Co-tag network | Create a GEXF network file of tags co-occurring in a post. Edges are weighted by the number of tag co-occurrences; nodes are weighted by how often the tag appears in the dataset. | |
Co-word network | Create a GEXF network file of word co-occurrences. Edges denote words that appear close to each other. Edges and nodes are weighted by the number of co-word occurrences. | |
Reply network | Create a GEXF network file of posts replying to each other. Each reference to another post creates an edge between posts. | Only available for data sources where replying is a feature |
URL co-occurence network | Create a GEXF network file comprised of URLs appearing together (in a post or thread). Edges are weighted by amount of co-links. | |
Google Vision API Label network | Create a GEXF network file comprised of all annotations returned by the Google Vision API. Labels returned by the API are nodes. Labels occurring on the same image are edges. | Requires API key |
Wikipedia category network | Create a GEXF network file comprised of linked-to Wikipedia pages, linked to the categories they are part of. English Wikipedia only. Will only fetch the first 10,000 links. | Slow! |
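
The weighting described for the co-tag network above can be sketched in a few lines of Python. This is a minimal illustration and not 4CAT's own implementation: the function name `co_tag_weights` and the input format (one list of tags per post) are assumptions, and each tag is counted at most once per post.

```python
from collections import Counter
from itertools import combinations


def co_tag_weights(posts_tags):
    """Count node and edge weights for a co-tag network.

    posts_tags is a hypothetical input format: one list of tags per post.
    """
    node_weights = Counter()  # number of posts each tag appears in
    edge_weights = Counter()  # number of posts in which two tags co-occur

    for tags in posts_tags:
        # Deduplicate and sort so that (a, b) and (b, a) count as one edge
        unique_tags = sorted(set(tags))
        node_weights.update(unique_tags)
        edge_weights.update(combinations(unique_tags, 2))

    return node_weights, edge_weights
```

The resulting counters could then be added to a graph as node and edge attributes and written to a GEXF file, for example with the `networkx` library's `write_gexf` function.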
Name | Description | Usage |
---|---|---|
Count values | Count values in a dataset column, like URLs or hashtags (overall or per timeframe) | |
Count posts | Counts how many posts are in the dataset (overall or per timeframe). | |
Google Vision API Analysis | Use the Google Vision API to annotate images with tags and labels identified via machine learning. One request will be made per image per annotation type. Note that this is NOT a free service and requests will be credited by Google to the owner of the API token you provide! | |
Hatebase analysis | Assign scores for 'offensiveness' and hate speech probability to each post by using Hatebase. | Uses included Hatebase lexicon (which has limitations) |
Extract top hateful phrases | Count frequencies for hateful words and phrases found in the dataset and rank the results (overall or per timeframe). | Uses included Hatebase lexicon (which has limitations) |
Over-time offensivess trend | Extracts offensiveness trends over time. Offensiveness is measured as the number of words listed on Hatebase that occur in the dataset. Also includes engagement metrics. | Uses included Hatebase lexicon (which has limitations) |
Over-time word counts | Determines the counts over time of a particular set of words or phrases. | |
Sort by most replied-to | Sort posts by how often they were replied to by other posts in the dataset. | |
Extract Text from Images | Uses optical character recognition (OCR) to extract text from images via machine learning. | Requires a separate OCR server, to be configured by 4CAT admin |
Thread metadata | Create an overview of the threads present in the dataset, containing thread IDs, subjects, and post counts. | |
Rank image URLs | Collect all image URLs and sort by most-occurring. | |
Extract top words | Ranks most used tokens per tokenset (overall or per timeframe). Limited to 100 most-used tokens. | |
Extract YouTube metadata | Extract information from YouTube videos and channels linked-to in the dataset. | |
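
Several processors in the table above count "overall or per timeframe". The sketch below illustrates the general idea for monthly counts of the values in one column. It is not 4CAT's code: the function name `count_values_per_month`, the `timestamp` column name and its `YYYY-MM-DD HH:MM:SS` format are assumptions made for the example.

```python
from collections import Counter
from datetime import datetime


def count_values_per_month(items, column):
    """Count how often each value of `column` occurs per calendar month.

    Assumes every item is a dict with a 'timestamp' string formatted as
    'YYYY-MM-DD HH:MM:SS' (a hypothetical format for this example).
    """
    counts = Counter()
    for item in items:
        moment = datetime.strptime(item["timestamp"], "%Y-%m-%d %H:%M:%S")
        month = moment.strftime("%Y-%m")
        counts[(month, item[column])] += 1
    return counts
```

Swapping the `"%Y-%m"` format for `"%Y"` or `"%Y-%m-%d"` would give yearly or daily counts instead.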
Name | Description | Usage |
---|---|---|
Extract co-words | Extracts words appearing close to each other from a set of tokens. | After tokenisation |
Count documents per topic | Uses the LDA model to predict to which topic each item or sentence belongs and counts each as belonging to whichever topic has the highest probability. | After tokenisation |
Post/Topic matrix | Uses the LDA model to predict to which topic each item or sentence belongs and creates a CSV file showing this information. Each line represents one 'document'; if tokens are grouped per 'item' and only one column is used (e.g. only the 'body' column), there is one row per post/item, otherwise a post may be represented by multiple rows (for each sentence and/or column used). | After tokenisation |
Extract nouns | Retrieve nouns detected by SpaCy's part-of-speech tagging, and rank by frequency. Make sure to have selected "Part of Speech" in the previous module, as well as "Dependency parsing" if you want to extract compound nouns or noun chunks. | After SpaCy processing |
Generate word embedding models | Generates Word2Vec or FastText word embedding models (overall or per timeframe). These calculate coordinates (vectors) per word on the basis of their context. The coordinates are positioned in a "vector space" with a large number of dimensions (so a coordinate can e.g. consist of 100 numbers). These numeric word representations can be used to extract words with similar contexts. Note that good models require a lot of data. | After tokenisation |
Extract named entities | Retrieve named entities detected by SpaCy, ranked on frequency. Be sure to have selected "Named Entity Recognition" in the previous module. | |
Annotate text features with SpaCy | Annotate your text with a variety of linguistic features using the SpaCy library, including part-of-speech tagging, dependency parsing, and named entity recognition. Subsequent processors can extract the words labelled by SpaCy (e.g. as a noun or name). Produces a Doc file using the en_core_web_sm model. Currently only available for datasets with fewer than 100,000 items. | |
Semantic frames | Extract semantic frames from text. This connects to the VUB's PENELOPE API to extract causal frames from the text using the framework developed by the Evolutionary and Hybrid AI (EHAI) group. | |
Sentence split | Split a body of posts into discrete sentences. Output file has one row per sentence, containing the sentence and post ID. | |
Extract similar words | Uses a Word2Vec model to find words used in a similar context | After tokenisation and model building |
Tf-idf | Get the tf-idf values of tokenised text. Works better with more documents (e.g. time-separated). | After tokenisation |
Tokenise | Splits the post body texts in separate words (tokens). This data can then be used for text analysis. The output is a list of lists (each list representing all post tokens or tokens per sentence). | |
Visualise LDA Model | Creates a visualisation of the chosen LDA model allowing exploration of the various words in each topic. | After tokenisation |
Top words per topic | Creates a CSV file with the top tokens (words) per topic in the generated topic model, and their associated weights. | After tokenisation and model building |
Generate topic models | Creates topic models per tokenset using Latent Dirichlet Allocation (LDA). For a given number of topics, tokens are assigned a relevance weight per topic, which can be used to find clusters of related words. | After tokenisation |
Count words | Counts all tokens so they are transformed into word => frequency counts. This is also known as a bag of words. | After tokenisation |
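
To make the Tf-idf entry above more concrete, here is a minimal sketch that takes tokenised documents (a list of token lists, like the output of the tokeniser) and computes one common tf-idf variant: term frequency relative to document length, multiplied by the natural logarithm of the inverse document frequency. Which variant 4CAT itself uses is not stated here, so treat both the formula and the function name `tf_idf` as assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute tf-idf scores for a list of tokenised documents.

    Returns one dict per document that maps each token to its score.
    """
    n_docs = len(documents)

    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            token: (count / total) * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        })
    return scores
```

A token that occurs in every document gets a score of zero under this variant, which is one reason the processor works better with more documents.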
Name | Description | Usage |
---|---|---|
Debate metrics | Returns a CSV file with meta-metrics per thread. | |
Name | Description | Usage |
---|---|---|
Twitter Statistics | Contains the number of tweets, number of tweets with links, number of tweets with hashtags, number of tweets with mentions, number of retweets, and number of replies | |
Custom Statistics | Group tweets by category and count tweets per timeframe to collect aggregate group statistics. For retweets and quotes, hashtags, mentions, URLs, and images from the original tweet are included in the retweet/quote. Data on public metrics (e.g., number of retweets or likes of tweets) are as of the time the data was collected. | |
Aggregated Statistics | Group tweets by category and count tweets per timeframe and then calculate aggregate group statistics (i.e. min, max, average, Q1, median, Q3, and trimmed mean): number of tweets, urls, hashtags, mentions, etc. Use for example to find the distribution of the number of tweets per author and compare across time. | |
Aggregated Statistics Visualization | Gathers Aggregated Statistics data and creates Box Plots visualising the spread of intervals. A large number of intervals will not properly display. | |
Hashtag Statistics | Lists by hashtag how many tweets contain hashtags, how many times those tweets have been retweeted/replied to/liked/quoted, and information about unique users and hashtags used alongside each hashtag. For retweets and quotes, hashtags from the original tweet are included in the retweet/quote. | |
Identical Tweet Frequency | Groups tweets by text and counts the number of times they have been (re)tweeted identically. | |
Mentions Export | Identifies mention types and creates a mentions table (tweet id, from author id, from username, to user id, to username, mention type). | |
Source Statistics | Lists by source of tweet how many tweets contain hashtags, how many times those tweets have been retweeted/replied to/liked/quoted, and information about unique users and hashtags used alongside each hashtag. For retweets and quotes, hashtags from the original tweet are included in the retweet/quote. | |
Individual User Statistics | Lists users and their number of tweets, number of followers, number of friends, how many times they are listed, their UTC time offset, whether the user has a verified account and how many times they appear in the data set. | |
User Visibility | Collects usernames and totals how many tweets are authored by the user and how many tweets mention the user. | |
Name | Description | Usage |
---|---|---|
Histogram | Generates a histogram (bar graph) from time frequencies. | |
Chart diachronic nearest neighbours | Visualise nearest neighbours of a given query across all models and show the closest neighbours per model in one combined graph. Based on the 'HistWords' algorithm by Hamilton et al. | |
Download images | Download images and store in a zip file. May take a while to complete as images are retrieved externally. Note that not all images can always be saved. For imgur galleries, only the first image is saved. For animations (GIFs), only the first frame is saved if available. A JSON metadata file is included in the output archive. 4chan datasets should include the image_md5 column. | |
Download Telegram images | Download images and store in a zip file. Downloads through the Telegram API might take a while. Note that not all images can always be retrieved. A JSON metadata file is included in the output archive. | |
Image wall | Put all images in a single combined image. Images can be sorted and resized. | |
Create PixPlot visualisation | Put all images from an archive into a PixPlot visualisation: an explorable map of images algorithmically grouped by similarity. | Requires a separate PixPlot service, to be configured by the 4CAT admin |
Side-by-side graphs | Generate area graphs showing prevalence per item over time. These are visualised side-by-side on an isometric plane for easy comparison. | |
RankFlow diagram | Create a diagram showing changes in prevalence over time for ranked lists (following Bernhard Rieder's RankFlow). | |
Word tree | Generates a word tree for a given query, a "graphical version of the traditional 'keyword-in-context' method" (Wattenberg & Viégas, 2008). | |
Word cloud | Generates a word cloud with words sized on occurrence. | |
YouTube thumbnails image wall | Make an image wall from YouTube video thumbnails. | |
Some processors may require additional setup or modification. Processors can be configured by 4CAT administrators via the '4CAT Settings' navigation menu option.
🐈🐈🐈🐈