How to make a processor

Stijn Peeters edited this page Dec 11, 2019 · 29 revisions

Processors

4CAT is a modular tool. Its modules come in two varieties: data sources and processors. This article covers the latter.

Processors are bits of code that produce a dataset. Typically, their input is another dataset, so they can be used to analyse data. For example, a processor can take a csv file containing posts as input, count how many posts occur per month, and produce another csv file with the number of posts per month (one month per row) as output.
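The analysis described above can be sketched in plain Python, outside of 4CAT, to show what such a processor would compute. This is only an illustration under assumptions: the function name count_posts_per_month, the timestamp column and its date format, and the sample data are all hypothetical and not part of 4CAT.

```python
import csv
import io
from collections import Counter
from datetime import datetime


def count_posts_per_month(rows, timestamp_column="timestamp"):
    """Count rows per calendar month, returning one dictionary per month."""
    counts = Counter()
    for row in rows:
        # assumed timestamp format; real datasets may use another one
        posted = datetime.strptime(row[timestamp_column], "%Y-%m-%d %H:%M:%S")
        counts[posted.strftime("%Y-%m")] += 1
    # one output row per month, in chronological order
    return [{"month": month, "posts": counts[month]} for month in sorted(counts)]


# hypothetical input, standing in for a csv file containing posts
sample_csv = io.StringIO(
    "timestamp,body\n"
    "2019-11-03 10:15:00,First post\n"
    "2019-11-20 18:40:00,Second post\n"
    "2019-12-01 09:00:00,Third post\n"
)
monthly_counts = count_posts_per_month(csv.DictReader(sample_csv))

# write the result as csv, one month per row
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=["month", "posts"])
writer.writeheader()
writer.writerows(monthly_counts)
print(output.getvalue())
```

In an actual processor, this logic would live in the process method, and reading the input and storing the output would be left to 4CAT.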

4CAT has an API that handles most of the scaffolding around this for you. Processors can therefore be quite lightweight and focus on the analysis, while 4CAT's back-end takes care of scheduling, determining where the output should go, et cetera.

Example

This is a minimal example of a 4CAT processor:

from backend.abstract.processor import BasicProcessor

class ExampleProcessor(BasicProcessor):
	type = "example-processor"
	category = "Examples"
	title = "A simple example"
	description = "This doesn't do much"
	extension = "csv"

	input = "csv:body"
	output = "csv:value"

	def process(self):
		data = {"value": "Hello world!"}
		self.write_csv_and_finish(data)

Or, annotated:

"""
A minimal example 4CAT processor
"""
from backend.abstract.processor import BasicProcessor

class ExampleProcessor(BasicProcessor):
	"""
	Example Processor
	"""
	type = "example-processor"  # job type ID
	category = "Examples"  # category, used to group processors in the UI
	title = "A simple example"  # title displayed in UI
	description = "This doesn't do much"  # description displayed in UI
	extension = "csv"  # extension of result file, used internally and in UI

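	input = "csv:body"  # input schema: a CSV file with at least a "body" column
	output = "csv:value"  # output schema: a CSV file with a "value" column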

	def process(self):
		"""
		Saves a CSV file with one column ("value") and one row with a value
		("Hello world!") and marks the dataset as finished.
		"""
		data = {"value": "Hello world!"}
		self.write_csv_and_finish(data)

Processor properties

Processor settings and metadata are stored as class properties. The following properties are available and can be set:

  • type (string) - A unique identifier for this processor type. Used internally to queue jobs, store results, et cetera. Should be a string without any spaces in it.
  • category (string) - A category for this processor, used in the 4CAT web interface to group processors thematically. Displayed as-is in the web interface.
  • title (string) - Processor title, displayed as-is in the web interface.
  • description (string) - Description, displayed in the web interface. Markdown in the description will be parsed.
  • extension (string) - Extension for files produced by this processor, e.g. csv or json.
  • input (string) - Simple schema for the input files required by this processor, in the format type[:fields]. For example, csv:column1,column2 means the processor requires a CSV file with at least the columns column1 and column2 as input. The field definition is optional, i.e. just csv is also valid.
  • output (string) - Simple schema for the files produced by this processor.
  • option (dictionary) - A dictionary of configurable parameters for this processor. These will be displayed in the web interface as an HTML form so users can configure output. This takes
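To make the type[:fields] format used by input and output more concrete, the following sketch shows one way such a string could be interpreted. It is not 4CAT's own parser: parse_schema and missing_columns are hypothetical helper names introduced here purely for illustration.

```python
def parse_schema(schema):
    """Split a 'type[:fields]' string into a file type and a list of fields."""
    file_type, _, field_spec = schema.partition(":")
    # the field definition is optional, so field_spec may be empty
    fields = [field.strip() for field in field_spec.split(",") if field.strip()]
    return file_type, fields


def missing_columns(schema, available_columns):
    """List the fields required by the schema that the input file lacks."""
    _, required = parse_schema(schema)
    return [column for column in required if column not in available_columns]


print(parse_schema("csv:column1,column2"))
print(parse_schema("csv"))
print(missing_columns("csv:body", ["id", "timestamp"]))
```

Read this way, csv:body means "a CSV file with at least a body column", and a file that only has id and timestamp columns would not satisfy it.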

Processor API

The full API available to processors is as follows. All of these are members of the processor object, i.e. they are accessed via self.property within the processor code.