
On-demand task creation


Zimfarm's core is machinery that creates ZIM files from recipes (schedules). It is mainly used via its JS frontend (https://farm.openzim.org), which focuses on the Kiwix Catalog, a large set of ZIM files routinely created from identified resources.

Zimfarm can also be leveraged for one-shot tasks that produce unique ZIMs which won't make it into the Kiwix Catalog. That's the case for the ZIMs produced by:

  • youzim.it, to create ZIMs from generic websites.
  • wp1, to create ZIMs from custom selections of Wikimedia project articles.
  • [WIP] nautilus, to create ZIMs from a collection of user-uploaded files.

This document gives pointers to help such projects interact with Zimfarm.

Requirements

  • A Zimfarm endpoint. Can be https://api.farm.youzim.it/v1 (most likely) or https://farm.openzim.org/v1
  • A manager account on said API (username and password)
  • An S3 setup, as those services usually upload their ZIMs to S3 buckets:
    • an S3 bucket
    • S3 credentials
    • an automatic expiration policy on the bucket (files automatically deleted after a set number of days)
  • ZF worker(s) configured to accept your tasks (offliner) and present on the S3 upload whitelist for the bucket

Process & Implementation

ZF was initially created for routine ZIM creation. Due to this historic requirement, one cannot simply request a task: you must first create a schedule, then request a task for that schedule. The schedule can then be removed.

A typical usage follows:

  • Authenticating
  • Creating a schedule
  • Requesting a task_id for the created schedule
  • Deleting the created schedule
  • [async]
    • querying the status of the task
    • accepting the callback HTTP request

Authentication

The Zimfarm API is mostly public, in read-only mode. Requesting tasks obviously requires authentication. Check the API documentation or refer to this simple implementation in Python.
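
For instance, a minimal sketch in Python (using the requests library), assuming the /auth/authorize endpoint and header-passed credentials used by the zimit-frontend reference implementation linked at the bottom of this page:

import requests

ZF_URL = "https://api.farm.youzim.it/v1"  # or https://farm.openzim.org/v1

def get_token(username: str, password: str) -> str:
    # credentials are passed as headers, as in the reference implementation
    resp = requests.post(
        f"{ZF_URL}/auth/authorize",
        headers={"username": username, "password": password},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]  # a refresh_token is returned as well

The snippets below reuse this helper and pass the token in an Authorization: Token {access_token} header, the scheme used by the reference implementation.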

Creating a schedule

This is the part that requires the most care and the most knowledge of Zimfarm. It basically contains everything about what you want to do.

It's as simple as a JSON POST to /schedules (expect a 201 HTTP response), but the content needs attention; a request sketch follows the field descriptions below.

{
  "name": "Unique name",
  "language": {"code": "eng", "name_en": "English", "name_native": "English"},
  "category": "other",
  "periodicity": "manually",
  "tags": [],
  "enabled": true,
  "config": ...,
  "notification": ...
}
  • name is a unique name for your temporary schedule. Add some randomness but include a human-relevant part as well to ease debugging.
  • language is a ZF-only property used solely on the farm UI, so just use English as described. It has no purpose here.
  • category: same as for language.
  • periodicity: set it to manually, otherwise the scheduler might attempt to run it again (if it hasn't been deleted by then).
  • tags: same as for language.
  • enabled: set it to true or your request call will fail. This says the schedule is usable.
  • config: the meat of the schedule, detailed below.
  • notification: configures the webhook callback (see below).
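
Putting this together, a minimal sketch of the creation call (reusing get_token and ZF_URL from above; payload is the JSON document described here):

def create_schedule(token: str, payload: dict) -> None:
    # a 201 response signals successful creation
    resp = requests.post(
        f"{ZF_URL}/schedules",
        json=payload,
        headers={"Authorization": f"Token {token}"},
        timeout=30,
    )
    if resp.status_code != 201:
        raise RuntimeError(f"schedule creation failed: {resp.status_code} {resp.text}")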

config

This sets all the details about the task to be created (actually, config is copied over from the schedule to the task).

{
  "task_name": "mwoffliner",
  "warehouse_path": "/other",
  "image": {
    "name": "ghcr.io/openzim/mwoffliner",
    "tag": "latest"
  },
  "resources": {
    "cpu": 3,
    "memory": 1000000000,
    "disk": 1000000000,
    "shm": 1000000000,
    "cap_add": ["SYS_ADMIN", "NET_ADMIN"]
  },
  "platform": null,
  "monitor": false,
  "flags": {"flag": "flag-value"}
}
  • task_name is the name of the offliner/scraper that will be called to create a ZIM off your request. Check the list of supported ones. Warning: worker machines configure a list of offliners they accept to run. Make sure there will be workers for your tasks.
  • warehouse_path (mandatory): sub-folder inside the upload destination (or a prefix for your S3 uploads in your bucket).
  • image: OCI (Docker) image name and tag. Use the ghcr.io-hosted images built by openzim (others won't work) but choose your tag freely. On-demand services tend to use latest.
  • resources: cpu, memory and disk are mandatory. This works as a reservation of resources on the worker. If you request 4 GB of RAM for a schedule/task, the task won't run until a complying worker with 4 GB of free RAM is available. Experiment to find out what resources you need and request the minimum.
  • resources.cpu: an integer. Use 3 for historical reasons, as the algorithm expects it. Not much use otherwise.
  • resources.memory and resources.disk: requested amounts of RAM and disk space, in bytes.
  • shm: only include the key if required. Amount of (additional) RAM space to mount as /dev/shm in the scraper's container.
  • cap_add: only include the key if required. Specific extra Docker capabilities to assign to the scraper's container. Not supported by all workers.
  • platform: set it if the scraper you want to run is subject to one (see the Platforms list). It adds further restrictions on concurrent task runs.
  • monitor: set it to false; it's a debugging tool that's not supported by all workers.
  • flags: dict of CLI parameters for the scraper/offliner. Each offliner is defined in its own file here. Required arguments are marked as such. Boolean args (--verbose) are set with a true value ("verbose": true, for instance). A hypothetical example follows.
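
As an illustration, a hypothetical flags dict for an mwoffliner task could look like the following; the values are placeholders and the exact set of keys must be taken from the offliner's definition file:

flags = {
    # keys mirror the offliner's CLI parameters, without the leading dashes
    "mwUrl": "https://en.wikipedia.org/",  # placeholder target wiki
    "adminEmail": "contact@example.com",   # placeholder contact address
    "verbose": True,                       # boolean flag (--verbose)
}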

notification

ZF is able to trigger notifications on some task events. It supports e-mail (sent as Zimfarm) via mailgun as well as Slack, but third-party services usually rely on the webhook. It's an HTTP POST request sent by ZF with the task document as JSON payload (the same as if retrieved from /tasks/{taskId}).

⚠️ The webhook is attempted once per matching event. Should it fail, there is no retry mechanism. It doesn't support any kind of authentication.

{
   "requested": {"webhook": "http://..."},
   "ended": {"webhook": "https://..."}
}
  • requested and ended are most likely the events you want to be informed about. See the list of events. Note that all events matching .complete() (failed, succeeded, canceled) must be declared here as ended.

It is common to include some kind of credential in this URL to prevent people from abusing your webhook endpoint (see the sketch below). notification is excluded from ZF unauthenticated API calls.
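
A minimal sketch of a receiving endpoint, here with FastAPI (any web framework works); the path and token query parameter are illustrative, matching a notification URL such as https://example.com/zimfarm-webhook?token=SECRET:

from fastapi import FastAPI, HTTPException, Request

app = FastAPI()
WEBHOOK_TOKEN = "SECRET"  # random value embedded in the notification URL

@app.post("/zimfarm-webhook")
async def zimfarm_webhook(request: Request, token: str = ""):
    # reject calls that don't carry the credential we put in the URL
    if token != WEBHOOK_TOKEN:
        raise HTTPException(status_code=401, detail="bad token")
    task = await request.json()  # same document as GET /tasks/{taskId}
    if task.get("status") == "succeeded":
        ...  # e.g. record the produced ZIM file(s)
    return {"status": "ok"}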

Requesting a task from a schedule

You have now created a schedule. You'll use the name you set for it to request a task of its kind. You can do that immediately after creating it.

It's a simple HTTP POST to /requested-tasks/, passing it a JSON document with schedule_names, a list of schedule names (usually a single one in your case).

It shall return a JSON document containing a requested key that's a list of task_ids. Keep that ID; that's your task.
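
A sketch of that call, under the same assumptions as the snippets above:

def request_task(token: str, schedule_name: str) -> str:
    resp = requests.post(
        f"{ZF_URL}/requested-tasks/",
        json={"schedule_names": [schedule_name]},
        headers={"Authorization": f"Token {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    requested = resp.json()["requested"]
    if not requested:
        raise RuntimeError("no task was requested; check the schedule")
    return requested[0]  # your task_id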

Deleting created schedule

Once you've received a task_id, your schedule is useless. All its information has been copied to the task, and the task has been requested. It will be processed once a worker is available.

You can now safely delete the schedule with an HTTP DELETE to /schedules/{schedule_name}.
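
For instance, reusing the helpers above:

def delete_schedule(token: str, schedule_name: str) -> None:
    resp = requests.delete(
        f"{ZF_URL}/schedules/{schedule_name}",
        headers={"Authorization": f"Token {token}"},
        timeout=30,
    )
    resp.raise_for_status()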

Querying task status

While awaiting the webhook callback, or if you're not using it, you can, at any moment, query the task's status using its ID.

It's a simple HTTP GET call to /tasks/{taskId} and doesn't require authentication.

⚠️ Some privacy-concerning information is stripped from the publicly available endpoint, so authenticate your requests to be sure to retrieve everything.

Removed information (check the source): notification and secrets from flags (and thus from commands as well).
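
A sketch of a polling helper; the optional token reuses the authentication scheme assumed above:

def get_task(task_id: str, token: str | None = None) -> dict:
    # unauthenticated calls work but omit notification and secret flags
    headers = {"Authorization": f"Token {token}"} if token else {}
    resp = requests.get(f"{ZF_URL}/tasks/{task_id}", headers=headers, timeout=30)
    resp.raise_for_status()
    return resp.json()

# usage: get_task(task_id)["status"] is e.g. requested, started, succeeded...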

Reference implementation

zimit-frontend (youzim.it) has a readable, working implementation.

See https://github.com/openzim/zimit-frontend/blob/main/api/src/routes/requests.py#L54