# On-demand tasks creation
Zimfarm's core is a machinery to create ZIM files from recipes (`schedules`). It is mainly used via its JS frontend (https://farm.openzim.org), focused on the Kiwix Catalog, a large set of routinely created ZIM files from identified resources.
Zimfarm can also be leveraged for one-shot tasks that produce unique ZIMs which won't make it into the Kiwix Catalog. That's the case for the ZIMs produced by:

- youzim.it, to create ZIMs from generic websites.
- wp1, to create ZIMs from custom selections of Wikimedia project articles.
- [WIP] nautilus, to create ZIMs from a collection of user-uploaded files.
This document gives pointers to help such projects interact with Zimfarm.
## Requirements

- A Zimfarm API endpoint. Can be https://api.farm.youzim.it/v1 (most likely) or https://farm.openzim.org/v1
- A `manager` account on said API (`username` and `password`)
- An S3 setup (those services usually upload to S3 buckets):
  - an S3 bucket
  - S3 credentials
  - an automatic expiration policy on the bucket (files automatically deleted after *x* days)
- ZF worker(s) configured to accept your tasks (`offliner`) and present on the S3 upload whitelist for the bucket
## Workflow

ZF was initially created for routine ZIM creation. Due to this historic requirement, one cannot simply request a task: you must first create a schedule, then request a task for the created schedule. The schedule can then be removed.
A typical usage follows:

- Authenticating
- Creating a schedule
- Requesting a `task_id` on the created schedule
- Deleting the created schedule
- [async]
  - querying the status of the task
  - accepting the callback HTTP request
## Authenticating

The Zimfarm API is mostly public, in read-only mode. Requesting tasks obviously requires authentication. Check the API documentation or refer to this simple implementation in Python.
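As a rough sketch (assuming the `/auth/authorize` endpoint accepts form-encoded credentials and returns an `access_token`; check the API documentation for the exact contract):

```python
import requests

ZIMFARM_API = "https://api.farm.youzim.it/v1"

def get_token(username: str, password: str) -> str:
    # exchange credentials for a short-lived access token
    resp = requests.post(
        f"{ZIMFARM_API}/auth/authorize",
        data={"username": username, "password": password},
    )
    resp.raise_for_status()
    return resp.json()["access_token"]

# subsequent authenticated calls pass the token in the Authorization header
token = get_token("my-manager", "my-password")
headers = {"Authorization": f"Token {token}"}
```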
## Creating a schedule

This is the part that requires the most care and the most knowledge of Zimfarm. It basically contains everything about what you want to do.

It's as simple as a JSON POST to `/schedules` (expect a `201` HTTP response), but the content needs attention:
```json
{
  "name": "Unique name",
  "language": {"code": "eng", "name_en": "English", "name_native": "English"},
  "category": "other",
  "periodicity": "manually",
  "tags": [],
  "enabled": true,
  "config": ...,
  "notification": ...
}
```
- `name` is a unique name for your temporary schedule. Add some randomness but include a human-relevant part as well to ease debugging.
- `language` is a ZF-only property used solely on the farm UI, so just use English as described. It has no purpose here.
- `category`: same as for `language`.
- `periodicity`: set it to `manually`, otherwise the scheduler might attempt to run it again (if not deleted by then).
- `tags`: same as `language`.
- `enabled`: set it to `true` or your request call will fail. This says the schedule is usable.
- `config`: the meat of the schedule, detailed below.
- `notification`: configures the webhook callback (see below).
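Putting it together, a minimal sketch of the creation call, reusing `requests`, `ZIMFARM_API` and `headers` from the authentication example above (the `config` and `notification` payloads are detailed in the next sections):

```python
import uuid

# random suffix plus a human-relevant part to ease debugging
schedule_name = f"ondemand-mysite-{uuid.uuid4().hex[:8]}"

config: dict = {}        # built as described in the config section below
notification: dict = {}  # built as described in the notification section below

resp = requests.post(
    f"{ZIMFARM_API}/schedules",
    json={
        "name": schedule_name,
        "language": {"code": "eng", "name_en": "English", "name_native": "English"},
        "category": "other",
        "periodicity": "manually",
        "tags": [],
        "enabled": True,
        "config": config,
        "notification": notification,
    },
    headers=headers,
)
assert resp.status_code == 201  # schedule created
```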
### config

This sets all the details about the task to be created (actually, `config` is copied over from the schedule to the task).
```json
{
  "task_name": "mwoffliner",
  "warehouse_path": "/other",
  "image": {
    "name": "ghcr.io/openzim/mwoffliner",
    "tag": "latest"
  },
  "resources": {
    "cpu": 3,
    "memory": 1000000000,
    "disk": 1000000000,
    "shm": 1000000000,
    "cap_add": ["SYS_ADMIN", "NET_ADMIN"]
  },
  "platform": null,
  "monitor": false,
  "flags": {"flag": "flag-value"}
}
```
- `task_name` is the name of the offliner/scraper that will be called to create a ZIM from your request. Check the list of supported ones. Warning: worker machines are configured with a list of offliners they accept to run. Make sure there will be workers for your tasks.
- `warehouse_path` (mandatory): sub-folder inside the upload destination (or a prefix for your S3 uploads in your bucket).
- `image`: OCI (Docker) image name and tag. Use the ghcr.io-hosted images built by openzim (others won't work) but choose your tag freely. On-demand services tend to use `latest`.
- `resources`: `cpu`, `memory` and `disk` are mandatory. This works as a reservation of resources on the worker: if you request 4 GB of RAM for a schedule/task, the task won't run until a complying worker with 4 GB of free RAM is available. Experiment to find out what resources you need and request the minimum.
- `resources.cpu`: an integer. Use `3` for historical reasons, as the algorithm expects it. Not much use.
- `resources.memory` and `resources.disk`: requested amount of RAM and disk space, in bytes.
- `resources.shm`: only include the key if required. Amount of (additional) RAM to mount as `/dev/shm` in the scraper's container.
- `resources.cap_add`: only include the key if required. Specific extra Docker permissions to assign to the scraper's container. Not supported by all workers.
- `platform`: set it if the scraper you want to run is subject to one (see the Platforms list). It adds further restrictions on concurrent task runs.
- `monitor`: set it to `false`; it's a debugging tool that's not supported by all workers.
- `flags`: dict of CLI parameters for the scraper/offliner. Each offliner is defined in its own file here. Required arguments are marked as such. Boolean args (`--verbose`) are set as `"--verbose": true` for instance.
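For illustration only, a small helper assembling such a `config`; the offliner name, image, resource amounts and flag names below are hypothetical and must be checked against the offliner's own definition:

```python
GiB = 1024**3  # memory and disk are expressed in bytes

def build_config(task_name: str, flags: dict) -> dict:
    # hypothetical resource amounts; experiment and request the minimum
    return {
        "task_name": task_name,
        "warehouse_path": "/other",
        "image": {"name": f"ghcr.io/openzim/{task_name}", "tag": "latest"},
        "resources": {"cpu": 3, "memory": 2 * GiB, "disk": 5 * GiB},
        "platform": None,
        "monitor": False,
        "flags": flags,
    }

# hypothetical flags for a website scrape; real flag names come from the offliner definition
config = build_config("zimit", {"url": "https://example.com", "name": "example"})
```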
### notification

ZF is able to trigger notifications on some task events. It supports e-mail (sent as Zimfarm) via `mailgun` and `slack`, but third-party services usually rely on the `webhook`. It's an HTTP POST request sent by ZF with the `task` document as JSON payload (same as if retrieved from `/tasks/{taskId}`):
```json
{
  "requested": {"webhook": "http://..."},
  "ended": {"webhook": "https://..."}
}
```
- `requested` and `ended` are most likely the events you want to be informed about. See the list of events. Note that all those matching `.complete()` (`failed`, `succeeded`, `canceled`) must be declared here as `ended`.
It is common to include some kind of credentials in this URL to prevent people from abusing your webhook endpoint. `notification` is excluded from ZF unauthenticated API calls.
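A minimal sketch of a receiving endpoint, here using Flask with a hypothetical shared token embedded in the webhook URL (register it as e.g. `https://myservice.example/zimfarm-hook?token=...`):

```python
from flask import Flask, abort, request

app = Flask(__name__)
WEBHOOK_TOKEN = "change-me"  # hypothetical shared secret carried in the URL

@app.route("/zimfarm-hook", methods=["POST"])
def zimfarm_hook():
    if request.args.get("token") != WEBHOOK_TOKEN:
        abort(401)  # reject callers that don't know the secret
    task = request.get_json()  # same document as GET /tasks/{taskId}
    # react to the event, e.g. record the new status of the task
    print(f"task {task.get('_id')} is now {task.get('status')}")
    return "", 204
```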
## Requesting a task

You have now created a schedule. You'll use the `name` you set for it to request a task of its kind. You can do that immediately after creating it.
It's a simple HTTP POST to `/requested-tasks/`, passing it a JSON document with `schedule_names`, a list of schedule names (usually a single one in your case).
It shall return a JSON document containing a `requested` key that's a list of `task_id`s. Keep that ID; that's your task.
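Continuing the sketch above:

```python
resp = requests.post(
    f"{ZIMFARM_API}/requested-tasks/",
    json={"schedule_names": [schedule_name]},
    headers=headers,
)
resp.raise_for_status()
task_id = resp.json()["requested"][0]  # keep it: that's your task
```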
## Deleting the schedule

Once you've received a `task_id`, your schedule is useless. All its information has been copied to the task, and the task has been requested. It will be processed once a worker is available.

You can now safely delete the schedule with an HTTP DELETE to `/schedules/{schedule_name}`.
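For instance:

```python
resp = requests.delete(f"{ZIMFARM_API}/schedules/{schedule_name}", headers=headers)
resp.raise_for_status()  # the temporary schedule is gone; the requested task remains
```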
## Querying the task status

While awaiting the webhook callback, or if you're not using it, you can query the task's status at any moment using its ID.

It's a simple HTTP GET call to `/tasks/{taskId}` and doesn't require authentication.
Some information is removed from this public document (check the source): `notification`, and secrets from `flags` (and thus from the commands as well).
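A sketch, assuming the task document exposes a `status` field:

```python
resp = requests.get(f"{ZIMFARM_API}/tasks/{task_id}")  # no authentication needed
resp.raise_for_status()
print(resp.json().get("status"))  # e.g. "requested", "succeeded", "failed"
```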
## Example implementation

zimit-frontend (youzim.it) has a readable, working implementation.
See https://github.com/openzim/zimit-frontend/blob/main/api/src/routes/requests.py#L54