Martin Olveyra edited this page Aug 17, 2023 · 24 revisions

Introduction

What is shub-workflow

The initial purpose of this library was to provide classes for running workflows on ScrapyCloud and take full advantage of the platform, without relying on the external components required by Airflow or similar tools, from which it took inspiration.

Over time it evolved into much more than that. Currently, shub-workflow is a suite of classes for defining and controlling simple and complex workflows of spiders and scripts running on Zyte's ScrapyCloud platform. It also provides additional tools for specific, frequently needed tasks during workflows, such as data delivery, job cloning, S3 and GCS storage, and AWS SES mailing, along with base classes for defining scripts that perform custom tasks in the context of a workflow.

Many of these additional components come from harvesting, adapting and generalizing ideas coded in many different projects by many people. The net result is a library that gathers good practices and ideas from many contributors, with the aim of promoting their standardization.

There are a couple of related libraries that frequently work together with shub-workflow, because Scrapy spider workflows usually rely on Hubstorage Crawl Frontier (HCF) capabilities:

Moreover, hcf-backend provides a crawl manager subclassed from the shub-workflow base crawl manager class. It facilitates the scheduling of consumer spiders (spiders that consume requests from a frontier) and can be one task of a workflow. This tutorial will show examples of their usage too.

However, workflows defined with shub-workflow are not limited to HCF. Any storage technology can be used and mixed. In practice, the library is used to coordinate workflow pipelines of spiders and post-processing scripts running on ScrapyCloud, using storage technologies like S3 or GCS for massive data exchange between them. The library also provides utilities for working conveniently with those technologies in the context of the workflow pipelines built with it.

Note: This tutorial assumes appropriate knowledge of the ScrapyCloud platform: how to use it, how to deploy code and scripts on it, etc.

Installation

pip install shub-workflow

Shub-workflow vs Airflow

Airflow is a platform with a server and multiple workers, plus a set of tools for defining workflows that run on them. Shub-workflow, on the other hand, is a set of tools for defining workflows that run on ScrapyCloud, plus a suite of ready-to-use base classes for defining crawl workflows. So shub-workflow is a complement to ScrapyCloud, which is already a platform for running jobs in parallel, specialized in crawling tasks. Shub-workflow is not a replacement for Airflow, nor is Airflow a replacement for shub-workflow. Shub-workflow only adds what ScrapyCloud needs in order to handle workflows.

So the question of shub-workflow vs Airflow is misplaced. The decision to use Airflow implies not only skipping shub-workflow, but also replacing, partially or totally, the usage of ScrapyCloud, and reimplementing base components specific to crawling tasks, which ScrapyCloud already handles very well. In addition, even a partial replacement of ScrapyCloud requires a specialized technical team to maintain and support the Airflow servers and workers.

Shub-workflow, on the other hand, allows easy and fast deployment of workflows of crawl and post-processing scripts on ScrapyCloud, with no need for external workflow platforms like Airflow. Using Airflow along with ScrapyCloud in a project in order to run workflows duplicates to a great extent what ScrapyCloud already does well. Airflow is rather a duplication or replacement of ScrapyCloud, not of shub-workflow.

In conclusion, shub-workflow implements some features that Airflow already provides. But choosing Airflow instead of shub-workflow is misguided, as you would add significant costs for replacing, reimplementing and duplicating ScrapyCloud features too. The only scenario where you would prefer Airflow is when, for whatever reason, you have decided not to use the specialized capabilities of ScrapyCloud.

Crawl Managers and Graph Managers

shub-workflow provides two base classes of workflow managers: crawl managers and graph managers.

Crawl managers manage multiple consecutive or parallel spider jobs, while graph managers manage a tree of conditional tasks, where each task can be any kind of script (delivery scripts, post-processing scripts, crawl managers, etc.) or a spider job (not a common usage, as in most cases a crawl manager is a better fit for scheduling spiders). Every workflow, whether managed by a crawl manager or a graph manager, is identified by a flow id and a name, which you can see in the manager job tags (FLOW_ID=... and NAME=... respectively). These tags are required for tracking the progress of a workflow instance, and for resuming it when required.
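To make the tag convention concrete, here is a minimal sketch of how the flow id and name could be read back from a list of job tags. The function name `parse_workflow_tags` is hypothetical and is not part of shub-workflow; only the `FLOW_ID=` and `NAME=` prefixes come from the description above.

```python
from typing import Iterable, Optional, Tuple


def parse_workflow_tags(tags: Iterable[str]) -> Tuple[Optional[str], Optional[str]]:
    """Hypothetical helper (not a shub-workflow API).

    Returns (flow_id, name) extracted from tags of the form
    FLOW_ID=<value> and NAME=<value>. Missing tags yield None.
    """
    flow_id = None
    name = None
    for tag in tags:
        # Split only on the first "=" so values may contain "=" themselves.
        key, sep, value = tag.partition("=")
        if not sep:
            continue  # not a KEY=VALUE tag, ignore it
        if key == "FLOW_ID":
            flow_id = value
        elif key == "NAME":
            name = value
    return flow_id, name
```

For example, a job tagged with `FLOW_ID=1234-abcd` and `NAME=mycrawl` among other unrelated tags would yield the pair `("1234-abcd", "mycrawl")`.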

Both crawl managers and graph managers can be resumed. The first concept to grasp is that there is no generic way to resume every workflow: resuming actions depend on the specific application. However, some basic resuming actions are common to all cases and sufficient for most of them. The shub-workflow library implements these as default resume hooks, which minimizes custom resuming engineering. More complicated cases may need to override them.

When you schedule a new manager, the flow id is autogenerated. The name, by contrast, must always be provided from the start, either as a hardcoded attribute or as a command line option. If you want to resume a crawl manager or a graph manager that stopped for any reason, you can just clone the stopped job. This operation preserves the FLOW_ID. Setting the flow id from the start puts the manager in resume mode, and the default resume hooks search for running and finished jobs belonging to the same workflow in order to acquire them. If you are not running a manager in ScrapyCloud, you can also use the command line option --flow-id (and --name if required) to set the flow id from the start and force resume mode.
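The rule described above (a flow id given from the start means resume mode, otherwise a fresh id is generated) can be illustrated with a small standalone sketch. This is not the library's implementation: the function `resolve_flow` is hypothetical, and generating the id with `uuid4` is an assumption made for illustration. Only the `--flow-id` and `--name` option names are taken from the text.

```python
import argparse
import uuid
from typing import List, Optional, Tuple


def resolve_flow(argv: List[str], default_name: Optional[str] = None) -> Tuple[str, str, bool]:
    """Hypothetical sketch (not shub-workflow code).

    Returns (flow_id, name, resume_mode) from command line arguments.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--flow-id")
    # default_name stands in for a name hardcoded as a manager attribute.
    parser.add_argument("--name", default=default_name)
    args = parser.parse_args(argv)

    if args.name is None:
        parser.error("a workflow name must be provided")

    # A flow id provided from the start forces resume mode.
    resume_mode = args.flow_id is not None
    # Assumption: a new flow id is a random UUID string.
    flow_id = args.flow_id if resume_mode else str(uuid.uuid4())
    return flow_id, args.name, resume_mode
```

Calling `resolve_flow(["--name", "mycrawl"])` starts a new workflow with a fresh id, while `resolve_flow(["--flow-id", "1234-abcd", "--name", "mycrawl"])` reuses the given id and reports resume mode.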

Once all the running and finished jobs have been found and acquired by the manager, it knows everything it needs to avoid rescheduling jobs with parameters or tasks that were already sent, and to avoid scheduling more jobs than the maximum number of running jobs set for the manager. The result is a clean resume.
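The bookkeeping behind a clean resume can be sketched in a few lines. This is an illustrative model only, under stated assumptions: the function `plan_after_resume`, the job dictionaries with `params` and `state` keys, and the `"running"`/`"finished"` state strings are all hypothetical and do not reflect shub-workflow's internal data structures.

```python
from typing import Dict, List


def plan_after_resume(
    all_params: List[Dict[str, object]],
    acquired_jobs: List[Dict[str, object]],
    max_running: int,
) -> List[Dict[str, object]]:
    """Hypothetical sketch (not shub-workflow code).

    Decides which parameter sets may be scheduled right after a resume:
    - parameter sets already sent (running or finished jobs) are skipped;
    - no more jobs are scheduled than the free slots left under max_running.
    """
    # Parameter sets already used by acquired jobs, in a hashable form.
    already_sent = {frozenset(job["params"].items()) for job in acquired_jobs}
    # Only running jobs occupy a slot; finished ones do not.
    running = sum(1 for job in acquired_jobs if job["state"] == "running")
    free_slots = max(0, max_running - running)

    pending = [p for p in all_params if frozenset(p.items()) not in already_sent]
    return pending[:free_slots]
```

With four parameter sets, one acquired job finished, one still running, and a limit of two running jobs, only one of the two pending parameter sets is scheduled immediately. The other waits until a slot becomes free.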


Next Chapter: Credentials Setup