Home
- Table of Contents and Introduction
- General Description
- Crawl Managers
- Managing Hubstorage Crawl Frontiers
The initial purpose of this library was to provide classes to run workflows on ScrapyCloud and take full advantage of the platform, without relying on the external components required by Airflow or similar tools, from which it took inspiration.
Over time it evolved into much more than that. Currently, shub-workflow is a suite of classes for defining and controlling simple and complex workflows of spiders and scripts running on the Zyte ScrapyCloud platform. It also provides additional tools for performing specific, frequently needed tasks during workflows, such as data delivery, job cloning, S3 and GCS storage, and AWS SES mailing, as well as base classes for defining scripts meant to perform custom tasks in the context of a workflow.
Many of these additional components come from the harvesting, fitting and generalization of ideas coded in many different projects, developed by many people. So the net result is a library that gathers good practices and ideas from many people, with the aim to promote their standardization.
There are a couple of related libraries that frequently work together with shub-workflow, because Scrapy spider workflows usually rely on Hubstorage Crawl Frontier (HCF) capabilities:
- hcf-backend. An HCF (Hubstorage Crawl Frontier) backend for Frontera.
- scrapy-frontera. A fully functional Scrapy scheduler compatible with Frontera.
Moreover, hcf-backend provides a crawl manager subclassed from the shub-workflow base crawl manager class, which facilitates the scheduling of consumer spiders (spiders that consume requests from a frontier) and can be one task of a workflow. This tutorial also illustrates how to use these libraries.
However, workflows defined with shub-workflow are not limited to HCF. Any storage technology can be used and mixed. In practice, the library is used to coordinate workflow pipelines of spiders and post-processing scripts running on ScrapyCloud, using storage technologies like S3 or GCS for massive data exchange between them. It also provides utilities for working conveniently with those technologies in the context of the workflow pipelines built with it.
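To make the idea of a workflow pipeline concrete, here is a minimal conceptual sketch in plain Python. It does not use the shub-workflow API, which later chapters describe. The task names, the `WORKFLOW` mapping and the `run_task` callback are hypothetical, introduced only for illustration. It shows tasks (spiders and scripts) declared with their dependencies and executed in an order that respects them.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task names (not part of shub-workflow).
# Each task maps to the set of tasks that must finish before it starts.
WORKFLOW = {
    "discovery_spider": set(),
    "consumer_spider": {"discovery_spider"},
    "postprocess_script": {"consumer_spider"},
    "deliver_script": {"postprocess_script"},
}


def execution_order(workflow):
    """Return task names in an order that respects every dependency."""
    return list(TopologicalSorter(workflow).static_order())


def run_workflow(workflow, run_task):
    """Call run_task once per task, dependencies first, and collect results."""
    results = {}
    for name in execution_order(workflow):
        results[name] = run_task(name)
    return results


if __name__ == "__main__":
    # The lambda is a stand-in for scheduling a job on ScrapyCloud.
    run_workflow(WORKFLOW, lambda name: print(f"running {name}"))
```

On a real platform each task would be a scheduled job, and the data passed between tasks would live in external storage such as HCF, S3 or GCS rather than in an in-memory dictionary.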
Note: This tutorial assumes appropriate knowledge of the ScrapyCloud platform: how to use it, how to deploy code and scripts on it, etc.
pip install shub-workflow
Airflow is a platform with a server and multiple workers, and a set of tools for defining workflows that run on them. Shub-workflow, on the other hand, is a set of tools for defining workflows that run on ScrapyCloud, plus a suite of ready-to-use base classes for defining crawl workflows. So shub-workflow is a complement to ScrapyCloud. It is not a replacement for Airflow, nor is Airflow a replacement for shub-workflow. Shub-workflow only adds what ScrapyCloud needs in order to handle workflows.
So the question of shub-workflow vs. Airflow needs to take that into account. The decision to use Airflow implies not only skipping shub-workflow, but also replacing, partially or totally, the usage of ScrapyCloud, and implementing base components specific to crawling tasks that are able to run on the Airflow platform. In addition, even a partial replacement of ScrapyCloud implies having a specialized technical team for maintenance and support of Airflow servers and workers.
Shub-workflow, on the other hand, allows easy and fast deployment of crawl and post-processing script workflows on ScrapyCloud, with no need for external workflow platforms like Airflow.
Next Chapter: General Description