# spider

A microservice that crawls a set of sites by following links to pages within the relevant domains. Only URLs belonging to the provided host(s) are considered; links to other hosts are ignored. Newly discovered URLs are written to a GraphQL database.
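Since this README shows no code, here is a minimal, hypothetical sketch of the scoping rule in Go: a link is followed only if its host is one of the configured hosts. All names here are illustrative assumptions, not the spider's actual API.

```go
package main

import (
	"fmt"
	"net/url"
)

// inScope reports whether a discovered link belongs to one of the
// configured hosts and should therefore be crawled.
// Hypothetical helper; not taken from the spider source.
func inScope(link string, allowedHosts map[string]bool) bool {
	u, err := url.Parse(link)
	if err != nil {
		return false // unparseable links are skipped
	}
	return allowedHosts[u.Hostname()]
}

func main() {
	allowed := map[string]bool{"example.org": true}
	fmt.Println(inScope("https://example.org/about", allowed))  // true
	fmt.Println(inScope("https://other.example/page", allowed)) // false
}
```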

## Used Frameworks / Libraries

(not comprehensive, but the most important ones)

## Configuration

Configuration is done via environment variables. The following parameters are available; a sketch of how they might be loaded appears after the list.

- `API_URL` - GraphQL API URL (required)
- `AUTH_SECRET` - GraphQL authentication secret (required)
- `SCRAPE_PARALLELISM` - number of pages the crawler visits in parallel (default: 100)
- `SCRAPE_INTERVAL` - time interval between page hits (default: 500 ms)
- `SCRAPE_TIMEOUT` - timeout for each page load attempt (default: 20,000 ms)
- `SHUTDOWN_TIMEOUT` - time after which the spider exits if no new URLs have been found (default: 15,000 ms)
- `MAX_RETRIES` - maximum number of retries after a page load attempt has failed (default: 0)
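As an illustration of how these variables could be consumed, the Go sketch below reads them with the documented defaults, treating the duration values as integer milliseconds (an assumption based on the defaults above). The struct and helper names are hypothetical; the spider's actual configuration code is not shown in this README.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"strconv"
	"time"
)

// config mirrors the environment variables documented above.
// Struct and field names are illustrative, not from the spider source.
type config struct {
	APIURL            string
	AuthSecret        string
	ScrapeParallelism int
	ScrapeInterval    time.Duration
	ScrapeTimeout     time.Duration
	ShutdownTimeout   time.Duration
	MaxRetries        int
}

// envInt returns the integer value of an environment variable, or def if
// the variable is unset or not a valid integer.
func envInt(key string, def int) int {
	if v := os.Getenv(key); v != "" {
		if n, err := strconv.Atoi(v); err == nil {
			return n
		}
	}
	return def
}

func loadConfig() config {
	apiURL := os.Getenv("API_URL")
	secret := os.Getenv("AUTH_SECRET")
	if apiURL == "" || secret == "" {
		log.Fatal("API_URL and AUTH_SECRET are required")
	}
	return config{
		APIURL:            apiURL,
		AuthSecret:        secret,
		ScrapeParallelism: envInt("SCRAPE_PARALLELISM", 100),
		ScrapeInterval:    time.Duration(envInt("SCRAPE_INTERVAL", 500)) * time.Millisecond,
		ScrapeTimeout:     time.Duration(envInt("SCRAPE_TIMEOUT", 20000)) * time.Millisecond,
		ShutdownTimeout:   time.Duration(envInt("SHUTDOWN_TIMEOUT", 15000)) * time.Millisecond,
		MaxRetries:        envInt("MAX_RETRIES", 0),
	}
}

func main() {
	cfg := loadConfig()
	fmt.Printf("%+v\n", cfg)
}
```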