Validate command should output a warning if seed URL domains don't match main domain #342

@navarone-feekery

If you call bin/crawler validate <file> for a configuration like the one below, the response says it's a valid URL. That's because the domain itself is valid; the seed URLs, however, are invalid because they belong to a different domain.

domains:
  - url: https://example.com
    seed_urls:
      - https://example2.com

The above configuration will crawl, but only because the main domain is used as a fallback seed URL. There are no warnings or errors about this misconfiguration during the crawl.

Invalid seed URLs are simply discarded when building the initial seed URL array to begin the crawl. This discarding is silent (no logs), so as a user, figuring out what is wrong with my seed URLs can be very confusing.

So potential improvements:

  • bin/crawler validate should check seed URL validity
  • Invalid seed URLs should be logged
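
As a rough illustration of both improvements, here is a minimal Python sketch of the check. It is not the crawler's actual code: the function names (`origin`, `mismatched_seed_urls`, `validate_seed_urls`) and the logger name are hypothetical, and it assumes that "matching" means the same scheme, host and port as the domain's `url`. The config is represented as a plain dict mirroring the YAML above.

```python
import logging
from urllib.parse import urlsplit

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("crawler.validate")


def origin(url):
    """Return (scheme, host, port) for a URL; port is None when not given."""
    parts = urlsplit(url)
    return (parts.scheme.lower(), (parts.hostname or "").lower(), parts.port)


def mismatched_seed_urls(domain_config):
    """Return the seed URLs whose origin differs from the domain's `url`."""
    domain_origin = origin(domain_config["url"])
    seeds = domain_config.get("seed_urls") or []
    return [seed for seed in seeds if origin(seed) != domain_origin]


def validate_seed_urls(config):
    """Log one warning per mismatched seed URL and return the messages."""
    warnings = []
    for domain in config.get("domains") or []:
        for seed in mismatched_seed_urls(domain):
            message = f"seed URL {seed} does not match domain {domain['url']}"
            logger.warning(message)
            warnings.append(message)
    return warnings


# Same shape as the YAML configuration in this issue.
config = {
    "domains": [
        {"url": "https://example.com", "seed_urls": ["https://example2.com"]},
    ]
}

if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    for warning in validate_seed_urls(config):
        print(warning)
```

Returning the messages as well as logging them would let the validate command print them while the crawl itself only logs. Note that this sketch compares ports literally, so `https://example.com` and `https://example.com:443` would be treated as different; a real implementation would probably normalize default ports first.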
