Purge phase fails if primary phase ends with empty queue #381

@seanstory

Bug Description

Originally reported here: #172 (comment)
Easily reproduced with:

output_sink: elasticsearch
output_index: web-crawl-test

elasticsearch:
  host: http://host.docker.internal
  port: 9200
  api_key: <yourkeyhere>
  pipeline_enabled: false

domains:
  - url: https://traderjoes.com

Full error:

[2025-09-04T18:19:06.251Z] [crawl:68b9d81a7453b252bf8e2f94] [primary] Initialized an in-memory URL queue for up to 10000 URLs
[2025-09-04T18:19:06.255Z] [crawl:68b9d81a7453b252bf8e2f94] [primary] ES connections will be authorized with configured API key
[2025-09-04T18:19:06.287Z] [crawl:68b9d81a7453b252bf8e2f94] [primary] Connected to ES at http://host.docker.internal:9200 - version: 9.2.0-SNAPSHOT; build flavor: default
[2025-09-04T18:19:06.550Z] [crawl:68b9d81a7453b252bf8e2f94] [primary] Index [web-crawl-test-2] did not exist, but was successfully created!
[2025-09-04T18:19:06.550Z] [crawl:68b9d81a7453b252bf8e2f94] [primary] Elasticsearch sink initialized for index [web-crawl-test-2] with pipeline disabled
[2025-09-04T18:19:06.561Z] [crawl:68b9d81a7453b252bf8e2f94] [primary] Starting the primary crawl with up to 10 parallel thread(s)...
[2025-09-04T18:19:06.796Z] [crawl:68b9d81a7453b252bf8e2f94] [primary] Following the redirect from 'https://traderjoes.com/robots.txt' to 'https://www.traderjoes.com/robots.txt'...
[2025-09-04T18:19:06.915Z] [crawl:68b9d81a7453b252bf8e2f94] [primary] Error while fetching robots.txt for https://traderjoes.com:443: Forbidden
[2025-09-04T18:19:06.930Z] [crawl:68b9d81a7453b252bf8e2f94] [primary] Crawl status: queue_size=0, pages_visited=1, urls_allowed=1, urls_denied={}, crawl_duration_msec=371, crawling_time_msec=113.0, avg_response_time_msec=113.0, active_threads=1, http_client={:max_connections=>100, :used_connections=>1}, status_codes={"403"=>1}
[2025-09-04T18:19:07.007Z] [crawl:68b9d81a7453b252bf8e2f94] [primary] Crawl queue is empty, finishing the primary crawl
[2025-09-04T18:19:07.008Z] [crawl:68b9d81a7453b252bf8e2f94] [primary] Finished a crawl stage. Result: success; Successfully finished the primary crawl with an empty crawl queue
[2025-09-04T18:19:07.027Z] [crawl:68b9d81a7453b252bf8e2f94] [primary] Search attempt 1/4 failed: '[400] {"error":{"root_cause":[{"type":"query_shard_exception","reason":"No mapping found for [last_crawled_at] in order to sort on","index_uuid":"L4UjVa_TTvmo9O0lbQnwiA","index":"web-crawl-test-2"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"web-crawl-test-2","node":"vOvjAFReRA61cPaCDK1DXg","reason":{"type":"query_shard_exception","reason":"No mapping found for [last_crawled_at] in order to sort on","index_uuid":"L4UjVa_TTvmo9O0lbQnwiA","index":"web-crawl-test-2"}}]},"status":400}'. Retrying in 2.0s..
[2025-09-04T18:19:09.038Z] [crawl:68b9d81a7453b252bf8e2f94] [primary] Search attempt 2/4 failed: '[400] {"error":{"root_cause":[{"type":"query_shard_exception","reason":"No mapping found for [last_crawled_at] in order to sort on","index_uuid":"L4UjVa_TTvmo9O0lbQnwiA","index":"web-crawl-test-2"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"web-crawl-test-2","node":"vOvjAFReRA61cPaCDK1DXg","reason":{"type":"query_shard_exception","reason":"No mapping found for [last_crawled_at] in order to sort on","index_uuid":"L4UjVa_TTvmo9O0lbQnwiA","index":"web-crawl-test-2"}}]},"status":400}'. Retrying in 4.0s..
[2025-09-04T18:19:13.060Z] [crawl:68b9d81a7453b252bf8e2f94] [primary] Search attempt 3/4 failed: '[400] {"error":{"root_cause":[{"type":"query_shard_exception","reason":"No mapping found for [last_crawled_at] in order to sort on","index_uuid":"L4UjVa_TTvmo9O0lbQnwiA","index":"web-crawl-test-2"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"web-crawl-test-2","node":"vOvjAFReRA61cPaCDK1DXg","reason":{"type":"query_shard_exception","reason":"No mapping found for [last_crawled_at] in order to sort on","index_uuid":"L4UjVa_TTvmo9O0lbQnwiA","index":"web-crawl-test-2"}}]},"status":400}'. Retrying in 8.0s..
[2025-09-04T18:19:21.085Z] [crawl:68b9d81a7453b252bf8e2f94] [primary] Search failed after 4 attempts: '[400] {"error":{"root_cause":[{"type":"query_shard_exception","reason":"No mapping found for [last_crawled_at] in order to sort on","index_uuid":"L4UjVa_TTvmo9O0lbQnwiA","index":"web-crawl-test-2"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"web-crawl-test-2","node":"vOvjAFReRA61cPaCDK1DXg","reason":{"type":"query_shard_exception","reason":"No mapping found for [last_crawled_at] in order to sort on","index_uuid":"L4UjVa_TTvmo9O0lbQnwiA","index":"web-crawl-test-2"}}]},"status":400}'.
[2025-09-04T18:19:21.086Z] [crawl:68b9d81a7453b252bf8e2f94] [primary] Crawl Error: Unexpected error while running the crawl: Elastic::Transport::Transport::Errors::BadRequest: [400] {"error":{"root_cause":[{"type":"query_shard_exception","reason":"No mapping found for [last_crawled_at] in order to sort on","index_uuid":"L4UjVa_TTvmo9O0lbQnwiA","index":"web-crawl-test-2"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"web-crawl-test-2","node":"vOvjAFReRA61cPaCDK1DXg","reason":{"type":"query_shard_exception","reason":"No mapping found for [last_crawled_at] in order to sort on","index_uuid":"L4UjVa_TTvmo9O0lbQnwiA","index":"web-crawl-test-2"}}]},"status":400} /usr/local/bundle/gems/elastic-transport-8.3.2/lib/elastic/transport/transport/base.rb:228:in `__raise_transport_error'
/usr/local/bundle/gems/elastic-transport-8.3.2/lib/elastic/transport/transport/base.rb:346:in `perform_request'
/usr/local/bundle/gems/elastic-transport-8.3.2/lib/elastic/transport/transport/http/faraday.rb:36:in `perform_request'
/usr/local/bundle/gems/elastic-transport-8.3.2/lib/elastic/transport/client.rb:197:in `perform_request'
/usr/local/bundle/gems/elasticsearch-8.13.0/lib/elasticsearch.rb:71:in `method_missing'
/usr/local/bundle/gems/elasticsearch-api-8.13.0/lib/elasticsearch/api/actions/search.rb:105:in `search'
/home/app/lib/es/client.rb:79:in `block in paginated_search'
/home/app/lib/es/client.rb:237:in `execute_with_retry'
/home/app/lib/es/client.rb:78:in `block in paginated_search'
org/jruby/RubyKernel.java:1725:in `loop'
/home/app/lib/es/client.rb:77:in `paginated_search'
/home/app/lib/crawler/output_sink/elasticsearch.rb:123:in `fetch_purge_docs'
/home/app/lib/crawler/coordinator.rb:98:in `run_purge_crawl!'
/home/app/lib/crawler/coordinator.rb:70:in `run_crawl!'
/home/app/lib/crawler/api/crawl.rb:88:in `start!'
/home/app/lib/crawler/cli/crawl.rb:25:in `call'
/usr/local/bundle/gems/dry-cli-0.7.0/lib/dry/cli.rb:116:in `perform_registry'
/usr/local/bundle/gems/dry-cli-0.7.0/lib/dry/cli.rb:65:in `call'
bin/crawler:28:in `<main>'
[2025-09-04T18:19:21.087Z] [crawl:68b9d81a7453b252bf8e2f94] [primary] Finished a crawl. Result: failure; Unexpected error while running the crawl, check system logs for details
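
For context, the 400 above is what Elasticsearch returns when a search sorts on a field that has no mapping in the index, which is the case for a freshly created index that never received a document. One possible mitigation, sketched below, is to set `unmapped_type` on the sort clause so the search succeeds and returns no hits. The helper name `purge_search_body` and the `match_all` query are assumptions for illustration only; the crawler's actual purge query is not shown in this issue.

```ruby
# Hypothetical query builder; the real purge query in the crawler may differ.
def purge_search_body(size: 100)
  {
    size: size,
    query: { match_all: {} }, # placeholder for the real purge filter
    sort: [
      # unmapped_type tells Elasticsearch how to treat the field when an index
      # has no mapping for it, instead of failing the shard with a 400.
      { last_crawled_at: { order: 'asc', unmapped_type: 'date' } }
    ]
  }
end
```

With this body, a search against an empty index should come back with zero hits rather than a `query_shard_exception`.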

Expected behavior

We should be able to catch the 400 ("No mapping found for [last_crawled_at] in order to sort on") in the purge phase and short-circuit. There is no need to delete anything if the destination index is empty.
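
A minimal sketch of that short-circuit, under stated assumptions: `BadRequest` here is a local stand-in for `Elastic::Transport::Transport::Errors::BadRequest` (the class named in the trace above) so the snippet runs without the gem, and `purge_docs_or_empty` is a hypothetical helper, not the crawler's existing code. It treats the "no mapping" error as "nothing to purge" and re-raises every other bad request.

```ruby
# Stand-in for Elastic::Transport::Transport::Errors::BadRequest so the sketch
# runs without the elastic-transport gem installed.
class BadRequest < StandardError; end

# Message fragment taken from the error in the log above.
UNMAPPED_SORT = 'No mapping found for [last_crawled_at]'

# Hypothetical helper: runs the purge search passed as a block and returns []
# when the index has no mapping for the sort field (nothing was ever indexed).
def purge_docs_or_empty
  yield
rescue BadRequest => e
  # Any other 400 is a real problem and should still surface.
  raise unless e.message.include?(UNMAPPED_SORT)

  []
end
```

In this sketch the caller would wrap the existing search, for example `purge_docs_or_empty { client.search(...) }`, and skip the delete step when the result is empty. Matching on the message text is fragile, so this is only one option.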

Environment

Crawler 0.4.2, with Elasticsearch 9.2.0-SNAPSHOT
