elasticsearch.delay_on_retry bulk index attempt time unit mismatch (milliseconds vs seconds) #380

@jheym

Bug Description

When elasticsearch.delay_on_retry is set to 60 in the Elasticsearch config, the crawler waits the configured amount (60 seconds) after the first failed bulk index attempt, but after the second failed attempt the delay jumps all the way to 3600 seconds. I assume this is not intended. In addition, the config example seems to suggest that this value is in milliseconds, while the log messages report it in seconds.

[2025-08-29T20:38:33.142Z] [crawl:68b20a5b89397c058c857e29] [primary] Crawl status: queue_size=5457, pages_visited=1755, urls_allowed=7212, urls_denied={:already_seen=>646398, :incorrect_protocol=>2806, :domain_filter_denied=>9615}, crawl_duration_msec=1388248, crawling_time_msec=134416.0, avg_response_time_msec=76.59031339031338, active_threads=4, http_client={:max_connections=>100, :used_connections=>2}, status_codes={"404"=>7, "200"=>1451, "301"=>294, "302"=>3}
[2025-08-29T20:38:35.088Z] [crawl:68b20a5b89397c058c857e29] [primary] Sending bulk request with 7 items and resetting queue...
[2025-08-29T20:38:43.191Z] [crawl:68b20a5b89397c058c857e29] [primary] Crawl status: queue_size=5456, pages_visited=1757, urls_allowed=7212, urls_denied={:already_seen=>646398, :incorrect_protocol=>2806, :domain_filter_denied=>9615}, crawl_duration_msec=1398297, crawling_time_msec=134562.0, avg_response_time_msec=76.5862265224815, active_threads=4, http_client={:max_connections=>100, :used_connections=>2}, status_codes={"404"=>7, "200"=>1453, "301"=>294, "302"=>3}
[2025-08-29T20:38:46.425Z] [crawl:68b20a5b89397c058c857e29] [primary] Bulk index attempt 1/6 failed: 'Connection reset by peer - Connection reset by peer'. Retrying in 60.0s..
[2025-08-29T20:38:53.195Z] [crawl:68b20a5b89397c058c857e29] [primary] Crawl status: queue_size=5456, pages_visited=1757, urls_allowed=7212, urls_denied={:already_seen=>646398, :incorrect_protocol=>2806, :domain_filter_denied=>9615}, crawl_duration_msec=1408302, crawling_time_msec=134562.0, avg_response_time_msec=76.5862265224815, active_threads=4, http_client={:max_connections=>100, :used_connections=>2}, status_codes={"404"=>7, "200"=>1453, "301"=>294, "302"=>3}
[2025-08-29T20:39:03.201Z] [crawl:68b20a5b89397c058c857e29] [primary] Crawl status: queue_size=5456, pages_visited=1757, urls_allowed=7212, urls_denied={:already_seen=>646398, :incorrect_protocol=>2806, :domain_filter_denied=>9615}, crawl_duration_msec=1418306, crawling_time_msec=134562.0, avg_response_time_msec=76.5862265224815, active_threads=4, http_client={:max_connections=>100, :used_connections=>2}, status_codes={"404"=>7, "200"=>1453, "301"=>294, "302"=>3}
[2025-08-29T20:39:13.204Z] [crawl:68b20a5b89397c058c857e29] [primary] Crawl status: queue_size=5456, pages_visited=1757, urls_allowed=7212, urls_denied={:already_seen=>646398, :incorrect_protocol=>2806, :domain_filter_denied=>9615}, crawl_duration_msec=1428311, crawling_time_msec=134562.0, avg_response_time_msec=76.5862265224815, active_threads=4, http_client={:max_connections=>100, :used_connections=>2}, status_codes={"404"=>7, "200"=>1453, "301"=>294, "302"=>3}
[2025-08-29T20:39:23.208Z] [crawl:68b20a5b89397c058c857e29] [primary] Crawl status: queue_size=5456, pages_visited=1757, urls_allowed=7212, urls_denied={:already_seen=>646398, :incorrect_protocol=>2806, :domain_filter_denied=>9615}, crawl_duration_msec=1438315, crawling_time_msec=134562.0, avg_response_time_msec=76.5862265224815, active_threads=4, http_client={:max_connections=>100, :used_connections=>2}, status_codes={"404"=>7, "200"=>1453, "301"=>294, "302"=>3}
[2025-08-29T20:39:33.213Z] [crawl:68b20a5b89397c058c857e29] [primary] Crawl status: queue_size=5456, pages_visited=1757, urls_allowed=7212, urls_denied={:already_seen=>646398, :incorrect_protocol=>2806, :domain_filter_denied=>9615}, crawl_duration_msec=1448320, crawling_time_msec=134562.0, avg_response_time_msec=76.5862265224815, active_threads=4, http_client={:max_connections=>100, :used_connections=>2}, status_codes={"404"=>7, "200"=>1453, "301"=>294, "302"=>3}
[2025-08-29T20:39:43.221Z] [crawl:68b20a5b89397c058c857e29] [primary] Crawl status: queue_size=5456, pages_visited=1757, urls_allowed=7212, urls_denied={:already_seen=>646398, :incorrect_protocol=>2806, :domain_filter_denied=>9615}, crawl_duration_msec=1458328, crawling_time_msec=134562.0, avg_response_time_msec=76.5862265224815, active_threads=4, http_client={:max_connections=>100, :used_connections=>2}, status_codes={"404"=>7, "200"=>1453, "301"=>294, "302"=>3}
[2025-08-29T20:39:53.233Z] [crawl:68b20a5b89397c058c857e29] [primary] Crawl status: queue_size=5456, pages_visited=1757, urls_allowed=7212, urls_denied={:already_seen=>646398, :incorrect_protocol=>2806, :domain_filter_denied=>9615}, crawl_duration_msec=1468339, crawling_time_msec=134562.0, avg_response_time_msec=76.5862265224815, active_threads=4, http_client={:max_connections=>100, :used_connections=>2}, status_codes={"404"=>7, "200"=>1453, "301"=>294, "302"=>3}
[2025-08-29T20:39:56.747Z] [crawl:68b20a5b89397c058c857e29] [primary] Bulk index attempt 2/6 failed: 'Connection reset by peer - Connection reset by peer'. Retrying in 3600.0s..
[2025-08-29T20:40:03.236Z] [crawl:68b20a5b89397c058c857e29] [primary] Crawl status: queue_size=5456, pages_visited=1757, urls_allowed=7212, urls_denied={:already_seen=>646398, :incorrect_protocol=>2806, :domain_filter_denied=>9615}, crawl_duration_msec=1478344, crawling_time_msec=134562.0, avg_response_time_msec=76.5862265224815, active_threads=4, http_client={:max_connections=>100, :used_connections=>2}, status_codes={"404"=>7, "200"=>1453, "301"=>294, "302"=>3}
[2025-08-29T20:40:13.240Z] [crawl:68b20a5b89397c058c857e29] [primary] Crawl status: queue_size=5456, pages_visited=1757, urls_allowed=7212, urls_denied={:already_seen=>646398, :incorrect_protocol=>2806, :domain_filter_denied=>9615}, crawl_duration_msec=1488347, crawling_time_msec=134562.0, avg_response_time_msec=76.5862265224815, active_threads=4, http_client={:max_connections=>100, :used_connections=>2}, status_codes={"404"=>7, "200"=>1453, "301"=>294, "302"=>3}
[2025-08-29T20:40:23.243Z] [crawl:68b20a5b89397c058c857e29] [primary] Crawl status: queue_size=5456, pages_visited=1757, urls_allowed=7212, urls_denied={:already_seen=>646398, :incorrect_protocol=>2806, :domain_filter_denied=>9615}, crawl_duration_msec=1498351, crawling_time_msec=134562.0, avg_response_time_msec=76.5862265224815, active_threads=4, http_client={:max_connections=>100, :used_connections=>2}, status_codes={"404"=>7, "200"=>1453, "301"=>294, "302"=>3}
[2025-08-29T20:40:33.247Z] [crawl:68b20a5b89397c058c857e29] [primary] Crawl status: queue_size=5456, pages_visited=1757, urls_allowed=7212, urls_denied={:already_seen=>646398, :incorrect_protocol=>2806, :domain_filter_denied=>9615}, crawl_duration_msec=1508354, crawling_time_msec=134562.0, avg_response_time_msec=76.5862265224815, active_threads=4, http_client={:max_connections=>100, :used_connections=>2}, status_codes={"404"=>7, "200"=>1453, "301"=>294, "302"=>3}
[2025-08-29T20:40:43.250Z] [crawl:68b20a5b89397c058c857e29] [primary] Crawl status: queue_size=5456, pages_visited=1757, urls_allowed=7212, urls_denied={:already_seen=>646398, :incorrect_protocol=>2806, :domain_filter_denied=>9615}, crawl_duration_msec=1518357, crawling_time_msec=134562.0, avg_response_time_msec=76.5862265224815, active_threads=4, http_client={:max_connections=>100, :used_connections=>2}, status_codes={"404"=>7, "200"=>1453, "301"=>294, "302"=>3}

Expected behavior

With delay_on_retry set to 60, I expect the crawler to wait 60 seconds after the first failure. If that retry also fails, the delay should double to 120 seconds, then 240, and so on, up to the number of retries specified by elasticsearch.retry_on_failure.
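
To make the difference concrete, here is a minimal sketch in Python (not the crawler's actual code) of the doubling schedule described above, next to one formula that would reproduce the logged values. The function names are hypothetical, and the "base raised to the attempt number" formula is only a guess that happens to match 60.0s followed by 3600.0s in the log; it has not been checked against the crawler source.

```python
def expected_delays(base_delay, retries):
    """Doubling backoff as described in this issue: base, 2*base, 4*base, ..."""
    return [base_delay * 2 ** (attempt - 1) for attempt in range(1, retries + 1)]


def power_delays(base_delay, retries):
    """ASSUMPTION: base_delay ** attempt. This matches the two logged delays
    (60 and 3600) but is a guess, not the confirmed implementation."""
    return [base_delay ** attempt for attempt in range(1, retries + 1)]


if __name__ == "__main__":
    print("expected:", expected_delays(60, 3))
    print("guess matching the log:", power_delays(60, 2))
```

If the guess is right, a base of 60 would give a third delay of 216000 seconds, which suggests the formula was written with a small base such as 2 in mind.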
