A few questions about "Cloudflare check timed out" and "Link extraction timed out" #441

Open
hamoudak opened this issue Dec 5, 2024 · 4 comments

Comments

hamoudak commented Dec 5, 2024

When I crawl some websites I see those two messages throughout the crawl; is this something to worry about, given that "failed" stays at 0? I am crawling with 4 workers. For what it's worth, I tested one site that has these issues but where "resolve redirect" was OK, while another showed errors, with some links (pages) missing from the ZIM file path. If this is not normal, how should I deal with these issues?
Is this related to the "Resume failed browsertrix crawls" enhancement?
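
For context, a crawl with 4 workers would be launched with something along these lines (the URL and ZIM name here are placeholders, and exact flag names depend on the zimit version):

```
zimit --url https://example.com --name my-test --workers 4
```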

benoit74 (Collaborator) commented Dec 6, 2024

Please share some logs (redacting the hostname if needed) so that we can get a better grasp of what you're referring to.

hamoudak (Author) commented Dec 8, 2024

Sorry for the late reply; I was running crawls that forced me to wait. Here is a log file from a test. I have noticed that these messages show up when the crawler gets to a book with many pages, around 2500-3000 and above; it does not happen with books that have only a few pages.

crawl-20241208075540177.log

Here is another one: log-information.txt

It shows the same messages, but this time the crawl fails. I have also tried to crawl https://al-maktaba.org/book/31617 for testing, but the crawler can't fetch the URL as expected unless I use no extra workers at all. When I crawl with just 2 workers I get a message like "Unable to get new page, browser likely crashed" #400,

yet I have completed successful crawls of the same domain with 4 workers, with the same messages as above.
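
For reproduction, the run that works for me uses a single worker, roughly along these lines (the ZIM name is a placeholder; exact flags depend on the zimit version):

```
zimit --url https://al-maktaba.org/book/31617 --name al-maktaba-test --workers 1
```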

benoit74 (Collaborator) commented Dec 9, 2024

Thanks. All this tends to show that browsertrix crawler is a bit unstable in your scenarios. We will have to dig into it.

hamoudak (Author) commented

Have you found any solutions? Do I have to download the new version of zimit? I've tried to find the cause: when I browse these big books online, the pages load a little slowly, so the crawler behaves the same way and shows these messages. On some websites with heavy content I get these messages too, with 4 workers. I am simply trying to lower the number of workers for those sites, that's all.
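
For these slow sites, besides lowering the workers, a longer page timeout might also help; something along these lines (the timeout flag name differs between browsertrix-crawler/zimit versions, so treat this as a sketch, with placeholder URL and name):

```
zimit --url https://example.com --name slow-book-test --workers 2 --pageLoadTimeout 180
```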
