A few questions about "Cloudflare check timed out" and "Link extraction timed out" #441

Open
hamoudak opened this issue Dec 5, 2024 · 4 comments

Comments

hamoudak commented Dec 5, 2024

When I crawl some websites I see those two messages throughout the crawl; is this something to worry about, given that "failed" stays at 0? I am crawling with 4 workers. For what it's worth, I tested one site that has these issues but where "resolve redirect" was OK, while another showed errors, with some links (pages) missing from the ZIM file path. If this is not normal, how should I deal with these issues?
Is this related to the "Resume failed browsertrix crawls" enhancement?
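
For context, a crawl with 4 workers would be launched with something along these lines (the URL and ZIM name here are placeholders, and exact flag names depend on the zimit version):

```
zimit --url https://example.com --name my-test --workers 4
```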

benoit74 (Collaborator) commented Dec 6, 2024

Please share some logs (redacting the hostname if needed) so that we can get a better grasp of what you're referring to.

hamoudak (Author) commented Dec 8, 2024

Sorry for the late reply; I was running crawls that forced me to wait. Here is a log file from a test. I have noticed that these messages show up when the crawler gets to a book with many pages, around 2500-3000 and above; it does not happen with books that have only a few pages.

crawl-20241208075540177.log

Here is another one: log-information.txt

It shows the same messages, but this time the crawl fails. I have also tried to crawl https://al-maktaba.org/book/31617 for testing, but the crawler can't fetch the URL as expected unless I use no extra workers at all. When I crawl with just 2 workers I get a message like "Unable to get new page, browser likely crashed" #400,

yet I have completed successful crawls of the same domain with 4 workers, with the same messages as above.
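
For reproduction, the run that works for me uses a single worker, roughly along these lines (the ZIM name is a placeholder; exact flags depend on the zimit version):

```
zimit --url https://al-maktaba.org/book/31617 --name al-maktaba-test --workers 1
```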

benoit74 (Collaborator) commented Dec 9, 2024

Thanks. All this tends to show that browsertrix crawler is a bit unstable in your scenarios. We will have to dig into it.

hamoudak (Author) commented

Have you found any solutions? Do I have to download the new version of zimit? I've tried to find the cause: when I browse these big books online, the pages load a little slowly, so the crawler behaves the same way and shows these messages. On some websites with heavy content I get these messages too, with 4 workers. I am simply trying to lower the number of workers for those sites, that's all.
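
For these slow sites, besides lowering the workers, a longer page timeout might also help; something along these lines (the timeout flag name differs between browsertrix-crawler/zimit versions, so treat this as a sketch, with placeholder URL and name):

```
zimit --url https://example.com --name slow-book-test --workers 2 --pageLoadTimeout 180
```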
