Do not create the ZIM when crawl is incomplete #444
Sorry, I submitted this issue by accident before I was done writing. I haven't submitted a GitHub issue before, so apologies for the chunkiness :/ It seems as if the crawl just didn't finish, but I can't find any errors. If I limit the crawl to depth 1 on https://www.cdc.gov/rabies/php/protecting-public-health/index.html, everything loads as expected and no crawl YAML file exists. Final note:
What are the logs when the crawler finished, just before starting warc2zim? There should be a reason given there for why it is exiting, and how many items are left in the queue (which would confirm what you found in the YAML file). You might want to share the whole log file for us to have a look.
The log file is quite large, so I put it in Google Drive as a .txt so you can view it in the browser. crawl-20241213215143430_ending.log.txt has the end of the logs, since it's tough to scroll all the way down in Google Drive. I missed this before, but I see it says "Exiting, Crawl status: interrupted". It's different from the end of the terminal output, which I also included in the Drive folder. Perhaps my OS did something to kill the crawl...
Also, FYI, this is my third attempt.
The logs show that quite a lot of problems occurred:
This means the container was still alive (nothing killed the container), but it looks like the browser running inside the container had issues. This happens; the Webrecorder team (which develops the crawler) regularly makes fixes, but there are obviously still edge cases. Unless you manage to get a good grasp of the conditions that lead to this crash (e.g. it always happens at the same moment in the crawl, or on a given page, ...), it is really hard to know how to fix it. Zimit is still using crawler 1.3.0-beta.1; I will soon upgrade to the latest version they've published, and maybe that will fix the issue.
What is true, though, is that it should probably not have created a ZIM but instead have failed the crawl, since the ZIM is incomplete anyway. This is a problem we have to fix.
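For illustration, the behaviour being requested amounts to gating ZIM creation on the crawler's exit status. A minimal shell sketch, using illustrative command names, flags, and paths rather than zimit's actual invocations:

# Illustrative only: refuse to build a ZIM when the crawl did not finish cleanly.
if crawl --url "https://www.cdc.gov/" --scopeType host; then
    warc2zim --name www.cdc.gov_en_all --output /output path/to/*.warc.gz
else
    echo "Crawl interrupted or incomplete; not running warc2zim" >&2
    exit 1
fi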
Thank you! Now that you've pointed out those errors, I'll do a bit more testing to see if I can narrow down the cause.
Looks like the issue is this nearly 9-hour-long YouTube video near the bottom of this page: https://www.cdc.gov/antimicrobial-resistance/programs/AR-investments.html I ran the crawl again on a more limited scope and it definitely choked on it again. Log is attached.
Thank you for finding the page causing the issue! I will try again once I've updated the crawler version, to be sure this is not already fixed, before opening an upstream bug. Should be done in the coming days.
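Until that upgrade lands, one possible stopgap, assuming it is acceptable to skip the offending page entirely (its content would then be missing from the ZIM), is to extend the --exclude pattern already used for this crawl so the crawler never visits it, for example:

--exclude="^https:\/\/(www\.cdc\.gov\/spanish\/|www\.cdc\.gov\/.*\/es\/|espanol\.cdc\.gov\/|www\.cdc\.gov\/about\/advisory-committee-director\/meetings-archive.html|www\.cdc\.gov\/antimicrobial-resistance\/programs\/AR-investments\.html)"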
end of log - mp4 interrupt.txt
I know this particular error is probably outside the scope of this issue, but I wanted to share this example of another interrupt, this time (seemingly, at least to me) on an .mp4. This one doesn't reproduce as easily as the previous example.
I've been experimenting with crawling cdc.gov, and I find that some cdc.gov links are not captured by the crawler when running it against the whole site, whereas if I run it against the particular page with the missing links, it works as expected.
Here is my full-site crawl command (parameters mostly stolen from Zimfarm):
docker run --rm -v path/to/Downloads:/output ghcr.io/openzim/zimit zimit \
  --custom-css=https://drive.farm.openzim.org/zimit_custom_css/www.cdc.gov.css \
  --description="Information of US Centers for Disease Control and Prevention" \
  --exclude="^https:\/\/(www\.cdc\.gov\/spanish\/|www\.cdc\.gov\/.*\/es\/|espanol\.cdc\.gov\/|www\.cdc\.gov\/about\/advisory-committee-director\/meetings-archive.html)" \
  --name="www.cdc.gov_en_all" \
  --title="US Center for Disease Control" \
  --url=https://www.cdc.gov/ \
  --zim-lang=eng \
  --scopeType host \
  --keep
It creates a 34 GB ZIM file. But, for example, on this rabies page, all of the morbidity and mortality report links are external:
https://www.cdc.gov/rabies/php/protecting-public-health/index.html
None of the missing links are in the collection log file, but I do find them, among others, in a crawl YAML file as "queued":
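With --keep, the crawler's working files remain on disk, so a quick way to confirm which of the missing links are still sitting in the queue, assuming the saved-state YAML lists the URLs as plain text and using a placeholder path, is something like:

grep -n "rabies/php/protecting-public-health" path/to/crawl-*.yaml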