
Do not create the ZIM when crawl is incomplete #444

Open
john8952 opened this issue Dec 16, 2024 · 10 comments

@john8952

I've been experimenting with crawling cdc.gov, and I find that some cdc.gov links are not captured by the crawler when running it against the whole site. However, if I run it against just the page with the missing links, it works as expected.

Here is my full site crawl command (parameters mostly stolen from zimfarm):
docker run --rm -v path/to/Downloads:/output ghcr.io/openzim/zimit zimit --custom-css=https://drive.farm.openzim.org/zimit_custom_css/www.cdc.gov.css --description="Information of US Centers for Disease Control and Prevention" --exclude="^https:\/\/(www\.cdc\.gov\/spanish\/|www\.cdc\.gov\/.*\/es\/|espanol\.cdc\.gov\/|www\.cdc\.gov\/about\/advisory-committee-director\/meetings-archive.html)" --name="www.cdc.gov_en_all" --title="US Center for Disease Control" --url=https://www.cdc.gov/ --zim-lang=eng --scopeType host --keep

It creates a 34 GB ZIM file. But, for example, on this rabies page all of the morbidity and mortality report links are external:
https://www.cdc.gov/rabies/php/protecting-public-health/index.html
[screenshot]

[screenshot]

None of the missing links are in the collection log file, but I do find them, among others, marked as "queued" in a crawl YAML file:

@john8952 (Author)

Sorry, I submitted this issue by accident before I was done writing. I haven't submitted a GitHub issue before, so sorry for the chunkiness :/
Continuing here:
[screenshot]

It seems as if the crawl just didn't finish, but I can't find any errors. If I limit the crawl to depth 1 on https://www.cdc.gov/rabies/php/protecting-public-health/index.html, everything loads as expected and no crawl YAML file is left behind.
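
For reference, a rough way to count and inspect the queued URLs left in that crawl YAML file (this is only a sketch: it assumes the queued entries embed the page URLs as plain text, and the file name below is a guess based on this crawl's timestamp; adjust the path as needed):

  # rough count of distinct cdc.gov URLs still in the state file
  grep -o 'https://www\.cdc\.gov/[^" ]*' crawl-20241213215143430.yaml | sort -u | wc -l
  # check whether the missing rabies report links are among them
  grep -o 'https://www\.cdc\.gov/[^" ]*' crawl-20241213215143430.yaml | grep rabies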

Final note:
The URL in the second screenshot is what I get when I manually edit the URL; normally it just goes to the external site. I also edited the URL incorrectly, so here is the proper one:
[screenshot]

@benoit74 (Collaborator)

What do the logs say when the crawler finished, just before warc2zim started? There should be a reason there explaining why it exited, and how many items were left in the queue (which would confirm what you found in the YAML file). You might want to share the whole log file so we can have a look.
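
For example, a quick way to pull just the exit reason out of a large crawler log without opening the whole file (the log file name below is a placeholder; adjust it to wherever --keep left the crawl files):

  tail -n 200 crawl.log | grep -iE 'crawl status|interrupt|queued'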

@john8952 (Author)

The log file is quite large, so I put it on Google Drive as a .txt so you can view it in the browser.
https://drive.google.com/drive/folders/1tQKAqK9EZMgiIwroPfiY8nuOOLPZQNaa?usp=sharing

crawl-20241213215143430_ending.log.txt has the end of the logs, since it's tough to scroll all the way down in Google Drive.

I missed this before, but I see it says "Exiting, Crawl status: interrupted". It's different from the end of the terminal output, which I also included in the Drive folder.

Perhaps my OS did something to kill the crawl...

@john8952 (Author)

Also, FYI, this is my third attempt.

@benoit74 (Collaborator)

The logs say quite a lot of problems occurred:

  • multiple "Direct fetch of page URL timed out" errors in a row
  • and finally "Browser disconnected (crashed?), interrupting crawl"

This means the container was still alive (nothing killed the container), but it looks like the browser running inside the container had issues. This happens; the Webrecorder team (which develops the crawler) regularly makes fixes, but there are obviously still edge cases. Unless you manage to get a good grasp of the conditions that lead to this crash (e.g. it always happens at the same point in the crawl, or on a given page, ...), it is really hard to know how to fix it. Zimit is still using crawler 1.3.0-beta.1; I will soon upgrade to the latest version they've published, which may fix the issue.
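
If it helps narrow things down, one rough way (untested sketch, log file name is a placeholder) to see which pages hit those errors and what was logged just before the crash, using the messages quoted above:

  grep 'Direct fetch of page URL timed out' crawl.log | tail -n 20
  grep -B 20 'Browser disconnected' crawl.log | tail -n 40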

@benoit74 (Collaborator)

What is true, though, is that it should probably not have created a ZIM but instead have failed the crawl, since the ZIM is incomplete anyway. This is a problem we have to fix.
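
Roughly speaking, the kind of guard this issue asks for could look like the sketch below (not zimit's actual code; it assumes the crawler logs a final "Crawl status: done" line on success, by analogy with the "Crawl status: interrupted" line seen here):

  if tail -n 50 crawl.log | grep -q 'Crawl status: done'; then
      echo "Crawl completed, proceeding with warc2zim"
      # warc2zim invocation would go here
  else
      echo "Crawl did not complete; not creating a ZIM" >&2
      exit 1
  fi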

@benoit74 benoit74 changed the title missing links/incomplete crawl Do not create the ZIM when crawl is incomplete Dec 17, 2024
@john8952 (Author)

Thank you! Now that you've pointed out those errors, I'll do a bit more testing to see if I can narrow down the cause.

@john8952 (Author)

crawl-20241217234541358.log

Looks like the issue is this nearly 9-hour-long YouTube video near the bottom of this page: https://www.cdc.gov/antimicrobial-resistance/programs/AR-investments.html

I ran the crawl again on a more limited scope and it definitely choked on it again. Log is attached.
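
As a possible interim workaround (untested), that page could be added to the --exclude pattern from my original command so the full-site crawl can finish while the crash itself is investigated:

  --exclude="^https:\/\/(www\.cdc\.gov\/spanish\/|www\.cdc\.gov\/.*\/es\/|espanol\.cdc\.gov\/|www\.cdc\.gov\/about\/advisory-committee-director\/meetings-archive.html|www\.cdc\.gov\/antimicrobial-resistance\/programs\/AR-investments\.html)"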

@benoit74 (Collaborator)

Thank you for finding the page causing the issue! I will try again once I've updated the crawler version, to be sure this is not already fixed before opening an upstream bug. Should be done in the coming days.

@john8952 (Author) commented Jan 4, 2025

end of log - mp4 interrupt.txt

I know this particular error is outside the scope of this issue, but I wanted to share this example of another interrupt, this time on a .mp4 (seemingly, at least). This one doesn't reproduce as easily as the previous example.
