
Do not create the ZIM when crawl is incomplete #444

Open
john8952 opened this issue Dec 16, 2024 · 10 comments

@john8952

I've been experimenting with crawling cdc.gov, and I find that some cdc.gov links are not captured by the crawler when running it against the whole site. However, if I run it against just the page with the missing links, it works as expected.

Here is my full site crawl command (parameters mostly stolen from zimfarm):
docker run --rm -v path/to/Downloads:/output ghcr.io/openzim/zimit zimit --custom-css=https://drive.farm.openzim.org/zimit_custom_css/www.cdc.gov.css --description="Information of US Centers for Disease Control and Prevention" --exclude="^https:\/\/(www\.cdc\.gov\/spanish\/|www\.cdc\.gov\/.*\/es\/|espanol\.cdc\.gov\/|www\.cdc\.gov\/about\/advisory-committee-director\/meetings-archive.html)" --name="www.cdc.gov_en_all" --title="US Center for Disease Control" --url=https://www.cdc.gov/ --zim-lang=eng --scopeType host --keep

It creates a 34 GB ZIM file. But, for example, on this rabies page all of the morbidity and mortality report links are external:
https://www.cdc.gov/rabies/php/protecting-public-health/index.html
[screenshot]

[screenshot]

None of the missing links are in the collection log file, but I do find them, among others, marked as "queued" in a crawl YAML file:

@john8952 (Author)

Sorry, I submitted this issue by accident before I was done writing. I haven't submitted a GitHub issue before, so sorry for the chunkiness :/
Continuing here:
[screenshot]

It seems as if the crawl just didn't finish, but I can't find any errors. If I limit the crawl to depth 1 on https://www.cdc.gov/rabies/php/protecting-public-health/index.html, everything loads as expected and no crawl YAML file is left behind.
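
For reference, a rough way to count and inspect the queued URLs left in that crawl YAML file (this is only a sketch: it assumes the queued entries embed the page URLs as plain text, and the file name below is a guess based on this crawl's timestamp; adjust the path as needed):

  # rough count of distinct cdc.gov URLs still in the state file
  grep -o 'https://www\.cdc\.gov/[^" ]*' crawl-20241213215143430.yaml | sort -u | wc -l
  # check whether the missing rabies report links are among them
  grep -o 'https://www\.cdc\.gov/[^" ]*' crawl-20241213215143430.yaml | grep rabies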

Final note:
The URL in the second screenshot is what I get when I manually edit the URL; normally it just goes to the external site. I also edited the URL incorrectly, so here is the proper one:
[screenshot]

@benoit74 (Collaborator)

What do the logs say when the crawler finished, just before warc2zim started? There should be a reason there explaining why it exited, and how many items were left in the queue (which would confirm what you found in the YAML file). You might want to share the whole log file so we can have a look.
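
For example, a quick way to pull just the exit reason out of a large crawler log without opening the whole file (the log file name below is a placeholder; adjust it to wherever --keep left the crawl files):

  tail -n 200 crawl.log | grep -iE 'crawl status|interrupt|queued'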

@john8952 (Author)

The log file is quite large, so I put it on Google Drive as a .txt so you can view it in the browser.
https://drive.google.com/drive/folders/1tQKAqK9EZMgiIwroPfiY8nuOOLPZQNaa?usp=sharing

crawl-20241213215143430_ending.log.txt has the end of the logs, since it's tough to scroll all the way down in Google Drive.

I missed this before, but I see it says "Exiting, Crawl status: interrupted". It's different from the end of the terminal output, which I also included in the Drive folder.

Perhaps my OS did something to kill the crawl...

@john8952 (Author)

Also, FYI, this is my third attempt.

@benoit74 (Collaborator)

The logs say quite a lot of problems occurred:

  • multiple "Direct fetch of page URL timed out" errors in a row
  • and finally "Browser disconnected (crashed?), interrupting crawl"

This means the container was still alive (nothing killed the container), but it looks like the browser running inside the container had issues. This happens; the Webrecorder team (which develops the crawler) regularly makes fixes, but there are obviously still edge cases. Unless you manage to get a good grasp of the conditions that lead to this crash (e.g. it always happens at the same point in the crawl, or on a given page, ...), it is really hard to know how to fix it. Zimit is still using crawler 1.3.0-beta.1; I will soon upgrade to the latest version they've published, which may fix the issue.
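
If it helps narrow things down, one rough way (untested sketch, log file name is a placeholder) to see which pages hit those errors and what was logged just before the crash, using the messages quoted above:

  grep 'Direct fetch of page URL timed out' crawl.log | tail -n 20
  grep -B 20 'Browser disconnected' crawl.log | tail -n 40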

@benoit74 (Collaborator)

What is true, though, is that it should probably not have created a ZIM but instead have failed the crawl, since the ZIM is incomplete anyway. This is a problem we have to fix.
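
Roughly speaking, the kind of guard this issue asks for could look like the sketch below (not zimit's actual code; it assumes the crawler logs a final "Crawl status: done" line on success, by analogy with the "Crawl status: interrupted" line seen here):

  if tail -n 50 crawl.log | grep -q 'Crawl status: done'; then
      echo "Crawl completed, proceeding with warc2zim"
      # warc2zim invocation would go here
  else
      echo "Crawl did not complete; not creating a ZIM" >&2
      exit 1
  fi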

@benoit74 benoit74 changed the title missing links/incomplete crawl Do not create the ZIM when crawl is incomplete Dec 17, 2024
@john8952 (Author)

Thank you! Now that you've pointed out those errors, I'll do a bit more testing to see if I can narrow down the cause.

@john8952 (Author)

crawl-20241217234541358.log

Looks like the issue is this nearly 9-hour-long YouTube video near the bottom of this page: https://www.cdc.gov/antimicrobial-resistance/programs/AR-investments.html

I ran the crawl again on a more limited scope and it definitely choked on it again. Log is attached.
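
As a possible interim workaround (untested), that page could be added to the --exclude pattern from my original command so the full-site crawl can finish while the crash itself is investigated:

  --exclude="^https:\/\/(www\.cdc\.gov\/spanish\/|www\.cdc\.gov\/.*\/es\/|espanol\.cdc\.gov\/|www\.cdc\.gov\/about\/advisory-committee-director\/meetings-archive.html|www\.cdc\.gov\/antimicrobial-resistance\/programs\/AR-investments\.html)"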

@benoit74 (Collaborator)

Thank you for finding the page causing the issue! I will try again once I've updated the crawler version, to be sure this is not already fixed before opening an upstream bug. Should be done in the coming days.

@john8952 (Author) commented Jan 4, 2025

end of log - mp4 interrupt.txt

I know this particular error is outside the scope of this issue, but I wanted to share this example of another interrupt, this time on a .mp4 (seemingly, at least). This one doesn't reproduce as easily as the previous example.
