Consider "new" crawler CLI arguments #433
I think we should be cautious about changing browsertrix defaults in zimit. Most of the time, a default doesn't matter in zimit itself and only matters in our own use of zimit, hence the change should be made in the zimit conf and/or the zimfarm offliner. That's the case for On
I really like the idea of keeping strict compatibility / transparency with browsertrix crawler. I was misled by the fact that We need to also expose these arguments which I missed at first look:
Why don't you add this one: `--combineWARC`? It would be useful because most crawling tasks use `--keep`.
In zimit, combining WARCs is just a significant waste of computing resources and storage, since it means all records have to be parsed and transferred to a new "combined" file. The combined file is useful only if you want to transport the WARC as one single file, which is not our use case. Most crawling tasks use I'm not against exposing this parameter, but I don't see what the usage would be for us or for an end user.
I didn't know much about this, but in my case, when I open for example the first of the two WARCs in the archive folder with the ReplayWeb.page desktop app, it doesn't display the whole content; it needs the other one as well. So I searched and found this argument in browsertrix, and thought I would ask you how to deal with this.

Ilya says: A quick solution is to combine all WARC files into one, which can be done via command-line, for example: but I had no idea how to do this on the Windows command line.

Edit: I have used the `type` command and I got a working combined file. Could you add this parameter for
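For readers on platforms without a handy shell command, the byte-wise concatenation described above can be sketched in Python. This is only an illustration, not part of zimit or browsertrix crawler: `combine_warcs` is a hypothetical helper name, and the sketch assumes the input files are all in the same form (all gzip-compressed or all uncompressed), so that appending them yields a readable sequence of WARC records.

```python
import shutil
from pathlib import Path


def combine_warcs(sources, destination):
    """Append each source file to `destination`, byte for byte.

    Hypothetical helper: it does no WARC parsing or validation and
    simply assumes that all inputs share the same compression.
    """
    destination = Path(destination)
    with destination.open("wb") as out:
        for source in sources:
            with Path(source).open("rb") as src:
                # Stream in chunks so large WARCs are not loaded into memory.
                shutil.copyfileobj(src, out)
    return destination
```

A call such as `combine_warcs(sorted(Path("archive").glob("*.warc.gz")), "combined.warc.gz")` would then produce a single file; the `archive` folder name here is an assumption taken from the comment above.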
There is one important thing that I think needs some attention and should be implemented according to this proposal: webrecorder/browsertrix-crawler#132
We have some "new" (some are a few months old ...) CLI arguments of browsertrix crawler to consider:
- For seed URLs, I propose to use `--seedFile`, and (if not already the case) to support a URL from which to fetch this file (preferably to be done in browsertrix crawler directly).
- For `--failOnFailedLimit` and `--failOnInvalidStatus`, I think we should expose these two arguments and change their default values: `100` for `--failOnFailedLimit` and `true` for `--failOnInvalidStatus`. Both would be sensible defaults to warn the user that something bad is happening and that they should confirm they want to continue. If we agree on this, and since having a `true` default on a boolean flag prevents unsetting it, we should expose `--doNotFailOnInvalidStatus` at zimit level, instead of `--failOnInvalidStatus`.
- For sitemap arguments, I propose to use `--sitemapFromDate` and `--sitemapToDate` for clarity (plus they are the real names used; the variant is an alias in the browsertrix crawler codebase).
- For `--selectLinks`, we need to expose this CLI argument and (contrary to what I said on Monday) modify the default value to `a[href]->href,area[href]->href` (users would probably expect us to also explore these pages in most cases, and should it cause a problem, one can customize it by setting the CLI argument).
- For `--postLoadDelay`, nothing special, just add it.

@rgaudin @kelson42 any thoughts?
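To illustrate the `--doNotFailOnInvalidStatus` idea above, here is a minimal sketch of how an inverted flag could be mapped back to the crawler's own argument. It is not zimit's actual code: the `build_parser` and `crawler_args` names are hypothetical, and it assumes that `--failOnInvalidStatus` is passed to the crawler as a bare boolean flag.

```python
import argparse


def build_parser():
    # Hypothetical zimit-level parser carrying the proposed defaults.
    parser = argparse.ArgumentParser(prog="zimit-sketch")
    parser.add_argument("--failOnFailedLimit", type=int, default=100)
    # Inverted flag: absent means "fail on invalid status" (the proposed default).
    parser.add_argument("--doNotFailOnInvalidStatus", action="store_true")
    return parser


def crawler_args(argv):
    """Translate zimit-level options into arguments for the crawler."""
    opts = build_parser().parse_args(argv)
    args = ["--failOnFailedLimit", str(opts.failOnFailedLimit)]
    if not opts.doNotFailOnInvalidStatus:
        # Assumption: the crawler takes this as a flag without a value.
        args.append("--failOnInvalidStatus")
    return args
```

With no options given, `crawler_args([])` forwards both proposed defaults; passing `--doNotFailOnInvalidStatus` is the only way to drop the crawler flag, which is the point of the inversion.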