-
-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Set 'Referer' HTTP request header #2062
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #2062 +/- ##
==========================================
+ Coverage 74.38% 74.44% +0.06%
==========================================
Files 41 41
Lines 3146 3146
Branches 689 689
==========================================
+ Hits 2340 2342 +2
+ Misses 686 684 -2
Partials 120 120 ☔ View full report in Codecov by Sentry. |
@audiodude Thank you very much for this simple but very important fix. One thing, why bot just taking the Mediwiki URL given as argument to MWoffliner in place of this fake "http://localhost/"? To me this would more correct and probably more robust. |
Because the WMF software looks for specific values of the |
All WMF domains are part of the Regex look like, so why a hack like "localhost" could be better? |
It just seems better to put a "dummy" value that is obviously a hack, than a WMF domain which is potentially misleading. |
Not really convinced, but I guess you can argue that way. I believe we should anyway have a small test validating this(proper download of map) to secure next time we detect and can fix such kind of issue early. |
@audiodude Maybe we could just extend |
Added test in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
const mwResp = await axios(url, this.arrayBufferRequestOptions) | ||
// The 'Referer' header is set to get around WMF domain origin restrictions. | ||
// See: https://github.com/openzim/mwoffliner/issues/2061 | ||
const mwResp = await axios(url, { ...this.arrayBufferRequestOptions, headers: { Referer: 'https://localhost/' } }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use the actual site/page you are scraping as the referrer instead of this fiction? That would make your scraper function in the same way as any normal user-agent consuming the wiki content and allow you to avoid being seen as deliberately violating hot linking protections used on the Wikimedia content farm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bd808 Was pretty much my proposal, see comments above and responses.
Fixes #2061
Tested with Italian wikipedia download with article listed in bug.