[Fix site]: WSJ gets a captcha since January; no links #486

Open
jeremybmerrill opened this issue Aug 3, 2024 · 5 comments
Labels
bug (Something isn't working), good first issue (Good for newcomers), help wanted (Extra attention is needed)

Comments

@jeremybmerrill
Contributor

[Screenshot: wsj]

Screenshot via https://palewi.re/docs/news-homepages/sites/wsj.html. AFAIK this has been going on since 2024-01-16 16:11:00 (based on the last non-empty links JSON). I wonder if this is something that could be rectified (and backfilled) by scraping Internet Archive screenshots, or by asking WSJ to allowlist you. I also wonder if the captcha is part of WSJ's anti-AI-training-data-scraping efforts.

Have you circumvented captchas in other sites?

A solution is often found by adding JavaScript or CSS via a site-specific include, as covered in our documentation.

retaining this boilerplate from the template :)

@palewire
Owner

palewire commented Aug 3, 2024

I'm aware of this bug, but I haven't even begun to think about how to fix it. I'm open to patches, and to connections with people at WSJ I could consult.

@jeremybmerrill
Contributor Author

Fair enough. Curious why you're gathering the pages yourself, rather than getting them from the Internet Archive (which appears to have circumvented the WSJ's limitations.)

@palewire
Owner

palewire commented Aug 4, 2024

It's probably a longer story than anyone wants to hear. I launched the site in 2012 independent of archive.org as a self-hosted service funded by a Kickstarter campaign. At that time the Wayback Machine was not archiving the homepages of major sites with much frequency. There have been several evolutions since, with the current one hosting assets for free with IA's generous "collections" system.

It would be possible to re-engineer the site to act as a supplement to Wayback's page captures, and perhaps I should move towards such a system. In the 12 years since I started, IA has ramped up its capture rate for big sites, though that's not always the case for lower-trafficked sites.
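
For anyone exploring the backfill idea, here is a minimal sketch of how one might list Wayback captures of a site for a date range. It assumes the Wayback Machine's public CDX search endpoint and its JSON output, where the first row is a header. The function names (`build_cdx_url`, `parse_cdx_rows`, `snapshot_url`, `fetch_captures`) are hypothetical and are not part of this project. Whether the returned captures are actually captcha-free for WSJ is untested here.

```python
from __future__ import annotations

import json
from datetime import datetime
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the Wayback Machine's public CDX search endpoint.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(site: str, start: datetime, end: datetime) -> str:
    """Build a CDX query for successful captures of `site` between two dates."""
    params = [
        ("url", site),
        ("from", start.strftime("%Y%m%d%H%M%S")),
        ("to", end.strftime("%Y%m%d%H%M%S")),
        ("output", "json"),
        ("filter", "statuscode:200"),  # skip redirects and errors
        ("collapse", "timestamp:8"),  # at most one capture per day
    ]
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_rows(rows: list[list[str]]) -> list[dict[str, str]]:
    """Turn CDX JSON output (header row first) into a list of dicts."""
    if not rows:
        return []
    header, *captures = rows
    return [dict(zip(header, capture)) for capture in captures]


def snapshot_url(capture: dict[str, str]) -> str:
    """Return the Wayback replay URL for one parsed capture."""
    return f"https://web.archive.org/web/{capture['timestamp']}/{capture['original']}"


def fetch_captures(site: str, start: datetime, end: datetime) -> list[dict[str, str]]:
    """Query the CDX endpoint over the network and parse the result."""
    with urlopen(build_cdx_url(site, start, end), timeout=30) as response:
        return parse_cdx_rows(json.load(response))
```

Something like `fetch_captures("wsj.com", datetime(2024, 1, 16, 16, 11), datetime(2024, 8, 3))` would cover the gap described in this issue. Each replay URL would still need to be screenshotted and parsed for links, which is the larger re-engineering step.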

@jeremybmerrill
Contributor Author

No, that's very useful context, thank you! It feels like the idea I'd be most open to implementing, scraping IA, would be a larger re-engineering of the project than I can really take on.

@jeremybmerrill
Contributor Author

jeremybmerrill commented Aug 15, 2024

Just flagging that Reuters has had the same issue since 2023-12-04 03:59:00.
