[Fix site]: WSJ gets a captcha since January; no links #486

jeremybmerrill · 2024-08-03T20:00:48Z

Screenshot

Screenshot via https://palewi.re/docs/news-homepages/sites/wsj.html. AFAIK this has been going on since 2024-01-16 16:11:00 (based on the last non-empty links JSON). I wonder if this is something that could be rectified (and backfilled) by scraping Internet Archive screenshots, or by asking WSJ to allowlist you. I also wonder if the captcha is part of WSJ's anti-AI-training-data-scraping efforts.

Have you circumvented captchas on other sites?

A solution is often found by adding JavaScript or CSS via a site-specific include, as covered in our documentation.

retaining this boilerplate from the template :)

palewire · 2024-08-03T22:52:45Z

I'm aware of this bug, but I haven't even begun to think about how to fix it. I'm open to patches, and to connections with people at WSJ I could consult.

jeremybmerrill · 2024-08-04T00:30:59Z

Fair enough. Curious why you're gathering the pages yourself, rather than getting them from the Internet Archive (which appears to have circumvented the WSJ's limitations).

palewire · 2024-08-04T11:13:40Z

It's probably a longer story than anyone wants to hear. I launched the site in 2012 independent of archive.org as a self-hosted service funded by a Kickstarter campaign. At that time the Wayback Machine was not archiving the homepages of major sites with much frequency. There have been several evolutions since, with the current one hosting assets for free with IA's generous "collections" system.

It would be possible to re-engineer the site to act as a supplement to Wayback's page captures. And perhaps I should move towards such a system. In the 12 years since I started, IA has ramped up its capture rate for big sites, though that's not always the case for lower-traffic sites.
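For anyone weighing the Wayback idea, here is a minimal sketch of the lookup step, assuming the Wayback Machine's public CDX search endpoint is used to list captures of a homepage over the affected date range. The function names (`build_cdx_url`, `snapshot_urls`, `fetch_snapshots`) are hypothetical and not part of this project, and nothing here checks whether a given capture shows the real homepage rather than a captcha.

```python
import json
from datetime import datetime
from urllib.parse import urlencode
from urllib.request import urlopen

# Wayback Machine CDX search endpoint (lists captures for a URL).
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(site, start, end, limit=50):
    """Build a CDX query for successful captures of `site` between two datetimes."""
    params = {
        "url": site,
        "from": start.strftime("%Y%m%d%H%M%S"),
        "to": end.strftime("%Y%m%d%H%M%S"),
        "output": "json",
        "filter": "statuscode:200",   # skip redirects and errors
        "collapse": "timestamp:8",    # at most one capture per calendar day
        "limit": limit,
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def snapshot_urls(rows):
    """Turn decoded CDX JSON (header row, then captures) into replay URLs."""
    if not rows:
        return []
    header, *captures = rows
    ts = header.index("timestamp")
    original = header.index("original")
    return [f"https://web.archive.org/web/{row[ts]}/{row[original]}" for row in captures]


def fetch_snapshots(site, start, end):
    """Query the CDX endpoint over the network and return replay URLs."""
    with urlopen(build_cdx_url(site, start, end), timeout=30) as resp:
        return snapshot_urls(json.load(resp))
```

Calling `fetch_snapshots("wsj.com", datetime(2024, 1, 16), datetime(2024, 8, 3))` would return one replay URL per day in the gap, which a backfill job could then screenshot and parse for links. Whether those captures are usable would still need to be checked by hand.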

jeremybmerrill · 2024-08-04T16:40:57Z

No, that's very useful context, thank you! It feels like the idea I'd be most open to implementing, scraping IA, would be a larger re-engineering of the project that I can't really take on.

jeremybmerrill · 2024-08-15T01:30:31Z

Just flagging that Reuters has had the same issue since 2023-12-04 03:59:00.

jeremybmerrill added the bug, good first issue, and help wanted labels · Aug 3, 2024
