How to bypass aggressive forced CAPTCHA on a dynamic academic repository (CUHK)? #329
Hi everyone,
I am building an automated scraper for an academic repository (https://repository.lib.cuhk.edu.hk/en/collection/etd/year), which organizes resources by year. However, I've hit a hard block.
Here is the situation:
- Aggressive CAPTCHA: The site uses a mandatory CAPTCHA (image verification) system that triggers for almost every visit, regardless of whether it's a real human browser or a bot.
- AI/Agent Failure: Our automated agents attempted to solve the image verification multiple times but were consistently blocked.
- JS-Heavy: The site relies heavily on JavaScript to load the actual item lists, making simple HTTP requests (requests/curl) useless.
My Questions:
- What is the state-of-the-art stack for bypassing such aggressive image CAPTCHAs in a fully automated Python pipeline today? (e.g., combining undetected-chromedriver with a specific solver service?)
- Since this is an academic repository, it likely supports OAI-PMH. Does anyone have experience bypassing the frontend completely by finding standard API endpoints (like /oai/request) on similar library systems?
- Any specific configuration recommendations for browser automation to reduce the CAPTCHA difficulty/frequency on this specific type of site?
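To make the OAI-PMH question concrete, here is a minimal sketch of the kind of harvester I have in mind. It uses only the standard OAI-PMH 2.0 `ListRecords` verb with the `oai_dc` metadata prefix and follows `resumptionToken` paging. Whether this repository exposes an OAI-PMH endpoint at all is unverified, so the base URL passed to `harvest` is an assumption, and the helper names (`build_url`, `parse_page`, `harvest`) are just illustrative.

```python
import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Standard namespaces for OAI-PMH 2.0 responses and Dublin Core metadata.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def build_url(base_url, resumption_token=None, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL.

    Per the protocol, resumptionToken is an exclusive argument: when it is
    present, metadataPrefix must not be sent again.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urllib.parse.urlencode(params)


def parse_page(xml_text):
    """Return ([(identifier, title), ...], next_token_or_None) for one page."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        ident = rec.findtext("oai:header/oai:identifier", default="", namespaces=NS)
        # Deleted records carry no metadata, so the title may be empty.
        title = rec.findtext(".//dc:title", default="", namespaces=NS)
        records.append((ident, title))
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    token = token.strip()
    return records, (token or None)


def harvest(base_url, delay=1.0):
    """Yield (identifier, title) pairs page by page, pausing between requests."""
    token = None
    while True:
        url = build_url(base_url, token)
        with urllib.request.urlopen(url, timeout=30) as resp:
            records, token = parse_page(resp.read())
        yield from records
        if not token:
            break
        time.sleep(delay)  # stay polite to the server
```

Usage would look like `for ident, title in harvest(base_url): ...`, where `base_url` is whatever endpoint the repository actually documents (the `/oai/request` path above is only a guess based on other library systems). Sending `?verb=Identify` to a candidate URL first is a quick way to check whether it speaks OAI-PMH at all.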
Thanks in advance for any insights!