How to bypass aggressive forced CAPTCHA on a dynamic academic repository (CUHK)?

Hi everyone,

I am building an automated scraper for an academic repository (https://repository.lib.cuhk.edu.hk/en/collection/etd/year), which organizes its resources by year. However, I've hit a hard block.

Here is the situation:
1. **Aggressive CAPTCHA:** The site uses a mandatory image-verification CAPTCHA that triggers on almost every visit, whether the client is a human-operated browser or a bot.
2. **AI/Agent Failure:** My automated agents attempted to solve the image verification multiple times but were consistently blocked.
3. **JS-Heavy:** The site relies heavily on JavaScript to load the actual item lists, so plain HTTP requests (`requests`/`curl`) return no usable content.

**My Questions:**
1. What is the state-of-the-art stack for bypassing such aggressive image CAPTCHAs in a fully automated Python pipeline today? (e.g., combining `undetected-chromedriver` with a specific solver service?)
2. Since this is an academic repository, it likely supports OAI-PMH. Has anyone skipped the frontend entirely on similar library systems by locating a standard harvesting endpoint (like `/oai/request`)?
3. Any specific configuration recommendations for browser automation to reduce the CAPTCHA difficulty/frequency on this specific type of site?
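
To make question 2 concrete, here is roughly the harvest I have in mind. It is only a minimal sketch: I have not confirmed that this repository exposes OAI-PMH at all, the base URL in the usage note is a placeholder, and `build_url`, `parse_page`, and `harvest` are names I made up for illustration. It assumes the endpoint serves `oai_dc` records and follows the standard `ListRecords` / `resumptionToken` paging from the OAI-PMH 2.0 spec.

```python
import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Standard OAI-PMH 2.0 and Dublin Core namespaces.
OAI_NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def build_url(base_url, resumption_token=None, metadata_prefix="oai_dc"):
    """Build a ListRecords URL; a resumption token must be sent on its own."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urllib.parse.urlencode(params)


def parse_page(xml_text):
    """Return (records, next_token) for one ListRecords response."""
    root = ET.fromstring(xml_text)
    error = root.find("oai:error", OAI_NS)
    if error is not None:
        raise RuntimeError(f"OAI-PMH error {error.get('code')}: {error.text}")
    records = []
    for rec in root.iterfind(".//oai:record", OAI_NS):
        records.append({
            "identifier": rec.findtext(
                "oai:header/oai:identifier", default="", namespaces=OAI_NS),
            # Deleted records carry no metadata, so the title may be empty.
            "title": rec.findtext(".//dc:title", default="", namespaces=OAI_NS),
        })
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=OAI_NS)
    return records, (token.strip() or None)


def _http_get(url):
    """Default fetcher: plain GET, decoded as UTF-8."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8")


def harvest(base_url, fetch=_http_get, delay=1.0):
    """Yield records page by page, pausing between requests to stay polite."""
    token = None
    while True:
        records, token = parse_page(fetch(build_url(base_url, token)))
        yield from records
        if token is None:
            break
        time.sleep(delay)
```

Usage would be something like `for rec in harvest("https://example.org/oai/request"): print(rec["identifier"], rec["title"])`, with the placeholder swapped for whatever endpoint the repository actually documents. The `fetch` parameter exists only so the paging logic can be tested offline.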

Thanks in advance for any insights!
