How to bypass aggressive forced CAPTCHA on a dynamic academic repository (CUHK)? #329

@HarideP

Hi everyone,

I am building an automated scraper for an academic repository (https://repository.lib.cuhk.edu.hk/en/collection/etd/year), which organizes its resources by year. However, I've hit a hard block.

Here is the situation:

  1. Aggressive CAPTCHA: The site uses a mandatory image-verification CAPTCHA that triggers on almost every visit, whether the visitor is a human in a real browser or a bot.
  2. AI/agent failure: Our automated agents attempted to solve the image verification multiple times but were consistently blocked.
  3. JS-heavy: The site relies heavily on JavaScript to load the actual item lists, so plain HTTP requests (requests/curl) are useless.

My Questions:

  1. What is the state-of-the-art stack for bypassing such aggressive image CAPTCHAs in a fully automated Python pipeline today? (e.g., combining undetected-chromedriver with a specific solver service?)
  2. Since this is an academic repository, it likely supports OAI-PMH. Does anyone have experience bypassing the frontend completely by finding standard API endpoints (like /oai/request) on similar library systems?
  3. Any specific configuration recommendations for browser automation to reduce the CAPTCHA difficulty/frequency on this specific type of site?
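
For question 2, this is roughly the harvester I would try first. It is a minimal sketch using only the standard library, and it rests on assumptions: `BASE_URL` is a guess built from the `/oai/request` path mentioned above (I have not confirmed that this repository exposes OAI-PMH at all), and `build_url`, `parse_page`, and `harvest` are names I made up for illustration. The request parameters and the `resumptionToken` paging follow the OAI-PMH 2.0 protocol.

```python
import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# ASSUMPTION: hypothetical endpoint, not confirmed for this repository.
BASE_URL = "https://repository.lib.cuhk.edu.hk/oai/request"


def build_url(base_url, resumption_token=None, metadata_prefix="oai_dc"):
    """Build a ListIdentifiers request URL."""
    if resumption_token:
        # resumptionToken is an exclusive argument: send it with the verb only.
        params = {"verb": "ListIdentifiers", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urllib.parse.urlencode(params)


def parse_page(xml_text):
    """Return (identifiers, next_token) from one ListIdentifiers response."""
    root = ET.fromstring(xml_text)
    error = root.find("oai:error", OAI_NS)
    if error is not None:
        raise RuntimeError(f"OAI-PMH error {error.get('code')}: {error.text}")
    ids = [el.text for el in root.iterfind(".//oai:header/oai:identifier", OAI_NS)]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    # An empty or missing resumptionToken means this was the last page.
    token = token_el.text.strip() if token_el is not None and token_el.text else None
    return ids, token


def harvest(base_url, delay=1.0):
    """Yield every record identifier, following resumption tokens politely."""
    token = None
    while True:
        with urllib.request.urlopen(build_url(base_url, token), timeout=30) as resp:
            ids, token = parse_page(resp.read())
        yield from ids
        if not token:
            break
        time.sleep(delay)  # pause between pages to keep the load low
```

Usage would be something like `for identifier in harvest(BASE_URL): print(identifier)`. If the endpoint does exist, this avoids the JavaScript frontend and the CAPTCHA entirely; if it returns an error or a 404, the sketch tells me nothing beyond that.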

Thanks in advance for any insights!
