X.com feed blocks headless browsers - use as test case for auth wall detection #38

@avifenesh

Context

X.com (Twitter) is one of the most aggressive sites at detecting and blocking automated/headless browser access. This makes it an ideal test case for building robust auth wall and content-blocking detection (related to #34).

Observed Behavior

  1. goto https://x.com - loads page with status 200 but empty snapshot
  2. checkpoint (headed mode) - user is logged in, sidebar renders, URL changes to /home
  3. Back in headless mode - page shows infinite loading spinner
  4. Screenshot confirms: sidebar with nav items loads initially, but feed content never renders
  5. evaluate "document.querySelectorAll('article').length" returns 0
  6. evaluate "document.body.innerText.length" returns 0
  7. After a few attempts, even the sidebar disappears - just a blank page with spinner
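
The symptoms above can be condensed into a small classifier. This is a minimal sketch, not web-ctl's actual implementation: `PageSignals`, `classify_page`, and the `"http_error"` value are hypothetical names, and `spinner_visible` assumes the caller already has some way to tell that a loading indicator is on screen. Only the `"content_blocked"` warning name comes from this issue.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageSignals:
    """Hypothetical container for values gathered from the page."""
    status: int            # HTTP status of the navigation response
    article_count: int     # document.querySelectorAll('article').length
    body_text_length: int  # document.body.innerText.length
    spinner_visible: bool  # assumed: caller detected a loading indicator


def classify_page(signals: PageSignals) -> Optional[str]:
    """Return a warning string, or None when the page looks readable."""
    if signals.status >= 400:
        return "http_error"  # hypothetical warning name, not from the issue
    # Status 200 with no articles and no text: the empty-snapshot case.
    if signals.article_count == 0 and signals.body_text_length == 0:
        return "content_blocked"
    # Sidebar rendered (some text) but the feed is still a spinner.
    if signals.article_count == 0 and signals.spinner_visible:
        return "content_blocked"
    return None
```

With the values reported in steps 1, 5, and 6 (status 200, zero articles, zero text length), `classify_page` returns `"content_blocked"` instead of letting an empty snapshot pass silently.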

Why This Matters

X.com represents the hardest category of sites to automate. If web-ctl can handle X, it should cope with most other sites that use anti-bot measures. This issue tracks making X.com a first-class test case for:

  1. Content blocking detection - Page chrome loads but the main content area stays as a spinner. web-ctl should detect this pattern and surface "warning": "content_blocked" rather than silently returning empty snapshots.

  2. Headless detection evasion - Research what signals X uses to detect headless browsers (navigator.webdriver, CDP detection, canvas fingerprinting, etc.) and whether Playwright stealth techniques can help.

  3. Session persistence - Cookies from headed checkpoint don't seem to carry sufficient auth state back to headless mode. X may be checking for properties that only exist in headed contexts.

  4. Feed API approach - Investigate whether intercepting X's internal API calls (via network capture) could be a more reliable path than DOM scraping.
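
For item 3, one way to test the session-persistence hypothesis is to diff the cookies seen after the headed checkpoint against those present in headless mode. The sketch below assumes both sets are available as lists of dicts with `name` and `domain` keys (the shape Playwright's `BrowserContext.cookies()` returns); `cookie_keys` and `missing_in_headless` are hypothetical helper names, and the cookie names in the usage lines are placeholders, not real X cookies.

```python
from typing import Dict, Iterable, List, Set, Tuple

Cookie = Dict[str, object]


def cookie_keys(cookies: Iterable[Cookie]) -> Set[Tuple[str, str]]:
    """Identify each cookie by its (domain, name) pair."""
    return {(str(c.get("domain", "")), str(c["name"])) for c in cookies}


def missing_in_headless(
    headed: Iterable[Cookie], headless: Iterable[Cookie]
) -> List[Tuple[str, str]]:
    """Cookies present after the headed checkpoint but absent in headless mode."""
    return sorted(cookie_keys(headed) - cookie_keys(headless))


# Placeholder data for illustration only.
headed_cookies = [
    {"name": "placeholder_a", "domain": ".x.com", "value": "1"},
    {"name": "placeholder_b", "domain": ".x.com", "value": "2"},
]
headless_cookies = [
    {"name": "placeholder_a", "domain": ".x.com", "value": "1"},
]
lost = missing_in_headless(headed_cookies, headless_cookies)
```

An empty result would suggest the cookies do carry over and that X is rejecting the headless session on some other signal, which would shift the investigation toward item 2.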

Acceptance Criteria

  • web-ctl can navigate to x.com/home and read feed content in headless mode
  • If blocked, web-ctl surfaces a clear warning within 10 seconds (instead of an infinite spinner)
  • The techniques that work and those that don't are documented for reference
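
The 10-second criterion could be met with a bounded polling loop. This is a sketch under stated assumptions: `wait_for_content` is a hypothetical helper, `probe` stands for whatever call returns the current `article` count, and the shape of the returned dict is illustrative rather than web-ctl's real output format. The clock and sleep functions are injectable so the loop can be tested without waiting.

```python
import time
from typing import Callable, Dict


def wait_for_content(
    probe: Callable[[], int],
    timeout_s: float = 10.0,
    interval_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, object]:
    """Poll `probe` until it reports content or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        count = probe()
        if count > 0:
            return {"ok": True, "articles": count, "warning": None}
        if clock() >= deadline:
            # Give up with an explicit warning instead of spinning forever.
            return {"ok": False, "articles": 0, "warning": "content_blocked"}
        sleep(interval_s)
```

Because the deadline is checked after every probe, the caller gets either content or a `"content_blocked"` warning roughly `timeout_s` seconds after the call starts, plus at most one polling interval and the duration of one probe.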

Labels

This is a research/learning issue - treat it as a special use case for hardening web-ctl against aggressive anti-bot sites.

Metadata

  • Labels: enhancement