Merged
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -27,6 +27,7 @@
- `--min-wait <seconds>` flag for `session auth` to configure grace period before auth success polling starts (default: 5 seconds, clamped to 0-300)
- `--max-field-length <N>` flag for `extract` macro to configure maximum characters per extracted field (default: 500, max: 2000)
- `--wait-loaded` flag for goto action - waits for async-rendered content to finish loading before taking the snapshot. Combines network idle, DOM stability, and loading indicator absence detection (spinners, skeletons, progress bars, aria-busy). Use `--timeout <ms>` to set wait timeout (default: 15000ms)
- Automatic content blocking detection in goto action - detects when sites serve pages but block content from headless browsers (e.g., X.com empty timelines). Uses provider-specific heuristics (content selectors, blocked indicators) and generic checks (empty content, persistent spinners). Response includes `contentBlocked: true`, `warning: 'content_blocked'`, and recovery suggestions. Disable with `--no-content-block-detect` flag

### Fixed
- Smart default snapshot scoping now includes complementary ARIA landmarks (`<aside>`, `[role="complementary"]`) alongside `<main>`, capturing sidebar content like repository stats (#26)
3 changes: 2 additions & 1 deletion README.md
@@ -94,7 +94,7 @@ web-ctl session end github

| Action | Usage | Returns |
|--------|-------|---------|
| `goto` | `run <s> goto <url> [--no-auth-wall-detect] [--ensure-auth] [--wait-loaded]` | `{ url, status, authWallDetected, checkpointCompleted, ensureAuthCompleted, waitLoaded, snapshot }` |
| `goto` | `run <s> goto <url> [--no-auth-wall-detect] [--no-content-block-detect] [--ensure-auth] [--wait-loaded]` | `{ url, status, authWallDetected, checkpointCompleted, ensureAuthCompleted, waitLoaded, contentBlocked, warning, contentBlockedReason, suggestion, snapshot }` |
| `snapshot` | `run <s> snapshot` | `{ url, snapshot }` |
| `click` | `run <s> click <sel> [--wait-stable]` | `{ url, clicked, snapshot }` |
| `click-wait` | `run <s> click-wait <sel> [--timeout]` | `{ url, clicked, settled, snapshot }` |
@@ -173,6 +173,7 @@ This eliminates the common click-snapshot-check loop that wastes agent turns on
| `--path <file>` | `screenshot` | Custom screenshot path (within session dir) |
| `--allow-evaluate` | `evaluate` | Required safety flag for JS execution |
| `--no-auth-wall-detect` | `goto` | Disable automatic auth wall detection and checkpoint opening |
| `--no-content-block-detect` | `goto` | Disable automatic content blocking detection (e.g., sites serving empty pages to headless browsers) |
| `--ensure-auth` | `goto` | Poll for auth completion instead of timed checkpoint; overrides `--no-auth-wall-detect` |
| `--wait-loaded` | `goto` | Wait for async content to finish rendering (network idle + loading indicator absence + DOM quiet) |
| `--snapshot-depth <N>` | Any action with snapshot | Limit ARIA tree depth (e.g. 3 for top 3 levels) |
175 changes: 172 additions & 3 deletions scripts/auth-wall-detect.js
@@ -1,14 +1,18 @@
'use strict';

/**
* Auth wall detection module.
* Auth wall and content blocking detection module.
*
* Detects whether a page is showing an authentication wall after navigation.
* detectAuthWall: Detects whether a page is showing an authentication wall.
* Uses three heuristics (ALL must pass - AND logic):
* 1. Domain cookies exist for the target URL
* 2. Current page URL matches a known auth URL pattern
* 3. Page DOM contains login-related elements or text
*
* detectContentBlocked: Detects when a site serves a page but blocks the
* actual content (e.g. X.com serving empty timelines to headless browsers).
* Uses provider-specific and generic heuristics (OR logic - any match triggers).
*
* detectAuthWall short-circuits: if the cookie check fails, it skips the URL and DOM checks.
*/

@@ -148,4 +152,169 @@ async function detectAuthWall(page, context, targetUrl) {
return { detected: false, reason: 'no_auth_elements' };
}

module.exports = { detectAuthWall, AUTH_URL_PATTERNS, AUTH_DOM_SELECTORS, AUTH_TEXT_PATTERNS };
// --- Content blocking detection ---

const CONTENT_BLOCKED_TEXT_PATTERNS = [
'something went wrong',
'try again',
'content is not available',
'this page is not available',
'page isn\'t available',
'page not found',
'access denied',
'please enable javascript'
];

const LOADING_INDICATOR_SELECTORS = [
'[role="progressbar"]',
'[aria-busy="true"]',
'.spinner',
'.loading'
];

const DEFAULT_EMPTY_CONTENT_THRESHOLD = 200;

/**
* Detect whether the page content is blocked (e.g. by headless browser detection).
*
* Unlike auth wall detection (AND logic), content blocking uses OR logic -
* any single heuristic match triggers detection. Checks are ordered from
* most specific (provider selectors) to most generic (persistent spinners).
*
* @param {import('playwright').Page} page
* @param {object} [options={}]
* @param {string[]} [options.contentSelectors] - Provider-specific content selectors
* @param {object} [options.contentBlockedIndicators] - Provider-specific blocked indicators
* @param {string[]} [options.contentBlockedIndicators.selectors] - Selectors that indicate blocked content
* @param {string[]} [options.contentBlockedIndicators.textPatterns] - Text patterns that indicate blocked content
* @param {number} [options.contentBlockedIndicators.emptyContentThreshold] - Min chars for content to be considered present
* @param {number} [options.timeout] - Not currently used, reserved for future async checks
* @returns {Promise<{ detected: boolean, reason: string, details?: object }>}
*/
async function detectContentBlocked(page, options = {}) {
const { contentSelectors, contentBlockedIndicators } = options;
const blockedSelectors = contentBlockedIndicators?.selectors || [];
const blockedTextPatterns = contentBlockedIndicators?.textPatterns || [];
// Use ?? so an explicit threshold of 0 is honored instead of falling back to the default
const emptyThreshold = contentBlockedIndicators?.emptyContentThreshold ?? DEFAULT_EMPTY_CONTENT_THRESHOLD;

// Fetch body text once, reuse in steps 2 and 4
let bodyText = null;
try {
bodyText = (await page.textContent('body') || '').slice(0, 5000).toLowerCase();
} catch {
// textContent failed - text-based checks will be skipped
}

// 1. Provider-specific blocked selectors
if (blockedSelectors.length > 0) {
try {
const results = await Promise.allSettled(
blockedSelectors.map(async (sel) => ({ sel, el: await page.$(sel) }))
);
for (const r of results) {
if (r.status === 'fulfilled' && r.value.el) {
return {
detected: true,
reason: 'provider_blocked_selector',
details: { selector: r.value.sel }
};
}
}
} catch {
// DOM query failed - continue to next check
}
}

// 2. Provider-specific blocked text patterns
if (blockedTextPatterns.length > 0 && bodyText !== null) {
// bodyText is lowercased above, so lowercase each pattern before comparing
const matched = blockedTextPatterns.find(pattern => bodyText.includes(pattern.toLowerCase()));
if (matched) {
return {
detected: true,
reason: 'provider_blocked_text',
details: { pattern: matched }
};
}
}

// 3. Provider content selectors exist but contain very little text
if (contentSelectors && contentSelectors.length > 0) {
try {
let totalContentLength = 0;
let anyContentSelectorFound = false;

const results = await Promise.allSettled(
contentSelectors.map(async (sel) => {
const el = await page.$(sel);
if (!el) return { sel, found: false, length: 0 };
const text = await el.textContent() || '';
return { sel, found: true, length: text.trim().length };
})
);

for (const r of results) {
if (r.status === 'fulfilled' && r.value.found) {
anyContentSelectorFound = true;
totalContentLength += r.value.length;
}
}

if (anyContentSelectorFound && totalContentLength < emptyThreshold) {
return {
detected: true,
reason: 'content_empty',
details: { contentLength: totalContentLength, threshold: emptyThreshold }
};
}
} catch {
// DOM query failed - continue to next check
}
}

// 4. Generic text patterns on a short page (body text under 500 chars)
if (bodyText !== null) {
const genericMatch = CONTENT_BLOCKED_TEXT_PATTERNS.find(pattern => bodyText.includes(pattern));
if (genericMatch && bodyText.length < 500) {
return {
detected: true,
reason: 'generic_blocked_text',
details: { pattern: genericMatch, bodyLength: bodyText.length }
};
}
}

// 5. Persistent loading indicators (spinners still visible)
try {
const results = await Promise.allSettled(
LOADING_INDICATOR_SELECTORS.map(async (sel) => {
const el = await page.$(sel);
if (!el) return { sel, visible: false };
const visible = await el.isVisible();
return { sel, visible };
})
);
for (const r of results) {
if (r.status === 'fulfilled' && r.value.visible) {
return {
detected: true,
reason: 'persistent_loader',
details: { selector: r.value.sel }
};
}
}
} catch {
// DOM query failed - no detection
}

return { detected: false, reason: 'content_ok' };
}

module.exports = {
detectAuthWall,
detectContentBlocked,
AUTH_URL_PATTERNS,
AUTH_DOM_SELECTORS,
AUTH_TEXT_PATTERNS,
CONTENT_BLOCKED_TEXT_PATTERNS,
LOADING_INDICATOR_SELECTORS
};
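
The ordered OR logic used by `detectContentBlocked` (first matching heuristic wins, a failing check never blocks later ones) can be illustrated in isolation. This is a minimal sketch, not the module's code: `firstDetected` and `exampleChecks` are hypothetical names introduced here, and each check is assumed to return `null` when it finds nothing.

```javascript
/**
 * Run checks in order and return the first positive result (OR logic).
 * Hypothetical helper for illustration; not part of auth-wall-detect.js.
 * Each check is an async function returning { reason, details } or null.
 * A check that throws is treated as "no match", mirroring the try/catch
 * blocks in detectContentBlocked.
 */
async function firstDetected(checks) {
  for (const check of checks) {
    try {
      const hit = await check();
      if (hit) return { detected: true, ...hit };
    } catch {
      // A failing check must not mask later, more generic checks.
    }
  }
  return { detected: false, reason: 'content_ok' };
}

// Stub checks ordered from most specific to most generic (illustrative values).
const exampleChecks = [
  // Provider selector check: nothing found
  async () => null,
  // Provider text check: simulated DOM failure
  async () => { throw new Error('DOM query failed'); },
  // Content selectors present but nearly empty
  async () => ({ reason: 'content_empty', details: { contentLength: 12, threshold: 200 } }),
  // Never reached, because the previous check already matched
  async () => ({ reason: 'persistent_loader', details: { selector: '.spinner' } })
];

firstDetected(exampleChecks).then((result) => {
  console.log(result.reason); // content_empty
});
```

Ordering the checks from specific to generic means the reported `reason` is the most informative one available, which is what the goto action later surfaces as `contentBlockedReason`.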
37 changes: 37 additions & 0 deletions scripts/browser-launcher.js
@@ -106,7 +106,44 @@ async function launchBrowser(sessionName, options = {}) {

// Anti-bot init script on all pages
await context.addInitScript(() => {
// Hide webdriver flag
Object.defineProperty(navigator, 'webdriver', { get: () => false });

// Spoof window.chrome object (present in real Chrome, missing in headless)
if (!window.chrome) {
window.chrome = { runtime: {}, csi: function() {}, loadTimes: function() {} };
}

// Spoof navigator.plugins to appear non-empty (headless has empty PluginArray)
Object.defineProperty(navigator, 'plugins', {
get: () => {
const arr = [{ name: 'Chrome PDF Plugin', filename: 'internal-pdf-viewer', description: 'Portable Document Format' }];
arr.item = (i) => arr[i];
arr.namedItem = (n) => arr.find(p => p.name === n) || null;
arr.refresh = () => {};
return arr;
}
});

// Set navigator.languages (headless may report empty or single-entry)
Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });

// Override WebGL renderer to common hardware (Intel Iris)
const getParameter = WebGLRenderingContext.prototype.getParameter;
WebGLRenderingContext.prototype.getParameter = function(param) {
if (param === 0x9245) return 'Intel Inc.'; // UNMASKED_VENDOR_WEBGL
if (param === 0x9246) return 'Intel Iris OpenGL Engine'; // UNMASKED_RENDERER_WEBGL
return getParameter.call(this, param);
};

// Override permissions.query so 'notifications' reports 'denied' (headless returns 'prompt')
const origQuery = navigator.permissions.query.bind(navigator.permissions);
navigator.permissions.query = (params) => {
if (params.name === 'notifications') {
return Promise.resolve({ state: 'denied', onchange: null });
}
return origQuery(params);
};
});

// Get or create the first page
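
The `navigator` overrides in this hunk rely on `Object.defineProperty` with a getter to shadow a value that normally comes from the prototype. Below is a minimal sketch of the same technique outside the browser, assuming Node and a plain object standing in for `navigator`. `overrideGetter` and `fakeNavigator` are hypothetical names used only for this illustration, and `configurable: true` is an addition the init script itself does not set.

```javascript
/**
 * Shadow a property with an own getter that always returns `value`.
 * configurable: true (an assumption of this sketch) allows later redefinition.
 */
function overrideGetter(target, prop, value) {
  Object.defineProperty(target, prop, {
    get: () => value,
    configurable: true
  });
  return target;
}

// Stand-in for navigator: the "real" values live on the prototype,
// much as navigator.webdriver is defined on Navigator.prototype.
const fakeNavigator = Object.create({ webdriver: true, languages: [] });

overrideGetter(fakeNavigator, 'webdriver', false);
overrideGetter(fakeNavigator, 'languages', ['en-US', 'en']);

console.log(fakeNavigator.webdriver);           // false
console.log(fakeNavigator.languages.join(',')); // en-US,en
```

The prototype value is untouched. The own getter simply takes precedence on lookup, which is why the init script can apply such overrides without modifying `Navigator.prototype`.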
8 changes: 7 additions & 1 deletion scripts/providers.json
@@ -53,8 +53,14 @@
"captchaTextPatterns": ["verify it's you", "complete the challenge", "confirm your identity"],
"twoFactorHint": "X may prompt for a TOTP code, SMS code, or security key. X may also ask you to confirm your username or phone number as an identity check before password.",
"twoFactorSelectors": ["input[data-testid=\"ocfEnterTextTextInput\"]", "input[name=\"text\"][autocomplete=\"one-time-code\"]"],
"contentSelectors": ["[data-testid=\"primaryColumn\"]", "article[data-testid=\"tweet\"]", "[data-testid=\"cellInnerDiv\"]"],
"contentBlockedIndicators": {
"selectors": ["[data-testid=\"empty_state_header_text\"]", "[data-testid=\"error-detail\"]"],
"textPatterns": ["something went wrong", "try again", "content is not available", "this page is not available"],
"emptyContentThreshold": 200
},
"flowType": "spa",
"notes": "Multi-step SPA flow: identifier -> optional username challenge -> password -> optional 2FA. URL does not change between steps. Arkose FunCAPTCHA triggers on nearly every new IP. Most aggressive anti-bot of any provider."
"notes": "Multi-step SPA flow: identifier -> optional username challenge -> password -> optional 2FA. URL does not change between steps. Arkose FunCAPTCHA triggers on nearly every new IP. Most aggressive anti-bot of any provider. X.com blocks headless browsers from viewing feed content - pages load but timelines appear empty or show error states. Authenticated sessions with real browser profiles work reliably."
},
{
"slug": "reddit",
58 changes: 55 additions & 3 deletions scripts/web-ctl.js
@@ -3,7 +3,7 @@

const sessionStore = require('./session-store');
const { launchBrowser, closeBrowser, randomDelay, waitForStable, waitForLoaded, canLaunchHeaded } = require('./browser-launcher');
const { detectAuthWall } = require('./auth-wall-detect');
const { detectAuthWall, detectContentBlocked } = require('./auth-wall-detect');
const { runAuthFlow } = require('./auth-flow');
const { checkAuthSuccess } = require('./auth-check');
const { sanitizeWebContent, wrapOutput } = require('./redact');
@@ -21,7 +21,7 @@ const BOOLEAN_FLAGS = new Set([
'--allow-evaluate', '--no-snapshot', '--wait-stable', '--vnc',
'--exact', '--accept', '--submit', '--dismiss', '--auto',
'--snapshot-collapse', '--snapshot-text-only', '--snapshot-compact',
'--snapshot-full', '--no-auth-wall-detect', '--ensure-auth', '--wait-loaded',
'--snapshot-full', '--no-auth-wall-detect', '--no-content-block-detect', '--ensure-auth', '--wait-loaded',
]);

function validateSessionName(name) {
@@ -57,6 +57,37 @@ function parseOptions(args) {
return opts;
}

/**
* Match a provider by domain from providers.json.
* Uses a lazily built Map keyed by each provider's loginUrl hostname
* (exact hostname match) for O(1) lookup.
*/
let _providerDomainMap = null;
function matchProviderByDomain(url) {
if (!_providerDomainMap) {
_providerDomainMap = new Map();
try {
const providers = require('./providers.json');
for (const p of providers) {
try {
const domain = new URL(p.loginUrl).hostname;
_providerDomainMap.set(domain, p);
} catch {
// Skip provider with invalid loginUrl
}
}
} catch {
// providers.json load failed - return null for all lookups
}
}

try {
const domain = new URL(url).hostname;
return _providerDomainMap.get(domain) || null;
} catch {
return null;
}
}

/**
* Convert selector string to Playwright locator.
*/
@@ -1044,8 +1075,28 @@ async function runAction(sessionName, action, actionArgs, opts) {
if (opts.waitLoaded) {
await waitForLoaded(page, { timeout: loadedTimeout });
}
// Content blocking detection (e.g. X.com empty feeds in headless)
let contentBlockResult = null;
if (!opts.noContentBlockDetect) {
const provider = matchProviderByDomain(url);
contentBlockResult = await detectContentBlocked(page, {
contentSelectors: provider?.contentSelectors,
contentBlockedIndicators: provider?.contentBlockedIndicators
});
}
const snapshot = await getSnapshot(page, opts);
result = { url: page.url(), status: response ? response.status() : null, ...(opts.waitLoaded && { waitLoaded: true }), ...(snapshot != null && { snapshot }) };
result = {
url: page.url(),
status: response ? response.status() : null,
...(opts.waitLoaded && { waitLoaded: true }),
...(contentBlockResult?.detected && {
contentBlocked: true,
warning: 'content_blocked',
contentBlockedReason: contentBlockResult.reason,
suggestion: "Site may be blocking headless browsers. Try: (1) authenticate with 'session auth <name> --provider <provider>', (2) use --ensure-auth for headed mode"
}),
...(snapshot != null && { snapshot })
};
break;
}

@@ -1270,6 +1321,7 @@ Run actions:
goto <url> Navigate to URL
[--ensure-auth] Poll for auth completion instead of timed checkpoint
[--wait-loaded] Wait for async content to finish rendering
[--no-content-block-detect] Skip content blocking detection
[--timeout <ms>] Wait timeout (default: 15000)
snapshot Get accessibility tree
click <selector> Click element
6 changes: 4 additions & 2 deletions skills/web-browse/SKILL.md
@@ -46,7 +46,7 @@ Safe practice: always double-quote URL arguments.
### goto - Navigate to URL

```bash
node ${PLUGIN_ROOT}/scripts/web-ctl.js run <session> goto <url> [--no-auth-wall-detect] [--ensure-auth] [--wait-loaded]
node ${PLUGIN_ROOT}/scripts/web-ctl.js run <session> goto <url> [--no-auth-wall-detect] [--no-content-block-detect] [--ensure-auth] [--wait-loaded]
```

Navigates to a URL and automatically detects authentication walls using a three-heuristic detection system:
@@ -62,7 +62,9 @@ Use `--ensure-auth` to actively poll for authentication completion instead of a

Use `--wait-loaded` to wait for async-rendered content to finish loading before taking the snapshot. This combines network idle, DOM stability, loading indicator absence detection (spinners, skeletons, progress bars, aria-busy), and a final DOM quiet period. Use `--timeout <ms>` to set the wait timeout (default: 15000ms). Ideal for SPAs and pages that render content after the initial page load.

Returns: `{ url, status, authWallDetected, checkpointCompleted, ensureAuthCompleted, waitLoaded, snapshot }`
Use `--no-content-block-detect` to disable automatic detection of content blocking (e.g., sites serving empty pages to headless browsers). When content blocking is detected, the response includes `contentBlocked: true`, `warning: 'content_blocked'`, and a suggestion to authenticate or use headed mode.

Returns: `{ url, status, authWallDetected, checkpointCompleted, ensureAuthCompleted, waitLoaded, contentBlocked, warning, contentBlockedReason, suggestion, snapshot }`
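
A caller might branch on the content blocking fields as follows. This is a minimal sketch, not part of web-ctl: `summarizeGoto` is a hypothetical helper, the sample result values are illustrative, and it assumes the goto result has already been parsed into a plain object.

```javascript
/**
 * Reduce a goto result to a small status object.
 * Hypothetical helper; the { ok, reason, hint } shape is chosen for this example.
 */
function summarizeGoto(result) {
  if (result.contentBlocked) {
    return {
      ok: false,
      // Prefer the specific heuristic name, fall back to the generic warning
      reason: result.contentBlockedReason || result.warning,
      hint: result.suggestion || null
    };
  }
  return { ok: true, reason: null, hint: null };
}

// Illustrative blocked response (values are made up for the example)
const blocked = summarizeGoto({
  url: 'https://x.com/home',
  status: 200,
  contentBlocked: true,
  warning: 'content_blocked',
  contentBlockedReason: 'content_empty',
  suggestion: 'Site may be blocking headless browsers.'
});

console.log(blocked.ok);     // false
console.log(blocked.reason); // content_empty
```

Because the content blocking fields are only present when blocking is detected, checking `result.contentBlocked` for truthiness is enough. A normal response simply omits them.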

### snapshot - Get Accessibility Tree
