Potential fix for code scanning alert no. 186: Incomplete multi-character sanitization by Jackson57279 · Pull Request #62 · Jackson57279/zapdev

Jackson57279 · 2025-08-13T03:38:03Z

Potential fix for https://github.com/otdoges/zapdev/security/code-scanning/186

The best way to fix this problem is to use a proper HTML parser to remove <script> and <style> elements and their content, rather than relying on regular expressions. Since cheerio is already imported in the file, we can use it to parse the HTML, remove all <script> and <style> elements, and then extract the text content. This approach is robust against malformed or obfuscated tags and does not rely on repeated regex replacements. The change should be made in the extractTextContent method (lines 318–332), replacing the current implementation with one that uses cheerio to remove the unwanted elements and extract the text.
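
To make the alert's concern concrete, here is a minimal, dependency-free sketch of why a single regex pass over multi-character tokens is incomplete. The function names and the sample input are hypothetical and are not taken from `src/lib/search-service.ts`; they only illustrate the general failure mode.

```typescript
// Hypothetical single-pass sanitizer: removes opening and closing script tags once.
export function stripScriptTagsOnce(input: string): string {
  return input.replace(/<script[^>]*>|<\/script>/gi, '');
}

// Hypothetical loop-until-stable variant: repeats the pass until nothing changes.
export function stripScriptTagsUntilStable(input: string): string {
  let previous: string;
  let current = input;
  do {
    previous = current;
    current = stripScriptTagsOnce(current);
  } while (current !== previous);
  return current;
}

// Removing the inner tags splices the outer fragments back into a real tag.
export const nested = '<scr<script>ipt>alert(1)</scr</script>ipt>';

console.log(stripScriptTagsOnce(nested));        // <script>alert(1)</script>
console.log(stripScriptTagsUntilStable(nested)); // alert(1)
```

Looping until the string is stable closes this particular gap, but the script body still survives as text, which is one reason removing the parsed elements is the more dependable option.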

Suggested fixes powered by Copilot Autofix. Review carefully before merging.

Summary by CodeRabbit

New Features
- Cleaner text displayed from HTML sources by excluding script/style content and normalizing whitespace.
Refactor
- Reworked HTML text extraction to use a DOM-based parsing approach for more consistent results.
- Simplified internal logic by removing iterative string-based cleanup.
- No changes to public APIs.

…cter sanitization Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

vercel · 2025-08-13T03:38:06Z

The latest updates on your projects.

Project	Deployment	Preview	Comments	Updated (UTC)
zapdev	✅ Ready	Preview	Comment	Aug 13, 2025 3:38am

coderabbitai · 2025-08-13T03:38:11Z

Walkthrough

Replaced regex-based HTML text extraction in BraveSearchService.extractTextContent with a Cheerio-based DOM parsing approach that removes script/style nodes before extracting and normalizing text. Public APIs and other methods remain unchanged.

Changes

Cohort / File(s)	Summary
Search text extraction refactor `src/lib/search-service.ts`	Switch from iterative regex/loops to Cheerio: load HTML, remove script/style, read root text, collapse whitespace, trim. Removed previous do-while cleanup and global tag-removal regex. No API signature changes.

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant BraveSearchService
  participant Cheerio

  Caller->>BraveSearchService: extractTextContent(html)
  BraveSearchService->>Cheerio: load(html)
  Cheerio-->>BraveSearchService: DOM
  BraveSearchService->>Cheerio: remove(script, style)
  BraveSearchService->>Cheerio: root().text()
  Cheerio-->>BraveSearchService: textContent
  BraveSearchService-->>Caller: collapsed/trimmed text

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

Potential fix for code scanning alert no. 120: Bad HTML filtering regexp #56: Also replaces extractTextContent’s regex approach with a DOM-based parser (DOMParser + fallback), aligning with the same function change.
Potential fix for code scanning alert no. 121: Incomplete multi-character sanitization #55: Modifies extractTextContent to alter script/style removal and text extraction in the same file/function.
Potential fix for code scanning alert no. 122: Incomplete multi-character sanitization #53: Targets extractTextContent to improve script/style stripping prior to text extraction.

Poem

A bunny hops through tags and trees,
Snips scripts and styles with gentle ease.
From tangled markup, clear words bloom,
Cheerio whispers, clearing gloom.
One tidy hop, the text is free—
Clean content for you and me. 🐇✨

claude · 2025-08-13T03:38:15Z

Claude finished @otdoges's task

Code Review Complete ✅

Todo List:

Read and analyze the modified file
Review the security fix implementation
Check for potential bugs or edge cases
Evaluate performance considerations
Assess test coverage needs
Provide comprehensive feedback

Overall Assessment: Excellent Security Fix 🔒

This PR successfully addresses CodeQL security alert #186 by replacing a vulnerable regex-based approach with a robust HTML parser solution. The change is well-implemented and significantly improves security.

Code Quality & Best Practices ✅

Strengths:

Proper tool usage: Leverages existing cheerio dependency instead of adding new dependencies
Clean implementation: Reduces code complexity from 15 lines to 4 lines
Consistent style: Follows existing codebase patterns and TypeScript conventions
Good documentation: Clear comment explaining the security motivation

Implementation Quality:
The new implementation at src/lib/search-service.ts:318-323 is clean and idiomatic:

private extractTextContent(html: string): string {
  // Use cheerio to robustly remove <script> and <style> tags and extract text
  const $ = cheerio.load(html);
  $('script, style').remove();
  return $.root().text().replace(/\s+/g, ' ').trim();
}

Security Analysis 🛡️

Major Security Improvement:

Eliminates incomplete sanitization: The previous regex approach was vulnerable to malformed/obfuscated tags
Prevents bypass attacks: HTML parser correctly handles edge cases like <scripttype="text/javascript">
Robust against variations: Handles self-closing tags, nested elements, and malformed HTML
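
As a rough illustration of the "variations" point, the sketch below uses a deliberately naive pattern (hypothetical, not the regex that was in the repository) and shows two inputs it fails to strip even though an HTML parser treats them as ordinary script elements.

```typescript
// Hypothetical naive filter: only matches the exact lowercase, attribute-free form.
export function stripNaive(html: string): string {
  return html.replace(/<script>[\s\S]*?<\/script>/g, '');
}

// The exact form is removed.
export const exact = '<script>a()</script>x';

// Uppercase tag names are still script elements to a parser, but the pattern misses them.
export const upper = '<SCRIPT>a()</SCRIPT>x';

// Attributes and whitespace before the closing ">" are also valid, and also missed.
export const spaced = '<script type="text/javascript">a()</script >x';

console.log(stripNaive(exact));  // x
console.log(stripNaive(upper));  // unchanged
console.log(stripNaive(spaced)); // unchanged
```

Each gap can be patched with a more permissive pattern, but the list of accepted spellings is defined by the HTML tokenizer, so delegating to a parser avoids having to chase them one by one.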

Security Context:
This method is used in analyzeWebsite() (line 237) to extract text content for analysis, making proper sanitization critical for preventing XSS in downstream processing.

Performance Considerations ⚡

Performance Impact:

Trade-off justified: HTML parsing is slightly slower than regex, but the security benefit outweighs the minimal performance cost
Efficient approach: cheerio.load() is optimized for server-side HTML parsing
Usage context: Called once per website analysis with 15-second timeout, so performance impact is negligible

Memory: The new approach avoids the repeated string copies made by the old loop, although it does allocate a DOM tree for the parsed document.

Potential Issues & Edge Cases 🔍

None identified - The implementation handles edge cases well:

Malformed HTML: Cheerio's parser handles broken/incomplete tags gracefully
Nested elements: Properly removes script/style tags regardless of nesting
Large documents: Memory usage is reasonable for typical web content
Empty input: Gracefully handles empty or null-like HTML

Test Coverage Assessment 📋

Current state: No tests visible for this specific method in the codebase.

Recommendations for follow-up:

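// Suggested test cases (for future PR).
// Assumption: `service` is a BraveSearchService instance created in test setup;
// bracket access is used only because extractTextContent is private.
describe('extractTextContent', () => {
  const extract = (html: string): string => service['extractTextContent'](html);

  it('should remove script tags and content', () => {
    const html = '<div>Hello <script>alert("xss")</script> World</div>';
    // Whitespace is collapsed, so a single space remains between the words.
    expect(extract(html)).toBe('Hello World');
  });

  it('should handle case and whitespace variations in script tags', () => {
    const html = '<div>Text <SCRIPT type="text/javascript">evil()</script > More</div>';
    expect(extract(html)).toBe('Text More');
  });

  it('should remove style tags and content', () => {
    const html = '<div>Content <style>body{color:red}</style> Text</div>';
    expect(extract(html)).toBe('Content Text');
  });
});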

Final Verdict ✅

Approve with confidence - This is a high-quality security fix that:

✅ Completely resolves the CodeQL security alert
✅ Improves code maintainability and readability
✅ Uses appropriate tools and follows best practices
✅ Has minimal performance impact for significant security benefit

No blocking issues found. The implementation is ready for merge.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

src/lib/search-service.ts (2)

319-322: Prefer DOMParser in browsers, fallback to Cheerio to cut bundle size

This module appears to run client-side; Cheerio is heavy for the browser. Consider using DOMParser when available and falling back to Cheerio (Node/SSR). Also exclude noscript/template and extract from body to avoid bringing in head/title duplicates, and normalize NBSP.

Apply this diff:

-    // Use cheerio to robustly remove <script> and <style> tags and extract text
-    const $ = cheerio.load(html);
-    $('script, style').remove();
-    return $.root().text().replace(/\s+/g, ' ').trim();
+    // Prefer DOMParser in browsers; fallback to Cheerio (Node/SSR)
+    try {
+      if (typeof window !== 'undefined' && typeof DOMParser !== 'undefined') {
+        const parser = new DOMParser();
+        const doc = parser.parseFromString(html, 'text/html');
+        doc.querySelectorAll('script, style, noscript, template').forEach((n) => n.remove());
+        const text =
+          (doc.body && doc.body.textContent) ||
+          (doc.documentElement && doc.documentElement.textContent) ||
+          '';
+        return text.replace(/\u00A0/g, ' ').replace(/\s+/g, ' ').trim();
+      }
+    } catch {
+      // Fall through to Cheerio
+    }
+    const $ = cheerio.load(html);
+    $('script, style, noscript, template').remove();
+    const text = $('body').length ? $('body').text() : $.root().text();
+    return text.replace(/\u00A0/g, ' ').replace(/\s+/g, ' ').trim();
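
If the DOMParser-first approach were adopted, the environment check could be pulled into a small pure function so both branches can be unit tested without a browser. This is only a sketch: `ParserChoice`, `RuntimeGlobals`, and `chooseParser` are hypothetical names and do not exist in the PR.

```typescript
// Hypothetical names for illustration; none of these exist in the PR.
export type ParserChoice = 'domparser' | 'cheerio';

export interface RuntimeGlobals {
  window?: unknown;
  DOMParser?: unknown;
}

// Mirrors the condition in the suggested diff: both globals must be present
// before the browser parser is chosen; otherwise fall back to Cheerio.
export function chooseParser(globals: RuntimeGlobals): ParserChoice {
  const hasWindow = typeof globals.window !== 'undefined';
  const hasDomParser = typeof globals.DOMParser !== 'undefined';
  return hasWindow && hasDomParser ? 'domparser' : 'cheerio';
}

// In application code the real global object would be passed in, for example:
// chooseParser(globalThis as RuntimeGlobals)
```

Passing the globals in as an argument keeps the decision independent of the runtime the tests happen to execute in.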

319-322: If keeping Cheerio-only: remove noscript/template, use body text, and normalize NBSP

Minor hardening without changing approach:

Remove noscript/template (often contain non-UX text).
Extract from body to avoid duplicating head/title text.
Normalize NBSP before whitespace collapse.

-    const $ = cheerio.load(html);
-    $('script, style').remove();
-    return $.root().text().replace(/\s+/g, ' ').trim();
+    const $ = cheerio.load(html);
+    $('script, style, noscript, template').remove();
+    const text = $('body').length ? $('body').text() : $.root().text();
+    return text.replace(/\u00A0/g, ' ').replace(/\s+/g, ' ').trim();
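
One detail worth checking before applying the NBSP suggestion: in ECMAScript regular expressions, `\s` already matches U+00A0, so the explicit NBSP replacement should not change the result in a JavaScript runtime. The sketch below compares the two forms on a sample string; the function names are hypothetical.

```typescript
// Whitespace collapse as used in the merged implementation.
export function collapseWhitespace(text: string): string {
  return text.replace(/\s+/g, ' ').trim();
}

// Variant from the nitpick, with an explicit NBSP pass first.
export function collapseWithExplicitNbsp(text: string): string {
  return text.replace(/\u00A0/g, ' ').replace(/\s+/g, ' ').trim();
}

// Sample mixing regular spaces, NBSP, newline, and tab.
export const sample = '  Hello\u00A0\u00A0world\n\tagain\u00A0';

console.log(collapseWhitespace(sample));       // Hello world again
console.log(collapseWithExplicitNbsp(sample)); // Hello world again
```

Keeping the explicit pass is harmless and may still be preferred for readability; it just is not required for correctness here.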

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2cf4d54 and d7a6463.

📒 Files selected for processing (1)

src/lib/search-service.ts (1 hunks)

🔇 Additional comments (1)

src/lib/search-service.ts (1)

319-322: Cheerio-based DOM parsing robustly removes script/style — LGTM

Switching from regex to Cheerio parsing eliminates the multi-character sanitization gap and handles malformed/obfuscated tags more reliably. This should satisfy alert #186.

Potential fix for code scanning alert no. 186: Incomplete multi-chara…

d7a6463

…cter sanitization Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Jackson57279 marked this pull request as ready for review August 13, 2025 03:38

vercel bot deployed to Preview August 13, 2025 03:38 View deployment

coderabbitai bot reviewed Aug 13, 2025

Jackson57279 merged commit 786d901 into main Aug 13, 2025
12 checks passed

Jackson57279 deleted the alert-autofix-186 branch August 13, 2025 04:08
