Potential fix for code scanning alert no. 186: Incomplete multi-character sanitization #62

Merged
Jackson57279 merged 1 commit into main from alert-autofix-186 on Aug 13, 2025
Conversation

Jackson57279 (Owner) commented Aug 13, 2025

Potential fix for https://github.com/otdoges/zapdev/security/code-scanning/186

The best way to fix this problem is to use a proper HTML parser to remove <script> and <style> elements and their content, rather than relying on regular expressions. Since cheerio is already imported in the file, we can use it to parse the HTML, remove all <script> and <style> elements, and then extract the text content. This approach is robust against malformed or obfuscated tags and does not rely on repeated regex replacements. The change should be made in the extractTextContent method (lines 318–332), replacing the current implementation with one that uses cheerio to remove the unwanted elements and extract the text.
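To illustrate why this class of alert is raised, here is a minimal sketch of how a single regex pass can leave a tag behind, and why repeated passes were the usual workaround. The helper names and the pattern are hypothetical and are not taken from the original code in src/lib/search-service.ts.

```typescript
// Hypothetical pattern and helpers, for illustration only.
const SCRIPT_TAG = /<\/?script>/gi;

// Single pass: removing an inner match can splice a new tag together.
function stripOnce(input: string): string {
  return input.replace(SCRIPT_TAG, "");
}

// Repeated passes until the string stops changing.
function stripUntilStable(input: string): string {
  let previous: string;
  let current = input;
  do {
    previous = current;
    current = current.replace(SCRIPT_TAG, "");
  } while (current !== previous);
  return current;
}

const payload = "<scr<script>ipt>alert(1)</scr</script>ipt>";
console.log(stripOnce(payload));        // <script>alert(1)</script>
console.log(stripUntilStable(payload)); // alert(1)
```

Even the looped version only covers the exact spellings the pattern anticipates, which is the motivation for parsing the markup instead.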


Suggested fixes powered by Copilot Autofix. Review carefully before merging.

Summary by CodeRabbit

  • New Features
    • Cleaner text displayed from HTML sources by excluding script/style content and normalizing whitespace.
  • Refactor
    • Reworked HTML text extraction to use a DOM-based parsing approach for more consistent results.
    • Simplified internal logic by removing iterative string-based cleanup.
    • No changes to public APIs.

…cter sanitization

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
vercel bot commented Aug 13, 2025

The latest updates on your projects.

Project | Deployment | Preview | Comments | Updated (UTC)
zapdev | Ready | Preview | Comment | Aug 13, 2025 3:38am

coderabbitai bot (Contributor) commented Aug 13, 2025

Walkthrough

Replaced regex-based HTML text extraction in BraveSearchService.extractTextContent with a Cheerio-based DOM parsing approach that removes script/style nodes before extracting and normalizing text. Public APIs and other methods remain unchanged.
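The normalization step at the end of that pipeline can be sketched on its own. The function name `normalizeWhitespace` is an assumption made for illustration; the merged code applies the same two calls inline.

```typescript
// Hypothetical helper: collapse every run of whitespace to one space,
// then drop leading and trailing whitespace.
function normalizeWhitespace(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

const raw = "\n  Hello\t\tworld \n\n  again  ";
console.log(normalizeWhitespace(raw)); // Hello world again
```

Because text nodes on either side of a removed element keep their own spacing, this step is what prevents double spaces where a script or style block used to be.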

Changes

Cohort / File(s) | Summary
Search text extraction refactor (src/lib/search-service.ts) | Switch from iterative regex/loops to Cheerio: load HTML, remove script/style, read root text, collapse whitespace, trim. Removed previous do-while cleanup and global tag-removal regex. No API signature changes.

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant BraveSearchService
  participant Cheerio

  Caller->>BraveSearchService: extractTextContent(html)
  BraveSearchService->>Cheerio: load(html)
  Cheerio-->>BraveSearchService: DOM
  BraveSearchService->>Cheerio: remove(script, style)
  BraveSearchService->>Cheerio: root().text()
  Cheerio-->>BraveSearchService: textContent
  BraveSearchService-->>Caller: collapsed/trimmed text

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

A bunny hops through tags and trees,
Snips scripts and styles with gentle ease.
From tangled markup, clear words bloom,
Cheerio whispers, clearing gloom.
One tidy hop, the text is free—
Clean content for you and me. 🐇✨


@Jackson57279 Jackson57279 marked this pull request as ready for review August 13, 2025 03:38
claude bot commented Aug 13, 2025

Claude finished @otdoges's task.


Code Review Complete ✅

Todo List:

  • Read and analyze the modified file
  • Review the security fix implementation
  • Check for potential bugs or edge cases
  • Evaluate performance considerations
  • Assess test coverage needs
  • Provide comprehensive feedback

Overall Assessment: Excellent Security Fix 🔒

This PR successfully addresses CodeQL security alert #186 by replacing a vulnerable regex-based approach with a robust HTML parser solution. The change is well-implemented and significantly improves security.

Code Quality & Best Practices ✅

Strengths:

  • Proper tool usage: Leverages existing cheerio dependency instead of adding new dependencies
  • Clean implementation: Reduces code complexity from 15 lines to 4 lines
  • Consistent style: Follows existing codebase patterns and TypeScript conventions
  • Good documentation: Clear comment explaining the security motivation

Implementation Quality:
The new implementation at src/lib/search-service.ts:318-323 is clean and idiomatic:

private extractTextContent(html: string): string {
  // Use cheerio to robustly remove <script> and <style> tags and extract text
  const $ = cheerio.load(html);
  $('script, style').remove();
  return $.root().text().replace(/\s+/g, ' ').trim();
}

Security Analysis 🛡️

Major Security Improvement:

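  • Eliminates incomplete sanitization: The previous regex approach was vulnerable to malformed/obfuscated tags
  • Prevents bypass attacks: The HTML parser tokenizes markup according to the HTML specification, so variants such as <SCRIPT> or </script > are still recognized as script elements. A malformed tag like <script<!--comment-->type="text/javascript"> is parsed as an unknown element rather than a script, so the text that follows it is kept as plain text
  • Robust against variations: Handles self-closing tags, nested elements, and malformed HTML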

Security Context:
This method is used in analyzeWebsite() (line 237) to extract text content for analysis, making proper sanitization critical for preventing XSS in downstream processing.

Performance Considerations ⚡

Performance Impact:

  • Trade-off justified: HTML parsing is slightly slower than regex, but the security benefit outweighs the minimal performance cost
  • Efficient approach: cheerio.load() is optimized for server-side HTML parsing
  • Usage context: Called once per website analysis with 15-second timeout, so performance impact is negligible

Memory: The new approach avoids the repeated intermediate string copies made by the replacement loop, although it builds a DOM tree for the document instead.

Potential Issues & Edge Cases 🔍

None identified - The implementation handles edge cases well:

  • Malformed HTML: Cheerio's parser handles broken/incomplete tags gracefully
  • Nested elements: Properly removes script/style tags regardless of nesting
  • Large documents: Memory usage is reasonable for typical web content
  • Empty input: Gracefully handles empty or null-like HTML

Test Coverage Assessment 📋

Current state: No tests visible for this specific method in the codebase.

Recommendations for follow-up:

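// Suggested test cases (for future PR).
// Assumptions: the import path and a no-argument constructor; the cast is one
// way to reach the private method from a test.
import { BraveSearchService } from './search-service';

describe('extractTextContent', () => {
  const service = new BraveSearchService() as unknown as {
    extractTextContent(html: string): string;
  };

  it('should remove script tags and content', () => {
    const html = '<div>Hello <script>alert("xss")</script> World</div>';
    // Whitespace is collapsed, so a single space remains between the words.
    expect(service.extractTextContent(html)).toBe('Hello World');
  });

  it('should handle script tags with unusual casing and spacing', () => {
    const html = '<div>Text <SCRIPT type="text/javascript">evil()</script > More</div>';
    expect(service.extractTextContent(html)).toBe('Text More');
  });

  it('should remove style tags and content', () => {
    const html = '<div>Content <style>body{color:red}</style> Text</div>';
    expect(service.extractTextContent(html)).toBe('Content Text');
  });
});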

Final Verdict ✅

Approve with confidence - This is a high-quality security fix that:

  • ✅ Completely resolves the CodeQL security alert
  • ✅ Improves code maintainability and readability
  • ✅ Uses appropriate tools and follows best practices
  • ✅ Has minimal performance impact for significant security benefit

No blocking issues found. The implementation is ready for merge.


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
src/lib/search-service.ts (2)

319-322: Prefer DOMParser in browsers, fallback to Cheerio to cut bundle size

This module appears to run client-side; Cheerio is heavy for the browser. Consider using DOMParser when available and falling back to Cheerio (Node/SSR). Also exclude noscript/template and extract from body to avoid bringing in head/title duplicates, and normalize NBSP.

Apply this diff:

-    // Use cheerio to robustly remove <script> and <style> tags and extract text
-    const $ = cheerio.load(html);
-    $('script, style').remove();
-    return $.root().text().replace(/\s+/g, ' ').trim();
+    // Prefer DOMParser in browsers; fallback to Cheerio (Node/SSR)
+    try {
+      if (typeof window !== 'undefined' && typeof DOMParser !== 'undefined') {
+        const parser = new DOMParser();
+        const doc = parser.parseFromString(html, 'text/html');
+        doc.querySelectorAll('script, style, noscript, template').forEach((n) => n.remove());
+        const text =
+          (doc.body && doc.body.textContent) ||
+          (doc.documentElement && doc.documentElement.textContent) ||
+          '';
+        return text.replace(/\u00A0/g, ' ').replace(/\s+/g, ' ').trim();
+      }
+    } catch {
+      // Fall through to Cheerio
+    }
+    const $ = cheerio.load(html);
+    $('script, style, noscript, template').remove();
+    const text = $('body').length ? $('body').text() : $.root().text();
+    return text.replace(/\u00A0/g, ' ').replace(/\s+/g, ' ').trim();

319-322: If keeping Cheerio-only: remove noscript/template, use body text, and normalize NBSP

Minor hardening without changing approach:

  • Remove noscript/template (often contain non-UX text).
  • Extract from body to avoid duplicating head/title text.
  • Normalize NBSP before whitespace collapse.
-    const $ = cheerio.load(html);
-    $('script, style').remove();
-    return $.root().text().replace(/\s+/g, ' ').trim();
+    const $ = cheerio.load(html);
+    $('script, style, noscript, template').remove();
+    const text = $('body').length ? $('body').text() : $.root().text();
+    return text.replace(/\u00A0/g, ' ').replace(/\s+/g, ' ').trim();
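One note on the NBSP step in both diffs: in JavaScript and TypeScript, the `\s` character class already matches U+00A0, and `trim()` removes it as well, so the explicit `\u00A0` replacement should not change the result. A small sketch to check this (the sample string is made up for illustration):

```typescript
const NBSP = "\u00A0";
const sample = `${NBSP}price:${NBSP}${NBSP}42 USD${NBSP}`;

// With the explicit NBSP replacement, as in the suggested diff.
const withExplicitStep = sample
  .replace(/\u00A0/g, " ")
  .replace(/\s+/g, " ")
  .trim();

// Without it, relying on \s and trim() alone.
const withoutExplicitStep = sample.replace(/\s+/g, " ").trim();

console.log(/\s/.test(NBSP));                          // true
console.log(withExplicitStep === withoutExplicitStep); // true
console.log(withoutExplicitStep);                      // price: 42 USD
```

Keeping the explicit step is harmless and may still be worth it for readability.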
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2cf4d54 and d7a6463.

📒 Files selected for processing (1)
  • src/lib/search-service.ts (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{ts,tsx}

📄 CodeRabbit Inference Engine (.cursor/rules/convex-security.mdc)

**/*.{ts,tsx}: All Convex queries and mutations MUST use proper authentication. Never accept user IDs from client parameters.
Always verify user owns the data before allowing access.
Use the authenticated user's identity.subject for user references.
Implement proper error messages that don't leak information.
Authentication verification in every function.
Authorization checks for data ownership.
Input validation and sanitization.
Error handling without information leakage.

**/*.{ts,tsx}: Use Sonner for toast notifications to provide consistent user feedback, including success, error, and loading states.
Always handle errors gracefully using try-catch blocks in asynchronous functions, providing user feedback and logging errors.
Provide specific, actionable error messages for form validation errors using toast notifications.
Handle common network error scenarios in catch blocks, providing appropriate toast messages for network errors, authentication errors, and unexpected errors.

If using TypeScript, use an enum to store flag names.

Strict TypeScript must be used with no 'any' types allowed

**/*.{ts,tsx}: NEVER use any type - use proper TypeScript types
Use unknown for truly unknown data types
Implement proper interface definitions
Do not use empty interfaces; use a type alias instead (e.g., type InputProps = ... instead of interface InputProps {})
All function parameters must be typed
All return types should be explicit for public APIs
Use proper generic constraints
Implement discriminated unions for state management
Use proper interface definitions for error handling types (e.g., interface ValidationResult { isValid: boolean; error?: string; })

**/*.{ts,tsx}: Always sanitize user input before storing or displaying using a sanitization function like sanitizeText.
Implement comprehensive input validation, including length checks and detection of malicious patterns, as shown in the validateInput function.
Define and use security constants suc...

Files:

  • src/lib/search-service.ts
**/*.{js,jsx,ts,tsx}

📄 CodeRabbit Inference Engine (.cursor/rules/posthog-integration.mdc)

**/*.{js,jsx,ts,tsx}: Use a consistent naming convention for this storage. enum/const object members should be written UPPERCASE_WITH_UNDERSCORE.
If a custom property for a person or event is at any point referenced in two or more files or two or more callsites in the same file, use an enum or const object, as above in feature flags.

Files:

  • src/lib/search-service.ts
🔇 Additional comments (1)
src/lib/search-service.ts (1)

319-322: Cheerio-based DOM parsing robustly removes script/style — LGTM

Switching from regex to Cheerio parsing eliminates the multi-character sanitization gap and handles malformed/obfuscated tags more reliably. This should satisfy alert #186.

@Jackson57279 Jackson57279 merged commit 786d901 into main Aug 13, 2025
12 checks passed
@Jackson57279 Jackson57279 deleted the alert-autofix-186 branch August 13, 2025 04:08