
Potential fix for code scanning alert no. 122: Incomplete multi-character sanitization #54

Merged
Jackson57279 merged 2 commits into main from alert-autofix-155
Aug 12, 2025

Conversation

@Jackson57279
Owner

@Jackson57279 Jackson57279 commented Aug 12, 2025

Potential fix for https://github.com/otdoges/zapdev/security/code-scanning/122

The best way to fix this problem is to use a well-tested HTML sanitization library, such as sanitize-html, to reliably remove all <script> and <style> tags and their content, as well as any other potentially dangerous HTML. However, if adding a dependency is not an option, a safer alternative is to repeatedly apply the regular expression replacements for <script> and <style> tags until no more matches are found. This ensures that all instances, including those that may appear after previous replacements, are removed.

To implement this fix in src/lib/search-service.ts, update the extractTextContent method (lines 318-325) to repeatedly remove <script> and <style> tags and their content until none remain, before proceeding to strip all other tags and whitespace. No new imports are needed for the repeated replacement approach.
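To illustrate why a single replacement pass is incomplete, here is a minimal sketch that compares one pass with the repeated approach described above. The function names `stripOnce` and `stripUntilStable` are hypothetical and not taken from the repository, and the pattern uses `[\s\S]` in place of the `s` flag as an equivalent way to match across newlines.

```typescript
// Pattern for a <script> element and its content; [\s\S] also matches newlines.
const SCRIPT_RE = /<script[^>]*>[\s\S]*?<\/script>/gi;

// One replacement pass: removes only the matches that exist right now.
function stripOnce(html: string): string {
  return html.replace(SCRIPT_RE, '');
}

// Repeats the pass until the string stops changing, so tags that only
// appear after an earlier removal are caught as well.
function stripUntilStable(html: string): string {
  let current = html;
  let previous: string;
  do {
    previous = current;
    current = stripOnce(current);
  } while (current !== previous);
  return current;
}

// Removing the inner <script></script> pair joins "<scr" and "ipt>".
const payload = '<scr<script></script>ipt>alert(1)</script>';
console.log(stripOnce(payload));        // "<script>alert(1)</script>"
console.log(stripUntilStable(payload)); // ""
```

The single pass leaves a complete `<script>` element behind, which is the behavior the code scanning alert describes; the loop removes it on the second pass.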


Suggested fixes powered by Copilot Autofix. Review carefully before merging.

Summary by CodeRabbit

  • Bug Fixes

    • Improved search result text extraction for cleaner, safer snippets by robustly removing scripts, styles, and stray HTML.
    • Reduced noise and formatting artifacts in search previews for better readability.
  • Chores

    • Added an HTML sanitization library as a runtime dependency to support safer content processing.
    • Updated dependency formatting in configuration to accommodate the new library.

…cter sanitization

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
@vercel

vercel bot commented Aug 12, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| zapdev | Failed | | | Aug 12, 2025 6:50pm |

@coderabbitai
Contributor

coderabbitai bot commented Aug 12, 2025

Warning

Rate limit exceeded

@otdoges has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 16 minutes and 33 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 48cd5d9 and 56fd959.

📒 Files selected for processing (2)
  • package.json (1 hunks)
  • src/lib/search-service.ts (2 hunks)

Walkthrough

Introduces sanitize-html as a new dependency and updates src/lib/search-service.ts to use it for HTML text extraction, replacing prior regex-based stripping. No public APIs are changed; only internal sanitization logic is updated.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Dependencies: package.json | Added dependency sanitize-html@^2.17.0. Adjusted formatting to include trailing comma after zod. No other dependency changes. |
| Search Service Sanitization: src/lib/search-service.ts | Imported sanitize-html. Reworked extractTextContent(html: string) to sanitize HTML (discarding disallowed tags, excluding script/style and their content) and then strip remaining tags and normalize whitespace. No public API changes. |

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant SearchService
  participant SanitizeHtml

  Caller->>SearchService: extractTextContent(html)
  SearchService->>SanitizeHtml: sanitize(html, options)
  SanitizeHtml-->>SearchService: sanitizedHtml
  SearchService->>SearchService: strip tags, collapse whitespace, trim
  SearchService-->>Caller: textContent

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

In fields of tags I twitch my nose,
Scripts and styles—poof! There it goes.
With sanitized hops through HTML dew,
I gather clean words, crisp and new.
Thump-thump, commit—dependency set,
A tidy burrow of text to net. 🐇✨

@claude

claude bot commented Aug 12, 2025

Claude finished @otdoges's task —— View job


Code Review: Security Fix for Incomplete Multi-Character Sanitization

Todo List:

  • Read and analyze the modified search-service.ts file
  • Review the extractTextContent method implementation
  • Evaluate security fix effectiveness
  • Check for potential edge cases and improvements
  • Assess code quality and best practices
  • Provide comprehensive feedback

Review Summary

This PR addresses a legitimate security vulnerability (incomplete multi-character sanitization) by implementing a loop-based approach to repeatedly remove <script> and <style> tags until no more instances are found.

Positive Aspects

  1. Security Fix is Effective: The iterative approach correctly handles nested or overlapping script/style tags that could bypass single-pass regex replacement
  2. Clean Implementation: The loop-based solution is readable and follows the suggested fix pattern
  3. Maintains Existing Functionality: Other HTML stripping and whitespace normalization remains unchanged
  4. Added Regex Flags: The s flag (dotall) ensures multiline script/style content is properly matched

🔍 Code Quality Assessment

File: src/lib/search-service.ts:318-332

The modified extractTextContent method:

private extractTextContent(html: string): string {
  // Repeatedly remove <script> and <style> tags and their content
  let sanitized = html;
  let previous;
  do {
    previous = sanitized;
    sanitized = sanitized
      .replace(/<script[^>]*>.*?<\/script>/gis, '')
      .replace(/<style[^>]*>.*?<\/style>/gis, '');
  } while (sanitized !== previous);
  return sanitized
    .replace(/<[^>]*>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

🛡️ Security Analysis

  • Vulnerability Addressed: ✅ Fixes incomplete sanitization that could allow script injection through nested tags
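  • Defense Depth: ✅ Uses iterative removal to handle inputs like <scr<script></script>ipt>alert('xss')</script>, where a single pass removes the inner <script></script> pair and leaves a new <script> tag behind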
  • Input Validation: ✅ Existing input validation in validateSearchQuery provides additional protection
  • XSS Prevention: ✅ Combined with existing sanitization, provides robust protection

Performance Considerations

  • Efficiency: The loop has potential for O(n²) complexity in worst case, but:
    • Input is limited to website HTML content (reasonable size)
    • Loop terminates quickly in normal cases (1-2 iterations max)
    • Performance impact is negligible for typical use cases
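As a rough check of the iteration count, the following sketch builds a deliberately nested input and counts how many passes actually change the string. The helpers `nestedOpenTag` and `countPasses` are hypothetical names introduced for illustration, and only the `<script>` pattern is exercised.

```typescript
const TAG_RE = /<script[^>]*>[\s\S]*?<\/script>/gi;

// Builds an opening tag that only becomes "<script>" after `depth` removals,
// e.g. depth 1 gives "<scr<script></script>ipt>".
function nestedOpenTag(depth: number): string {
  return depth === 0
    ? '<script>'
    : `<scr${nestedOpenTag(depth - 1)}</script>ipt>`;
}

// Counts the passes that modify the input before it becomes stable.
function countPasses(html: string): { passes: number; result: string } {
  let current = html;
  let passes = 0;
  for (;;) {
    const next = current.replace(TAG_RE, '');
    if (next === current) {
      return { passes, result: current };
    }
    current = next;
    passes++;
  }
}

for (const depth of [0, 1, 3]) {
  const input = `${nestedOpenTag(depth)}alert(1)</script>`;
  console.log(depth, countPasses(input).passes);
}
```

For this construction each nesting level costs one extra pass, so ordinary pages settle after a single pass while an adversarial input can force as many passes as it has nesting levels. Each pass rescans the whole string, which is where the quadratic worst case comes from.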

🔧 Minor Suggestions

  1. Consider Adding JSDoc: Document the security purpose of this method

    /**
     * Safely extracts text content from HTML by iteratively removing script/style tags
     * to prevent incomplete sanitization vulnerabilities.
     */
  2. Optional: Add Loop Guard (defensive programming):

    let iterations = 0;
    const MAX_ITERATIONS = 10; // Safety guard
    do {
      // ... existing logic
      iterations++;
    } while (sanitized !== previous && iterations < MAX_ITERATIONS);
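One caveat with a loop guard is that stopping early can return a string that still contains a removable tag. A minimal sketch of a guard that fails closed in that case is shown below; `stripBlocks`, `removeScriptAndStyle`, and the `maxIterations` parameter are hypothetical names, and returning an empty string when the cap is hit is an assumed policy, not something the PR specifies.

```typescript
// One pass over both patterns, mirroring the two chained replacements.
function stripBlocks(html: string): string {
  return html
    .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
    .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, '');
}

// Bounded version of the loop. If the cap is reached while the string is
// still changing, return '' instead of partially sanitized markup.
function removeScriptAndStyle(html: string, maxIterations = 10): string {
  let sanitized = html;
  for (let i = 0; i < maxIterations; i++) {
    const next = stripBlocks(sanitized);
    if (next === sanitized) {
      return sanitized; // stable: nothing left to remove
    }
    sanitized = next;
  }
  // Cap reached: accept the result only if one more pass changes nothing.
  return stripBlocks(sanitized) === sanitized ? sanitized : '';
}

const nested = '<scr<script></script>ipt>alert(1)</script>safe';
console.log(removeScriptAndStyle(nested));    // "safe"
console.log(removeScriptAndStyle(nested, 1)); // "" (cap hit, fails closed)
```

Dropping the content is a conservative choice for search snippets; a caller that prefers to surface the problem could throw instead.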

🧪 Test Coverage Recommendation

Consider adding unit tests for edge cases:

// Test cases to add:
- Nested script tags: `<scr<script></script>ipt>alert('xss')</script>`
- Mixed case: `<SCRIPT>alert('xss')</SCRIPT>`
- Malformed tags: `<script>alert('xss')<script>`
- Multiple iterations needed
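The cases above could be sketched as a small table-driven check. The standalone `extractTextContent` below is a copy of the loop-based logic for illustration only (the real method is private to the search service), it uses `[\s\S]` instead of the `s` flag, and `runCases` is a hypothetical helper rather than part of any test framework.

```typescript
// Standalone copy of the loop-based extraction, for testing purposes only.
function extractTextContent(html: string): string {
  let sanitized = html;
  let previous: string;
  do {
    previous = sanitized;
    sanitized = sanitized
      .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
      .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, '');
  } while (sanitized !== previous);
  return sanitized.replace(/<[^>]*>/g, ' ').replace(/\s+/g, ' ').trim();
}

interface Case {
  name: string;
  input: string;
  expected: string;
}

const cases: Case[] = [
  {
    name: 'tags that recombine after one pass',
    input: "<scr<script></script>ipt>alert('xss')</script><p>Hello</p>",
    expected: 'Hello',
  },
  { name: 'mixed case', input: "<SCRIPT>alert('xss')</SCRIPT>Hello", expected: 'Hello' },
  {
    name: 'multi-line style block',
    input: '<style>\nbody { color: red }\n</style><p>Hi</p>',
    expected: 'Hi',
  },
  {
    // No closing tag, so the script pattern never matches; the tags are
    // stripped and the script text survives as plain text.
    name: 'unclosed script tag',
    input: "<script>alert('xss')<script>",
    expected: "alert('xss')",
  },
];

// Returns the names of failing cases; an empty array means all passed.
function runCases(): string[] {
  return cases
    .filter((c) => extractTextContent(c.input) !== c.expected)
    .map((c) => c.name);
}

console.log(runCases());
```

The unclosed-tag case is worth keeping as a regression test: the regex approach leaves the script body in the snippet as inert text, which may or may not be the desired behavior for search previews.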

📊 Overall Assessment

| Aspect | Rating | Notes |
| --- | --- | --- |
| Security Fix | ✅ Excellent | Properly addresses the vulnerability |
| Code Quality | ✅ Good | Clean, readable implementation |
| Performance | ✅ Acceptable | Minor overhead, negligible impact |
| Maintainability | ✅ Good | Clear intent and simple logic |

Recommendation: Approve

This is a solid security fix that effectively addresses the scanning alert. The implementation is clean, secure, and maintains existing functionality while fixing the vulnerability.

The iterative approach is the correct solution for handling incomplete multi-character sanitization, and the code quality is good. The minor suggestions above are optional improvements but not blockers.

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
@Jackson57279 Jackson57279 marked this pull request as ready for review August 12, 2025 18:48
@claude

claude bot commented Aug 12, 2025

Claude finished @otdoges's task —— View job


Code Review: Security Fix for Incomplete Multi-Character Sanitization

Todo List:

  • Read and analyze the modified search-service.ts file
  • Review the extractTextContent method implementation
  • Evaluate security fix effectiveness
  • Check for potential edge cases and improvements
  • Assess code quality and best practices
  • Provide comprehensive feedback

🎯 Review Summary

This PR addresses a legitimate security vulnerability by implementing a superior solution using the sanitize-html library instead of regex-based approaches.

Excellent Security Fix

File: src/lib/search-service.ts:319-333

private extractTextContent(html: string): string {
  const sanitized = sanitizeHtml(html, {
    allowedTags: false,
    disallowedTagsMode: 'discard', 
    exclusiveFilter: (frame) => {
      return frame.tag === 'script' || frame.tag === 'style';
    }
  });
  return sanitized
    .replace(/<[^>]*>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

Security Benefits:

  • ✅ Uses production-grade sanitize-html@2.17.0 library
  • ✅ Eliminates incomplete multi-character sanitization vulnerability
  • ✅ Handles all edge cases (nested tags, malformed HTML, case variations)
  • ✅ More robust than regex-based approaches
  • ✅ Properly integrated with existing input validation

🔧 Minor Optimization Suggestion

The current configuration has some redundancy. Consider simplifying:

private extractTextContent(html: string): string {
  const sanitized = sanitizeHtml(html, {
    allowedTags: [], // Discard every tag (false would allow all tags)
    allowedAttributes: {},
    disallowedTagsMode: 'discard'
    // exclusiveFilter not needed since we're removing all tags anyway
  });
  return sanitized
    .replace(/\s+/g, ' ')
    .trim();
}

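Rationale: With allowedTags: [], sanitize-html discards every tag itself and, by default, drops the contents of script and style elements, so the exclusiveFilter and the final tag-stripping regex become redundant. Note that allowedTags: false, as used in the merged code, allows all tags rather than removing them, which is why that version still needs both. One trade-off: without the tag-stripping regex no space is inserted where a tag was, so text from adjacent elements may run together.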

🚨 GitHub Security Alerts

The security scanner is still flagging issues, but these appear to be false positives given the new implementation:

  1. "Incomplete multi-character sanitization" - No longer applicable with sanitize-html
  2. "Bad HTML filtering regexp" - The remaining regexes only strip tags from already-sanitized output and normalize whitespace; they are no longer the security boundary

📊 Code Quality Assessment

| Aspect | Rating | Notes |
| --- | --- | --- |
| Security | ✅ Excellent | Production-grade sanitization |
| Performance | ✅ Good | Library overhead is minimal |
| Maintainability | ✅ Excellent | Clear, well-documented approach |
| Best Practices | ✅ Good | Follows security-first principles |

🧪 Test Coverage Recommendations

Consider adding unit tests for:

  • Nested script tags: <scr<script>ipt>alert('xss')</script>
  • Style tag variations: <STYLE>body{color:red}</STYLE>
  • Malformed HTML: <script>alert('xss')<script>
  • Performance with large HTML documents

Final Recommendation: APPROVE

This is an excellent security fix that properly addresses the vulnerability using industry best practices. The sanitize-html approach is significantly more secure and robust than regex-based solutions.

Dependencies: The sanitize-html@2.17.0 dependency was properly added to package.json.

The minor optimization suggestion is optional - the current implementation is secure and functional.


@Jackson57279 Jackson57279 merged commit 86bd788 into main Aug 12, 2025
9 of 11 checks passed
@Jackson57279 Jackson57279 deleted the alert-autofix-155 branch August 12, 2025 18:51
@cursor

cursor bot commented Aug 12, 2025

Bugbot found 1 bug

To see it, activate your membership in the Cursor dashboard.

Jackson57279 pushed a commit that referenced this pull request Sep 11, 2025