Skip to content

Potential fix for code scanning alert no. 122: Incomplete multi-character sanitization#53

Merged
Jackson57279 merged 1 commit intomainfrom
alert-autofix-122
Aug 12, 2025
Merged

Potential fix for code scanning alert no. 122: Incomplete multi-character sanitization#53
Jackson57279 merged 1 commit intomainfrom
alert-autofix-122

Conversation

@Jackson57279
Copy link
Owner

@Jackson57279 Jackson57279 commented Aug 12, 2025

Potential fix for https://github.com/otdoges/zapdev/security/code-scanning/122

The best way to fix this problem is to use a well-tested HTML sanitization library, such as sanitize-html, to reliably remove all <script> and <style> tags and their content, as well as any other potentially dangerous HTML. However, if adding a dependency is not an option, a safer alternative is to repeatedly apply the regular expression replacements for <script> and <style> tags until no more matches are found. This ensures that all instances, including those that may appear after previous replacements, are removed.

To implement this fix in src/lib/search-service.ts, update the extractTextContent method (lines 318-325) to repeatedly remove <script> and <style> tags and their content until none remain, before proceeding to strip all other tags and whitespace. No new imports are needed for the repeated replacement approach.


Suggested fixes powered by Copilot Autofix. Review carefully before merging.

Summary by CodeRabbit

  • Bug Fixes
    • Improved text extraction from HTML to consistently remove script and style content, even across multi-line blocks.
    • Reduced noise in rendered text and previews by stripping remaining HTML tags and normalizing whitespace.
    • Enhances the clarity of content displayed to users and can improve relevance in features that rely on extracted text (e.g., previews or search).

…cter sanitization

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
@vercel
Copy link

vercel bot commented Aug 12, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Project Deployment Preview Comments Updated (UTC)
zapdev Failed Aug 12, 2025 6:43pm

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Aug 12, 2025

Walkthrough

Updated extractTextContent in src/lib/search-service.ts to iteratively remove multi-line <script> and <style> blocks using a /gis regex loop until no further matches, then strip remaining HTML tags and condense whitespace. Exported/public interfaces remain unchanged.

Changes

Cohort / File(s) Summary of Changes
Text extraction logic
src/lib/search-service.ts
Switch to iterative loop removing multi-line script/style blocks using /gis; continue until no changes. Subsequent HTML tag stripping and whitespace condensation unchanged. No exported/public signature changes.

Sequence Diagram(s)

sequenceDiagram
  actor Caller
  participant SearchService as search-service.ts
  Caller->>SearchService: extractTextContent(html)
  activate SearchService
  loop Until no more script/style removed
    SearchService->>SearchService: Remove <script>/<style> blocks (regex /gis)
  end
  SearchService->>SearchService: Strip remaining HTML tags
  SearchService->>SearchService: Condense whitespace
  SearchService-->>Caller: plainText
  deactivate SearchService
Loading

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

Poem

In pages where scripts like ivy twine,
I nibble the tags, line after line.
Loop and loop till silence stays,
Then sweep the crumbs in tidy arrays.
With whiskers bright and text made clean,
I hop through HTML, crisp and serene. 🐇✨

✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch alert-autofix-122

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@claude
Copy link

claude bot commented Aug 12, 2025

Claude finished @otdoges's task —— View job


PR Code Review

Reviewing the security fix for code scanning alert #122: Incomplete multi-character sanitization.

Todo List:

  • Read and analyze the changed file
  • Evaluate the security fix implementation
  • Check for potential bugs and edge cases
  • Assess code quality and best practices
  • Review performance considerations
  • Analyze test coverage needs
  • Provide comprehensive feedback

🔒 Security Analysis

✅ EXCELLENT FIX - This properly addresses the incomplete multi-character sanitization vulnerability.

The Problem

The original code had a classic security issue where nested HTML tags could bypass sanitization:

<script><script>alert('xss')</script></script>
<style><script>alert('xss')</script></style>

After the first regex pass, these would become:

<script>alert('xss')
alert('xss')

Leaving dangerous script tags intact.

The Solution

The fix uses a do-while loop to repeatedly apply the sanitization until no more changes occur, ensuring all nested tags are removed. This is a well-established pattern for this type of vulnerability.

📊 Code Quality Assessment

✅ Strengths

  • Correct implementation: The loop logic is sound
  • Clear intent: Good comment explaining the purpose
  • Proper regex flags: Uses gis (global, case-insensitive, dotall) correctly
  • Maintains existing functionality: Still strips all HTML tags and normalizes whitespace
  • No breaking changes: Same function signature and behavior

⚠️ Minor Considerations

  • Variable naming: previous could be more descriptive (e.g., previousState)
  • Performance: Multiple regex passes on large HTML, but acceptable for this use case

🚀 Performance Impact

Overall: Minimal concern

  • Worst case: Deeply nested malicious HTML requires multiple passes
  • Typical case: Clean HTML processes in 1-2 passes
  • Context: Function processes website content up to 2000 chars (line 237), so impact is bounded
  • Trade-off: Slight performance cost for significant security improvement

🧪 Test Coverage

❌ CRITICAL GAP: No existing tests found for extractTextContent or BraveSearchService

Recommended Test Cases:

describe('extractTextContent', () => {
  it('should handle nested script tags', () => {
    expect(extractTextContent('<script><script>alert("xss")</script></script>'))
      .not.toContain('<script>');
  });
  
  it('should handle mixed nested tags', () => {
    expect(extractTextContent('<style><script>evil</script></style>'))
      .not.toContain('<script>');
  });
  
  it('should preserve text content', () => {
    expect(extractTextContent('<p>Hello <script>bad</script> World</p>'))
      .toBe('Hello  World');
  });
});

🔍 Edge Cases Analysis

The fix handles these correctly:

  • ✅ Deeply nested tags: <script><script><script>...</script></script></script>
  • ✅ Mixed tag types: <style><script>...</script></style>
  • ✅ Self-closing variants: <script src="evil.js"/>
  • ✅ Case variations: <SCRIPT>, <Script>
  • ✅ Attribute variations: <script type="text/javascript">

📋 Final Recommendations

APPROVE - Ready to merge with confidence

  1. Security fix is correct and follows industry best practices
  2. No functional regressions expected
  3. Performance impact is acceptable for the security gain

🔄 Follow-up Actions (separate tasks):

  1. Add comprehensive tests for the extractTextContent method
  2. Consider adding tests for the entire BraveSearchService.analyzeWebsite flow
  3. Document the security consideration in code comments if desired

📝 Code Quality Score: 8.5/10

  • Excellent security fix implementation
  • Clean, readable code
  • Missing test coverage prevents perfect score

Great work on addressing this security vulnerability promptly and correctly! 🎯

Comment on lines +324 to +325
sanitized = sanitized
.replace(/<script[^>]*>.*?<\/script>/gis, '')

Check failure

Code scanning / CodeQL

Incomplete multi-character sanitization High

This string may still contain
<script
, which may cause an HTML element injection vulnerability.

Copilot Autofix

AI 6 months ago

To fix the incomplete multi-character sanitization, we should ensure that all script and style tags are fully removed, even if they are nested, malformed, or appear consecutively. The best way to do this without changing existing functionality is to repeatedly apply the regular expression replacements for <script> and <style> tags until no more matches are found. This can be done with a loop that continues replacing as long as the string changes. Alternatively, we could use a well-tested library like sanitize-html to remove all HTML tags, but since the code only extracts text content, a repeated replacement approach is sufficient and does not require new dependencies. The change should be made in the extractTextContent method in src/lib/search-service.ts, lines 318-325.


Suggested changeset 1
src/lib/search-service.ts

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/src/lib/search-service.ts b/src/lib/search-service.ts
--- a/src/lib/search-service.ts
+++ b/src/lib/search-service.ts
@@ -316,12 +316,22 @@
   }
 
   private extractTextContent(html: string): string {
-    return html
-      .replace(/<script[^>]*>.*?<\/script>/gi, '')
-      .replace(/<style[^>]*>.*?<\/style>/gi, '')
-      .replace(/<[^>]*>/g, ' ')
-      .replace(/\s+/g, ' ')
-      .trim();
+    let sanitized = html;
+    let previous;
+    // Remove all <script> tags and their content repeatedly
+    do {
+      previous = sanitized;
+      sanitized = sanitized.replace(/<script[^>]*>.*?<\/script>/gis, '');
+    } while (sanitized !== previous);
+    // Remove all <style> tags and their content repeatedly
+    do {
+      previous = sanitized;
+      sanitized = sanitized.replace(/<style[^>]*>.*?<\/style>/gis, '');
+    } while (sanitized !== previous);
+    // Remove all remaining HTML tags
+    sanitized = sanitized.replace(/<[^>]*>/g, ' ');
+    sanitized = sanitized.replace(/\s+/g, ' ').trim();
+    return sanitized;
   }
 }
 
EOF
@@ -316,12 +316,22 @@
}

private extractTextContent(html: string): string {
return html
.replace(/<script[^>]*>.*?<\/script>/gi, '')
.replace(/<style[^>]*>.*?<\/style>/gi, '')
.replace(/<[^>]*>/g, ' ')
.replace(/\s+/g, ' ')
.trim();
let sanitized = html;
let previous;
// Remove all <script> tags and their content repeatedly
do {
previous = sanitized;
sanitized = sanitized.replace(/<script[^>]*>.*?<\/script>/gis, '');
} while (sanitized !== previous);
// Remove all <style> tags and their content repeatedly
do {
previous = sanitized;
sanitized = sanitized.replace(/<style[^>]*>.*?<\/style>/gis, '');
} while (sanitized !== previous);
// Remove all remaining HTML tags
sanitized = sanitized.replace(/<[^>]*>/g, ' ');
sanitized = sanitized.replace(/\s+/g, ' ').trim();
return sanitized;
}
}

Copilot is powered by AI and may make mistakes. Always verify output.
Unable to commit as this autofix suggestion is now outdated
@Jackson57279 Jackson57279 marked this pull request as ready for review August 12, 2025 19:11
@Jackson57279 Jackson57279 merged commit 575849e into main Aug 12, 2025
11 of 14 checks passed
@Jackson57279 Jackson57279 deleted the alert-autofix-122 branch August 12, 2025 19:14
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
src/lib/search-service.ts (1)

325-327: Fix regex to match closing tags with optional whitespace and add word boundary.

The current regexes won’t match </script > or </style > (trailing whitespace before >), which CodeQL flagged (“Bad HTML filtering regexp”). Also add \b after tag names and prefer [\s\S] over the s flag for broader compatibility.

Apply this diff:

-        .replace(/<script[^>]*>.*?<\/script>/gis, '')
-        .replace(/<style[^>]*>.*?<\/style>/gis, '');
+        .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, '')
+        .replace(/<style\b[^>]*>[\s\S]*?<\/style\s*>/gi, '');
🧹 Nitpick comments (1)
src/lib/search-service.ts (1)

319-328: Guard the loop with a hard cap to prevent pathological cases (optional).

If you keep the iterative approach, add a max-iteration cap to avoid unexpected long runtimes on adversarial input.

Apply this diff:

-    do {
+    let iterations = 0;
+    const MAX_ITERATIONS = 10;
+    do {
       previous = sanitized;
       sanitized = sanitized
         .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, '')
         .replace(/<style\b[^>]*>[\s\S]*?<\/style\s*>/gi, '');
-    } while (sanitized !== previous);
+      iterations++;
+    } while (sanitized !== previous && iterations < MAX_ITERATIONS);
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 48cd5d9 and 6c2e156.

📒 Files selected for processing (1)
  • src/lib/search-service.ts (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{ts,tsx}

📄 CodeRabbit Inference Engine (.cursor/rules/convex-security.mdc)

**/*.{ts,tsx}: All Convex queries and mutations MUST use proper authentication. Never accept user IDs from client parameters.
Always verify user owns the data before allowing access.
Use the authenticated user's identity.subject for user references.
Implement proper error messages that don't leak information.
Authentication verification in every function.
Authorization checks for data ownership.
Input validation and sanitization.
Error handling without information leakage.

**/*.{ts,tsx}: Use Sonner for toast notifications to provide consistent user feedback, including success, error, and loading states.
Always handle errors gracefully using try-catch blocks in asynchronous functions, providing user feedback and logging errors.
Provide specific, actionable error messages for form validation errors using toast notifications.
Handle common network error scenarios in catch blocks, providing appropriate toast messages for network errors, authentication errors, and unexpected errors.

If using TypeScript, use an enum to store flag names.

Strict TypeScript must be used with no 'any' types allowed

**/*.{ts,tsx}: NEVER use any type - use proper TypeScript types
Use unknown for truly unknown data types
Implement proper interface definitions
Do not use empty interfaces; use a type alias instead (e.g., type InputProps = ... instead of interface InputProps {})
All function parameters must be typed
All return types should be explicit for public APIs
Use proper generic constraints
Implement discriminated unions for state management
Use proper interface definitions for error handling types (e.g., interface ValidationResult { isValid: boolean; error?: string; })

**/*.{ts,tsx}: Always sanitize user input before storing or displaying using a sanitization function like sanitizeText.
Implement comprehensive input validation, including length checks and detection of malicious patterns, as shown in the validateInput function.
Define and use security constants suc...

Files:

  • src/lib/search-service.ts
**/*.{js,jsx,ts,tsx}

📄 CodeRabbit Inference Engine (.cursor/rules/posthog-integration.mdc)

**/*.{js,jsx,ts,tsx}: Use a consistent naming convention for this storage. enum/const object members should be written UPPERCASE_WITH_UNDERSCORE.
If a custom property for a person or event is at any point referenced in two or more files or two or more callsites in the same file, use an enum or const object, as above in feature flags.

Files:

  • src/lib/search-service.ts
🧠 Learnings (1)
📓 Common learnings
Learnt from: CR
PR: otdoges/zapdev#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-12T03:04:41.034Z
Learning: Applies to src/**/*.tsx : Prevent XSS by sanitizing any user-generated content before rendering (avoid unsafe HTML, or sanitize it)
🪛 GitHub Check: CodeQL
src/lib/search-service.ts

[failure] 324-325: Incomplete multi-character sanitization
This string may still contain <script, which may cause an HTML element injection vulnerability.


[failure] 325-325: Bad HTML filtering regexp
This regular expression does not match script end tags like </script >.

Comment on lines +319 to 331
// Repeatedly remove <script> and <style> tags and their content
let sanitized = html;
let previous;
do {
previous = sanitized;
sanitized = sanitized
.replace(/<script[^>]*>.*?<\/script>/gis, '')
.replace(/<style[^>]*>.*?<\/style>/gis, '');
} while (sanitized !== previous);
return sanitized
.replace(/<[^>]*>/g, ' ')
.replace(/\s+/g, ' ')
.trim();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add targeted tests for tricky cases (nested, spaced closing tags, multiline).

To prevent regressions and to validate the CodeQL concerns, add unit tests covering:

  • </script > and </style > (space before >)
  • Multiline blocks and nested <script> inside <script>
  • Upper/lower case tag names
  • Malformed tags where the closing tag appears inside strings

I can draft a small test suite (Jest/Vitest) for extractTextContent with these cases if helpful.

🧰 Tools
🪛 GitHub Check: CodeQL

[failure] 324-325: Incomplete multi-character sanitization
This string may still contain <script, which may cause an HTML element injection vulnerability.


[failure] 325-325: Bad HTML filtering regexp
This regular expression does not match script end tags like </script >.

🤖 Prompt for AI Agents
In src/lib/search-service.ts around lines 319 to 331, the HTML sanitization loop
needs targeted unit tests to prevent regressions: add a test suite for
extractTextContent that asserts correct stripping for closing tags with spaces
(e.g., </script >, </style    >), multiline script/style blocks, nested script
tags (e.g., a <script> containing another <script>), mixed case tags (e.g.,
<ScRiPt> and </SCRIPT>), and malformed cases where closing tag text appears
inside string literals; implement tests using the project test framework (Jest
or Vitest) that feed these HTML strings into extractTextContent and assert the
returned plain text is correctly stripped and normalized (single spaces,
trimmed).

Comment on lines +319 to +328
// Repeatedly remove <script> and <style> tags and their content
let sanitized = html;
let previous;
do {
previous = sanitized;
sanitized = sanitized
.replace(/<script[^>]*>.*?<\/script>/gis, '')
.replace(/<style[^>]*>.*?<\/style>/gis, '');
} while (sanitized !== previous);
return sanitized
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

❓ Verification inconclusive

Prefer DOM parsing over regex for robust tag stripping (addresses CodeQL alert #122).

Regex-based HTML stripping is brittle (nested tags, malformed markup, edge whitespace). Since this runs in a browser context, use DOMParser to remove script/style safely and extract text. This eliminates the need for iterative regex and should silence both CodeQL alerts.

Proposed replacement within this method:

   private extractTextContent(html: string): string {
-    // Repeatedly remove <script> and <style> tags and their content
-    let sanitized = html;
-    let previous;
-    do {
-      previous = sanitized;
-      sanitized = sanitized
-        .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, '')
-        .replace(/<style\b[^>]*>[\s\S]*?<\/style\s*>/gi, '');
-    } while (sanitized !== previous);
-    return sanitized
-      .replace(/<[^>]*>/g, ' ')
-      .replace(/\s+/g, ' ')
-      .trim();
+    // Prefer DOM parsing for robust removal of script/style content
+    if (typeof DOMParser !== 'undefined') {
+      const parser = new DOMParser();
+      const doc = parser.parseFromString(html, 'text/html');
+      doc.querySelectorAll('script, style').forEach((n) => n.remove());
+      const text = doc.body?.textContent ?? '';
+      return text.replace(/\s+/g, ' ').trim();
+    }
+    // Fallback to regex if DOMParser is unavailable (e.g., non-browser env)
+    let sanitized = html
+      .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, '')
+      .replace(/<style\b[^>]*>[\s\S]*?<\/style\s*>/gi, '')
+      .replace(/<[^>]*>/g, ' ');
+    return sanitized.replace(/\s+/g, ' ').trim();
   }

Verification: after applying, please re-run CodeQL on this PR to confirm alerts 185 and 186 are resolved.


🏁 Script executed:

#!/bin/bash
echo "Checking for DOMParser references and runtime environment indicators..."
grep -R "DOMParser" -n src || echo "No existing DOMParser references in src/"
echo
echo "Checking for jsdom in dependencies..."
grep -R "\"jsdom\"" -n package.json || echo "No jsdom dependency in package.json"
echo
echo "Searching for browser globals usage..."
grep -R "document\." -n src || echo "No document references found in src/"
grep -R "window\." -n src || echo "No window references found in src/"
echo
echo "Locating extractTextContent usages..."
grep -R "extractTextContent" -n src || echo "No calls to extractTextContent found in src/"

Length of output: 6218


Use DOMParser for robust <script>/<style> stripping

Relying on iterative regex is brittle and can miss nested or malformed tags. Since this code runs in a browser (and under jsdom in tests), we can safely parse the HTML, remove unwanted nodes, and extract text. In non-browser environments, fall back to the original iterative regex logic.

In src/lib/search-service.ts, update extractTextContent as follows:

   private extractTextContent(html: string): string {
-    // Repeatedly remove <script> and <style> tags and their content
-    let sanitized = html;
-    let previous;
-    do {
-      previous = sanitized;
-      sanitized = sanitized
-        .replace(/<script[^>]*>.*?<\/script>/gis, '')
-        .replace(/<style[^>]*>.*?<\/style>/gis, '');
-    } while (sanitized !== previous);
-    return sanitized
-      .replace(/<[^>]*>/g, ' ')
-      .replace(/\s+/g, ' ')
-      .trim();
+    // Prefer DOM parsing for robust removal of script/style content
+    if (typeof DOMParser !== 'undefined') {
+      const doc = new DOMParser().parseFromString(html, 'text/html');
+      doc.querySelectorAll('script, style').forEach((n) => n.remove());
+      const text = doc.body?.textContent ?? '';
+      return text.replace(/\s+/g, ' ').trim();
+    }
+    // Fallback to iterative regex removal in non-browser environments
+    let sanitized = html;
+    let previous: string;
+    do {
+      previous = sanitized;
+      sanitized = sanitized
+        .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, '')
+        .replace(/<style\b[^>]*>[\s\S]*?<\/style\s*>/gi, '');
+    } while (sanitized !== previous);
+    sanitized = sanitized.replace(/<[^>]*>/g, ' ');
+    return sanitized.replace(/\s+/g, ' ').trim();
   }
  • Eliminates brittle regex loops in browsers/tests by using a real parser.
  • Preserves the original iterative stripping when DOMParser is unavailable.
  • No new dependencies required.

After merging, please re-run CodeQL to confirm alerts 185 & 186 are resolved.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Repeatedly remove <script> and <style> tags and their content
let sanitized = html;
let previous;
do {
previous = sanitized;
sanitized = sanitized
.replace(/<script[^>]*>.*?<\/script>/gis, '')
.replace(/<style[^>]*>.*?<\/style>/gis, '');
} while (sanitized !== previous);
return sanitized
private extractTextContent(html: string): string {
// Prefer DOM parsing for robust removal of script/style content
if (typeof DOMParser !== 'undefined') {
const doc = new DOMParser().parseFromString(html, 'text/html');
doc.querySelectorAll('script, style').forEach(n => n.remove());
const text = doc.body?.textContent ?? '';
return text.replace(/\s+/g, ' ').trim();
}
// Fallback to iterative regex removal in non-browser environments
let sanitized = html;
let previous: string;
do {
previous = sanitized;
sanitized = sanitized
.replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, '')
.replace(/<style\b[^>]*>[\s\S]*?<\/style\s*>/gi, '');
} while (sanitized !== previous);
sanitized = sanitized.replace(/<[^>]*>/g, ' ');
return sanitized.replace(/\s+/g, ' ').trim();
}
🧰 Tools
🪛 GitHub Check: CodeQL

[failure] 324-325: Incomplete multi-character sanitization
This string may still contain <script, which may cause an HTML element injection vulnerability.


[failure] 325-325: Bad HTML filtering regexp
This regular expression does not match script end tags like </script >.

🤖 Prompt for AI Agents
In src/lib/search-service.ts around lines 319 to 328, the current iterative
regex loop for stripping <script> and <style> is brittle; change
extractTextContent to use DOMParser when available: parse the HTML into a
document, remove all script and style elements (querySelectorAll and remove),
then return document.body ? document.body.textContent || '' :
document.textContent || ''; if DOMParser is undefined (non-browser or older
env), fall back to the existing do/while regex loop exactly as before so
behavior remains unchanged.

.replace(/<style[^>]*>.*?<\/style>/gi, '')
// Repeatedly remove <script> and <style> tags and their content
let sanitized = html;
let previous;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Implicit any: type previous to satisfy strict TS (noImplicitAny).

let previous; infers any, violating strict TS and our guidelines. Type it explicitly.

Apply this diff:

-    let previous;
+    let previous: string | null = null;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
let previous;
let previous: string | null = null;
🤖 Prompt for AI Agents
In src/lib/search-service.ts around line 321, the variable declared as "let
previous;" currently has an implicit any; determine the correct type from its
subsequent usages (e.g., string, number, boolean, or a specific interface/union)
and change the declaration to an explicit typed form such as "let previous:
YourType | undefined" (or the exact type without | undefined if always
initialized). Ensure imports/types are adjusted if you reference a custom type.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant