Potential fix for code scanning alert no. 122: Incomplete multi-character sanitization#53
Conversation
…cter sanitization Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
|
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
WalkthroughUpdated extractTextContent in src/lib/search-service.ts to iteratively remove multi-line <script> and <style> blocks using a /gis regex loop until no further matches, then strip remaining HTML tags and condense whitespace. Exported/public interfaces remain unchanged. Changes
Sequence Diagram(s)sequenceDiagram
actor Caller
participant SearchService as search-service.ts
Caller->>SearchService: extractTextContent(html)
activate SearchService
loop Until no more script/style removed
SearchService->>SearchService: Remove <script>/<style> blocks (regex /gis)
end
SearchService->>SearchService: Strip remaining HTML tags
SearchService->>SearchService: Condense whitespace
SearchService-->>Caller: plainText
deactivate SearchService
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes Possibly related PRs
Poem
✨ Finishing Touches
🧪 Generate unit tests
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
CodeRabbit Configuration File (
|
|
Claude finished @otdoges's task —— View job PR Code ReviewReviewing the security fix for code scanning alert #122: Incomplete multi-character sanitization. Todo List:
🔒 Security Analysis✅ EXCELLENT FIX - This properly addresses the incomplete multi-character sanitization vulnerability. The ProblemThe original code had a classic security issue where nested HTML tags could bypass sanitization: <script><script>alert('xss')</script></script>
<style><script>alert('xss')</script></style>After the first regex pass, these would become: <script>alert('xss')
alert('xss')Leaving dangerous script tags intact. The SolutionThe fix uses a do-while loop to repeatedly apply the sanitization until no more changes occur, ensuring all nested tags are removed. This is a well-established pattern for this type of vulnerability. 📊 Code Quality Assessment✅ Strengths
|
| sanitized = sanitized | ||
| .replace(/<script[^>]*>.*?<\/script>/gis, '') |
Check failure
Code scanning / CodeQL
Incomplete multi-character sanitization High
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 6 months ago
To fix the incomplete multi-character sanitization, we should ensure that all script and style tags are fully removed, even if they are nested, malformed, or appear consecutively. The best way to do this without changing existing functionality is to repeatedly apply the regular expression replacements for <script> and <style> tags until no more matches are found. This can be done with a loop that continues replacing as long as the string changes. Alternatively, we could use a well-tested library like sanitize-html to remove all HTML tags, but since the code only extracts text content, a repeated replacement approach is sufficient and does not require new dependencies. The change should be made in the extractTextContent method in src/lib/search-service.ts, lines 318-325.
| @@ -316,12 +316,22 @@ | ||
| } | ||
|
|
||
| private extractTextContent(html: string): string { | ||
| return html | ||
| .replace(/<script[^>]*>.*?<\/script>/gi, '') | ||
| .replace(/<style[^>]*>.*?<\/style>/gi, '') | ||
| .replace(/<[^>]*>/g, ' ') | ||
| .replace(/\s+/g, ' ') | ||
| .trim(); | ||
| let sanitized = html; | ||
| let previous; | ||
| // Remove all <script> tags and their content repeatedly | ||
| do { | ||
| previous = sanitized; | ||
| sanitized = sanitized.replace(/<script[^>]*>.*?<\/script>/gis, ''); | ||
| } while (sanitized !== previous); | ||
| // Remove all <style> tags and their content repeatedly | ||
| do { | ||
| previous = sanitized; | ||
| sanitized = sanitized.replace(/<style[^>]*>.*?<\/style>/gis, ''); | ||
| } while (sanitized !== previous); | ||
| // Remove all remaining HTML tags | ||
| sanitized = sanitized.replace(/<[^>]*>/g, ' '); | ||
| sanitized = sanitized.replace(/\s+/g, ' ').trim(); | ||
| return sanitized; | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (1)
src/lib/search-service.ts (1)
325-327: Fix regex to match closing tags with optional whitespace and add word boundary.The current regexes won’t match
</script >or</style >(trailing whitespace before>), which CodeQL flagged (“Bad HTML filtering regexp”). Also add\bafter tag names and prefer[\s\S]over thesflag for broader compatibility.Apply this diff:
- .replace(/<script[^>]*>.*?<\/script>/gis, '') - .replace(/<style[^>]*>.*?<\/style>/gis, ''); + .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, '') + .replace(/<style\b[^>]*>[\s\S]*?<\/style\s*>/gi, '');
🧹 Nitpick comments (1)
src/lib/search-service.ts (1)
319-328: Guard the loop with a hard cap to prevent pathological cases (optional).If you keep the iterative approach, add a max-iteration cap to avoid unexpected long runtimes on adversarial input.
Apply this diff:
- do { + let iterations = 0; + const MAX_ITERATIONS = 10; + do { previous = sanitized; sanitized = sanitized .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, '') .replace(/<style\b[^>]*>[\s\S]*?<\/style\s*>/gi, ''); - } while (sanitized !== previous); + iterations++; + } while (sanitized !== previous && iterations < MAX_ITERATIONS);
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/lib/search-service.ts(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{ts,tsx}
📄 CodeRabbit Inference Engine (.cursor/rules/convex-security.mdc)
**/*.{ts,tsx}: All Convex queries and mutations MUST use proper authentication. Never accept user IDs from client parameters.
Always verify user owns the data before allowing access.
Use the authenticated user's identity.subject for user references.
Implement proper error messages that don't leak information.
Authentication verification in every function.
Authorization checks for data ownership.
Input validation and sanitization.
Error handling without information leakage.
**/*.{ts,tsx}: Use Sonner for toast notifications to provide consistent user feedback, including success, error, and loading states.
Always handle errors gracefully using try-catch blocks in asynchronous functions, providing user feedback and logging errors.
Provide specific, actionable error messages for form validation errors using toast notifications.
Handle common network error scenarios in catch blocks, providing appropriate toast messages for network errors, authentication errors, and unexpected errors.If using TypeScript, use an enum to store flag names.
Strict TypeScript must be used with no 'any' types allowed
**/*.{ts,tsx}: NEVER useanytype - use proper TypeScript types
Useunknownfor truly unknown data types
Implement proper interface definitions
Do not use empty interfaces; use a type alias instead (e.g.,type InputProps = ...instead ofinterface InputProps {})
All function parameters must be typed
All return types should be explicit for public APIs
Use proper generic constraints
Implement discriminated unions for state management
Use proper interface definitions for error handling types (e.g.,interface ValidationResult { isValid: boolean; error?: string; })
**/*.{ts,tsx}: Always sanitize user input before storing or displaying using a sanitization function likesanitizeText.
Implement comprehensive input validation, including length checks and detection of malicious patterns, as shown in thevalidateInputfunction.
Define and use security constants suc...
Files:
src/lib/search-service.ts
**/*.{js,jsx,ts,tsx}
📄 CodeRabbit Inference Engine (.cursor/rules/posthog-integration.mdc)
**/*.{js,jsx,ts,tsx}: Use a consistent naming convention for this storage. enum/const object members should be written UPPERCASE_WITH_UNDERSCORE.
If a custom property for a person or event is at any point referenced in two or more files or two or more callsites in the same file, use an enum or const object, as above in feature flags.
Files:
src/lib/search-service.ts
🧠 Learnings (1)
📓 Common learnings
Learnt from: CR
PR: otdoges/zapdev#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-12T03:04:41.034Z
Learning: Applies to src/**/*.tsx : Prevent XSS by sanitizing any user-generated content before rendering (avoid unsafe HTML, or sanitize it)
🪛 GitHub Check: CodeQL
src/lib/search-service.ts
[failure] 324-325: Incomplete multi-character sanitization
This string may still contain <script, which may cause an HTML element injection vulnerability.
[failure] 325-325: Bad HTML filtering regexp
This regular expression does not match script end tags like </script >.
| // Repeatedly remove <script> and <style> tags and their content | ||
| let sanitized = html; | ||
| let previous; | ||
| do { | ||
| previous = sanitized; | ||
| sanitized = sanitized | ||
| .replace(/<script[^>]*>.*?<\/script>/gis, '') | ||
| .replace(/<style[^>]*>.*?<\/style>/gis, ''); | ||
| } while (sanitized !== previous); | ||
| return sanitized | ||
| .replace(/<[^>]*>/g, ' ') | ||
| .replace(/\s+/g, ' ') | ||
| .trim(); |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Add targeted tests for tricky cases (nested, spaced closing tags, multiline).
To prevent regressions and to validate the CodeQL concerns, add unit tests covering:
</script >and</style >(space before>)- Multiline blocks and nested
<script>inside<script> - Upper/lower case tag names
- Malformed tags where the closing tag appears inside strings
I can draft a small test suite (Jest/Vitest) for extractTextContent with these cases if helpful.
🧰 Tools
🪛 GitHub Check: CodeQL
[failure] 324-325: Incomplete multi-character sanitization
This string may still contain <script, which may cause an HTML element injection vulnerability.
[failure] 325-325: Bad HTML filtering regexp
This regular expression does not match script end tags like </script >.
🤖 Prompt for AI Agents
In src/lib/search-service.ts around lines 319 to 331, the HTML sanitization loop
needs targeted unit tests to prevent regressions: add a test suite for
extractTextContent that asserts correct stripping for closing tags with spaces
(e.g., </script >, </style >), multiline script/style blocks, nested script
tags (e.g., a <script> containing another <script>), mixed case tags (e.g.,
<ScRiPt> and </SCRIPT>), and malformed cases where closing tag text appears
inside string literals; implement tests using the project test framework (Jest
or Vitest) that feed these HTML strings into extractTextContent and assert the
returned plain text is correctly stripped and normalized (single spaces,
trimmed).
| // Repeatedly remove <script> and <style> tags and their content | ||
| let sanitized = html; | ||
| let previous; | ||
| do { | ||
| previous = sanitized; | ||
| sanitized = sanitized | ||
| .replace(/<script[^>]*>.*?<\/script>/gis, '') | ||
| .replace(/<style[^>]*>.*?<\/style>/gis, ''); | ||
| } while (sanitized !== previous); | ||
| return sanitized |
There was a problem hiding this comment.
🛠️ Refactor suggestion
❓ Verification inconclusive
Prefer DOM parsing over regex for robust tag stripping (addresses CodeQL alert #122).
Regex-based HTML stripping is brittle (nested tags, malformed markup, edge whitespace). Since this runs in a browser context, use DOMParser to remove script/style safely and extract text. This eliminates the need for iterative regex and should silence both CodeQL alerts.
Proposed replacement within this method:
private extractTextContent(html: string): string {
- // Repeatedly remove <script> and <style> tags and their content
- let sanitized = html;
- let previous;
- do {
- previous = sanitized;
- sanitized = sanitized
- .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, '')
- .replace(/<style\b[^>]*>[\s\S]*?<\/style\s*>/gi, '');
- } while (sanitized !== previous);
- return sanitized
- .replace(/<[^>]*>/g, ' ')
- .replace(/\s+/g, ' ')
- .trim();
+ // Prefer DOM parsing for robust removal of script/style content
+ if (typeof DOMParser !== 'undefined') {
+ const parser = new DOMParser();
+ const doc = parser.parseFromString(html, 'text/html');
+ doc.querySelectorAll('script, style').forEach((n) => n.remove());
+ const text = doc.body?.textContent ?? '';
+ return text.replace(/\s+/g, ' ').trim();
+ }
+ // Fallback to regex if DOMParser is unavailable (e.g., non-browser env)
+ let sanitized = html
+ .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, '')
+ .replace(/<style\b[^>]*>[\s\S]*?<\/style\s*>/gi, '')
+ .replace(/<[^>]*>/g, ' ');
+ return sanitized.replace(/\s+/g, ' ').trim();
}Verification: after applying, please re-run CodeQL on this PR to confirm alerts 185 and 186 are resolved.
🏁 Script executed:
#!/bin/bash
echo "Checking for DOMParser references and runtime environment indicators..."
grep -R "DOMParser" -n src || echo "No existing DOMParser references in src/"
echo
echo "Checking for jsdom in dependencies..."
grep -R "\"jsdom\"" -n package.json || echo "No jsdom dependency in package.json"
echo
echo "Searching for browser globals usage..."
grep -R "document\." -n src || echo "No document references found in src/"
grep -R "window\." -n src || echo "No window references found in src/"
echo
echo "Locating extractTextContent usages..."
grep -R "extractTextContent" -n src || echo "No calls to extractTextContent found in src/"Length of output: 6218
Use DOMParser for robust <script>/<style> stripping
Relying on iterative regex is brittle and can miss nested or malformed tags. Since this code runs in a browser (and under jsdom in tests), we can safely parse the HTML, remove unwanted nodes, and extract text. In non-browser environments, fall back to the original iterative regex logic.
In src/lib/search-service.ts, update extractTextContent as follows:
private extractTextContent(html: string): string {
- // Repeatedly remove <script> and <style> tags and their content
- let sanitized = html;
- let previous;
- do {
- previous = sanitized;
- sanitized = sanitized
- .replace(/<script[^>]*>.*?<\/script>/gis, '')
- .replace(/<style[^>]*>.*?<\/style>/gis, '');
- } while (sanitized !== previous);
- return sanitized
- .replace(/<[^>]*>/g, ' ')
- .replace(/\s+/g, ' ')
- .trim();
+ // Prefer DOM parsing for robust removal of script/style content
+ if (typeof DOMParser !== 'undefined') {
+ const doc = new DOMParser().parseFromString(html, 'text/html');
+ doc.querySelectorAll('script, style').forEach((n) => n.remove());
+ const text = doc.body?.textContent ?? '';
+ return text.replace(/\s+/g, ' ').trim();
+ }
+ // Fallback to iterative regex removal in non-browser environments
+ let sanitized = html;
+ let previous: string;
+ do {
+ previous = sanitized;
+ sanitized = sanitized
+ .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, '')
+ .replace(/<style\b[^>]*>[\s\S]*?<\/style\s*>/gi, '');
+ } while (sanitized !== previous);
+ sanitized = sanitized.replace(/<[^>]*>/g, ' ');
+ return sanitized.replace(/\s+/g, ' ').trim();
}- Eliminates brittle regex loops in browsers/tests by using a real parser.
- Preserves the original iterative stripping when
DOMParseris unavailable. - No new dependencies required.
After merging, please re-run CodeQL to confirm alerts 185 & 186 are resolved.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Repeatedly remove <script> and <style> tags and their content | |
| let sanitized = html; | |
| let previous; | |
| do { | |
| previous = sanitized; | |
| sanitized = sanitized | |
| .replace(/<script[^>]*>.*?<\/script>/gis, '') | |
| .replace(/<style[^>]*>.*?<\/style>/gis, ''); | |
| } while (sanitized !== previous); | |
| return sanitized | |
| private extractTextContent(html: string): string { | |
| // Prefer DOM parsing for robust removal of script/style content | |
| if (typeof DOMParser !== 'undefined') { | |
| const doc = new DOMParser().parseFromString(html, 'text/html'); | |
| doc.querySelectorAll('script, style').forEach(n => n.remove()); | |
| const text = doc.body?.textContent ?? ''; | |
| return text.replace(/\s+/g, ' ').trim(); | |
| } | |
| // Fallback to iterative regex removal in non-browser environments | |
| let sanitized = html; | |
| let previous: string; | |
| do { | |
| previous = sanitized; | |
| sanitized = sanitized | |
| .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, '') | |
| .replace(/<style\b[^>]*>[\s\S]*?<\/style\s*>/gi, ''); | |
| } while (sanitized !== previous); | |
| sanitized = sanitized.replace(/<[^>]*>/g, ' '); | |
| return sanitized.replace(/\s+/g, ' ').trim(); | |
| } |
🧰 Tools
🪛 GitHub Check: CodeQL
[failure] 324-325: Incomplete multi-character sanitization
This string may still contain <script, which may cause an HTML element injection vulnerability.
[failure] 325-325: Bad HTML filtering regexp
This regular expression does not match script end tags like </script >.
🤖 Prompt for AI Agents
In src/lib/search-service.ts around lines 319 to 328, the current iterative
regex loop for stripping <script> and <style> is brittle; change
extractTextContent to use DOMParser when available: parse the HTML into a
document, remove all script and style elements (querySelectorAll and remove),
then return document.body ? document.body.textContent || '' :
document.textContent || ''; if DOMParser is undefined (non-browser or older
env), fall back to the existing do/while regex loop exactly as before so
behavior remains unchanged.
| .replace(/<style[^>]*>.*?<\/style>/gi, '') | ||
| // Repeatedly remove <script> and <style> tags and their content | ||
| let sanitized = html; | ||
| let previous; |
There was a problem hiding this comment.
Implicit any: type previous to satisfy strict TS (noImplicitAny).
let previous; infers any, violating strict TS and our guidelines. Type it explicitly.
Apply this diff:
- let previous;
+ let previous: string | null = null;📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| let previous; | |
| let previous: string | null = null; |
🤖 Prompt for AI Agents
In src/lib/search-service.ts around line 321, the variable declared as "let
previous;" currently has an implicit any; determine the correct type from its
subsequent usages (e.g., string, number, boolean, or a specific interface/union)
and change the declaration to an explicit typed form such as "let previous:
YourType | undefined" (or the exact type without | undefined if always
initialized). Ensure imports/types are adjusted if you reference a custom type.
Potential fix for https://github.com/otdoges/zapdev/security/code-scanning/122
The best way to fix this problem is to use a well-tested HTML sanitization library, such as
sanitize-html, to reliably remove all<script>and<style>tags and their content, as well as any other potentially dangerous HTML. However, if adding a dependency is not an option, a safer alternative is to repeatedly apply the regular expression replacements for<script>and<style>tags until no more matches are found. This ensures that all instances, including those that may appear after previous replacements, are removed.To implement this fix in
src/lib/search-service.ts, update theextractTextContentmethod (lines 318-325) to repeatedly remove<script>and<style>tags and their content until none remain, before proceeding to strip all other tags and whitespace. No new imports are needed for the repeated replacement approach.Suggested fixes powered by Copilot Autofix. Review carefully before merging.
Summary by CodeRabbit