Potential fix for code scanning alert no. 120: Bad HTML filtering regexp#56
Potential fix for code scanning alert no. 120: Bad HTML filtering regexp#56Jackson57279 merged 2 commits intomainfrom
Conversation
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
|
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
|
Warning Rate limit exceeded@otdoges has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 10 minutes and 2 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (1)
WalkthroughUpdated BraveSearchService.extractTextContent to detect browser-like environments and use DOMParser for HTML-to-text extraction; otherwise, it falls back to a regex-based approach with multiline-safe patterns. No public APIs changed; only internal control flow and extraction logic were modified. Changes
Sequence Diagram(s)sequenceDiagram
participant Caller
participant BraveSearchService
participant EnvCheck as Environment Check
participant DOM as DOMParser
participant Regex as Regex Extractor
Caller->>BraveSearchService: extractTextContent(html)
BraveSearchService->>EnvCheck: window && window.DOMParser?
alt Browser-like
BraveSearchService->>DOM: parse(html)
DOM-->>BraveSearchService: doc.body.textContent
BraveSearchService->>BraveSearchService: collapse whitespace
else Non-browser
BraveSearchService->>Regex: remove <script>/<style>, strip tags
Regex-->>BraveSearchService: plain text
end
BraveSearchService-->>Caller: text
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes Poem
✨ Finishing Touches
🧪 Generate unit tests
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
CodeRabbit Configuration File (
|
|
Claude finished @otdoges's task —— View job Code Review Complete ✅Review Tasks:
Security Fix Analysis✅ Excellent Security Improvement Original Issue (line 319-325): // Vulnerable regex approach
.replace(/<script[^>]*>.*?<\/script>/gi, '')
.replace(/<style[^>]*>.*?<\/style>/gi, '')Fixed Implementation (line 319-332): // Secure DOMParser approach with fallback
if (typeof window !== 'undefined' && typeof window.DOMParser !== 'undefined') {
const parser = new window.DOMParser();
const doc = parser.parseFromString(html, 'text/html');
return doc.body.textContent?.replace(/\s+/g, ' ').trim() || '';
}Code Quality Assessment✅ Strengths:
✅ Best Practices Followed:
Performance Considerations✅ Performance Impact:
Security Assessment✅ Security Improvements:
✅ Additional Security Notes:
Test Coverage Recommendations
Minor Suggestions💡 Code Quality Enhancements:
Final Verdict🟢 LGTM - Approve with Confidence This is an excellent security fix that:
The implementation demonstrates good defensive programming with proper environment detection and safe fallbacks. The security improvement is substantial and the code quality is high. Recommendation: Merge after adding unit tests for the |
…cter sanitization Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
🚨 Bugbot Trial ExpiredYour Bugbot trial has expired. Please purchase a license in the Cursor dashboard to continue using Bugbot. |
There was a problem hiding this comment.
Actionable comments posted: 4
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/lib/search-service.ts(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{ts,tsx}
📄 CodeRabbit Inference Engine (.cursor/rules/convex-security.mdc)
**/*.{ts,tsx}: All Convex queries and mutations MUST use proper authentication. Never accept user IDs from client parameters.
Always verify user owns the data before allowing access.
Use the authenticated user's identity.subject for user references.
Implement proper error messages that don't leak information.
Authentication verification in every function.
Authorization checks for data ownership.
Input validation and sanitization.
Error handling without information leakage.
**/*.{ts,tsx}: Use Sonner for toast notifications to provide consistent user feedback, including success, error, and loading states.
Always handle errors gracefully using try-catch blocks in asynchronous functions, providing user feedback and logging errors.
Provide specific, actionable error messages for form validation errors using toast notifications.
Handle common network error scenarios in catch blocks, providing appropriate toast messages for network errors, authentication errors, and unexpected errors.If using TypeScript, use an enum to store flag names.
Strict TypeScript must be used with no 'any' types allowed
**/*.{ts,tsx}: NEVER useanytype - use proper TypeScript types
Useunknownfor truly unknown data types
Implement proper interface definitions
Do not use empty interfaces; use a type alias instead (e.g.,type InputProps = ...instead ofinterface InputProps {})
All function parameters must be typed
All return types should be explicit for public APIs
Use proper generic constraints
Implement discriminated unions for state management
Use proper interface definitions for error handling types (e.g.,interface ValidationResult { isValid: boolean; error?: string; })
**/*.{ts,tsx}: Always sanitize user input before storing or displaying using a sanitization function likesanitizeText.
Implement comprehensive input validation, including length checks and detection of malicious patterns, as shown in thevalidateInputfunction.
Define and use security constants suc...
Files:
src/lib/search-service.ts
**/*.{js,jsx,ts,tsx}
📄 CodeRabbit Inference Engine (.cursor/rules/posthog-integration.mdc)
**/*.{js,jsx,ts,tsx}: Use a consistent naming convention for this storage. enum/const object members should be written UPPERCASE_WITH_UNDERSCORE.
If a custom property for a person or event is at any point referenced in two or more files or two or more callsites in the same file, use an enum or const object, as above in feature flags.
Files:
src/lib/search-service.ts
🪛 GitHub Check: CodeQL
src/lib/search-service.ts
[failure] 325-327: Incomplete multi-character sanitization
This string may still contain <style, which may cause an HTML element injection vulnerability.
[failure] 325-326: Incomplete multi-character sanitization
This string may still contain <script, which may cause an HTML element injection vulnerability.
| if (typeof window !== 'undefined' && typeof window.DOMParser !== 'undefined') { | ||
| const parser = new window.DOMParser(); | ||
| const doc = parser.parseFromString(html, 'text/html'); |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Broaden environment detection — prefer DOMParser presence over window checks
Checking window can fail in workers/iframes/SSR shims. Gate purely on DOMParser existence; instantiate via new DOMParser().
Apply this diff:
- if (typeof window !== 'undefined' && typeof window.DOMParser !== 'undefined') {
- const parser = new window.DOMParser();
+ if (typeof DOMParser !== 'undefined') {
+ const parser = new DOMParser();📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if (typeof window !== 'undefined' && typeof window.DOMParser !== 'undefined') { | |
| const parser = new window.DOMParser(); | |
| const doc = parser.parseFromString(html, 'text/html'); | |
| if (typeof DOMParser !== 'undefined') { | |
| const parser = new DOMParser(); | |
| const doc = parser.parseFromString(html, 'text/html'); |
🤖 Prompt for AI Agents
In src/lib/search-service.ts around lines 319 to 321, the code currently checks
for window and window.DOMParser which fails in workers/iframes/SSR shims; change
the guard to check only for DOMParser presence (e.g. if (typeof DOMParser !==
'undefined')) and instantiate the parser with new DOMParser() (not
window.DOMParser), then call parser.parseFromString(html, 'text/html') as
before.
| if (typeof window !== 'undefined' && typeof window.DOMParser !== 'undefined') { | ||
| const parser = new window.DOMParser(); | ||
| const doc = parser.parseFromString(html, 'text/html'); | ||
| return doc.body.textContent?.replace(/\s+/g, ' ').trim() || ''; | ||
| } else { |
There was a problem hiding this comment.
textContent still includes <script>/<style> contents — remove those nodes before extracting text
Using textContent includes the raw contents of <script> and <style> elements. To truly ignore them, strip those nodes before reading text. This also makes behavior deterministic across browsers.
Apply this diff:
- if (typeof window !== 'undefined' && typeof window.DOMParser !== 'undefined') {
- const parser = new window.DOMParser();
- const doc = parser.parseFromString(html, 'text/html');
- return doc.body.textContent?.replace(/\s+/g, ' ').trim() || '';
+ if (typeof DOMParser !== 'undefined') {
+ const parser = new DOMParser();
+ const doc = parser.parseFromString(html, 'text/html');
+ if (doc.body) {
+ doc.body.querySelectorAll('script,style,noscript').forEach((el) => el.remove());
+ }
+ return doc.body?.textContent?.replace(/\s+/g, ' ').trim() || '';📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if (typeof window !== 'undefined' && typeof window.DOMParser !== 'undefined') { | |
| const parser = new window.DOMParser(); | |
| const doc = parser.parseFromString(html, 'text/html'); | |
| return doc.body.textContent?.replace(/\s+/g, ' ').trim() || ''; | |
| } else { | |
| if (typeof DOMParser !== 'undefined') { | |
| const parser = new DOMParser(); | |
| const doc = parser.parseFromString(html, 'text/html'); | |
| if (doc.body) { | |
| doc.body.querySelectorAll('script,style,noscript').forEach((el) => el.remove()); | |
| } | |
| return doc.body?.textContent?.replace(/\s+/g, ' ').trim() || ''; | |
| } else { | |
| // …existing fallback logic… |
🤖 Prompt for AI Agents
In src/lib/search-service.ts around lines 319 to 323, the current use of
body.textContent still captures <script> and <style> contents; before extracting
text, query and remove all script and style elements from the parsed document
(e.g. doc.querySelectorAll('script,style') and call remove() on each), then read
doc.body.textContent, normalize whitespace with replace(/\s+/g, ' ').trim(), and
return that; keep the existing DOMParser branch and the existing fallback
behavior unchanged.
src/lib/search-service.ts
Outdated
| // Fallback: original regex-based approach (less safe) | ||
| return html | ||
| .replace(/<script[\s\S]*?>[\s\S]*?<\/script[\s\S]*?>/gi, '') | ||
| .replace(/<style[\s\S]*?>[\s\S]*?<\/style[\s\S]*?>/gi, '') | ||
| .replace(/<[^>]*>/g, ' ') | ||
| .replace(/\s+/g, ' ') | ||
| .trim(); | ||
| } |
There was a problem hiding this comment.
Fallback regex keeps the CodeQL alert alive and remains fragile — remove it and fail safe in non-DOM environments
The fallback regex is exactly what CodeQL flags (“Incomplete multi-character sanitization” at Lines 325-327, 325-326). Keeping it means the alert likely persists. In environments without a DOM (SSR/Node), avoid brittle regex parsing; log and return an empty string instead, or wire in a proper server-side parser in a follow-up.
Apply this diff:
- } else {
- // Fallback: original regex-based approach (less safe)
- return html
- .replace(/<script[\s\S]*?>[\s\S]*?<\/script[\s\S]*?>/gi, '')
- .replace(/<style[\s\S]*?>[\s\S]*?<\/style[\s\S]*?>/gi, '')
- .replace(/<[^>]*>/g, ' ')
- .replace(/\s+/g, ' ')
- .trim();
- }
+ }
+ // No safe DOM available (e.g., SSR/Node) — avoid brittle regex parsing.
+ logger.warn(
+ 'extractTextContent: DOMParser unavailable; returning empty content in non-browser environment'
+ );
+ return '';Optional follow-up (separate change): If you need SSR support, consider introducing a parsing adapter that uses a well-tested library (e.g., html-to-text) only on the server path to avoid regexes while keeping this method synchronous, or refactor to an injected strategy.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Fallback: original regex-based approach (less safe) | |
| return html | |
| .replace(/<script[\s\S]*?>[\s\S]*?<\/script[\s\S]*?>/gi, '') | |
| .replace(/<style[\s\S]*?>[\s\S]*?<\/style[\s\S]*?>/gi, '') | |
| .replace(/<[^>]*>/g, ' ') | |
| .replace(/\s+/g, ' ') | |
| .trim(); | |
| } | |
| } | |
| // No safe DOM available (e.g., SSR/Node) — avoid brittle regex parsing. | |
| logger.warn( | |
| 'extractTextContent: DOMParser unavailable; returning empty content in non-browser environment' | |
| ); | |
| return ''; |
🧰 Tools
🪛 GitHub Check: CodeQL
[failure] 325-327: Incomplete multi-character sanitization
This string may still contain <style, which may cause an HTML element injection vulnerability.
[failure] 325-326: Incomplete multi-character sanitization
This string may still contain <script, which may cause an HTML element injection vulnerability.
🤖 Prompt for AI Agents
In src/lib/search-service.ts around lines 324 to 331, the fallback regex-based
HTML stripping is triggering a CodeQL "Incomplete multi-character sanitization"
alert and is fragile for non-DOM environments; remove the regex fallback and
instead detect the absence of a DOM parser, log a warning/error, and return an
empty string (or a safe default) to fail closed; alternatively add a TODO
comment to wire in a proper server-side parser in a follow-up, but do not
reintroduce regex-based sanitization.
💡 Verification agent
🧩 Analysis chain
Verify the CodeQL alert is fully resolved after removing the regex fallback
Make sure the repository no longer has fragile HTML-stripping regexes that could re-trigger the alert.
Run this script to search for similar patterns:
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Searching for fragile HTML-stripping regex patterns..."
# Common patterns flagged by CodeQL in this context
rg -nH -A1 -e '<script[\\s\\S]*?<\\/script' -e '<style[\\s\\S]*?<\\/style' -e '\\.replace\\(/<[^>]*>/g' -e '<[^>]+>' \
--glob '!**/dist/**' --glob '!**/build/**' || trueLength of output: 510
I’ve corrected the quoting so each pattern is evaluated separately and won’t be merged into one malformed regex. Let’s rerun the search:
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Searching for fragile HTML-stripping regex patterns..."
patterns=(
'<script[\s\S]*?</script>'
'<style[\s\S]*?</style>'
'\.replace\(/<[^>]*>/g'
'<[^>]+>'
)
for pat in "${patterns[@]}"; do
echo "Pattern: $pat"
rg -nH -A1 -e "$pat" --glob '!**/dist/**' --glob '!**/build/**' || true
doneLength of output: 297918
Remove fragile regex-based HTML stripping
The fallback regex in src/lib/search-service.ts still exists and can re-trigger CodeQL alerts:
- File: src/lib/search-service.ts
Lines 326–329:.replace(/<script[\s\S]*?>[\s\S]*?<\/script[\s\S]*?>/gi, '') .replace(/<style[\s\S]*?>[\s\S]*?<\/style[\s\S]*?>/gi, '') .replace(/<[^>]*>/g, ' ')
Please remove or replace this fallback with a proper HTML sanitizer (e.g., DOMPurify).
Also note a manual escape in src/components/WebContainerFailsafe.tsx that uses regex replaces for </>:
- File: src/components/WebContainerFailsafe.tsx
Line 514:${codeInput.replace(/</g, '<').replace(/>/g, '>')}
Consider consolidating all HTML sanitization through your chosen library to eliminate fragile regexes.
🧰 Tools
🪛 GitHub Check: CodeQL
[failure] 325-327: Incomplete multi-character sanitization
This string may still contain <style, which may cause an HTML element injection vulnerability.
[failure] 325-326: Incomplete multi-character sanitization
This string may still contain <script, which may cause an HTML element injection vulnerability.
🤖 Prompt for AI Agents
In src/lib/search-service.ts around lines 324–331, remove the fragile
regex-based HTML stripping fallback and replace it with a proper HTML sanitizer
(e.g., DOMPurify); import and use the sanitizer to sanitize the HTML input and
then collapse whitespace/trim, ensuring you handle server vs browser
environments (use DOMPurify with jsdom or a safe server-side alternative) and
preserve the previous behavior of removing scripts/styles; also update
src/components/WebContainerFailsafe.tsx at the indicated line to use the same
sanitizer/escaping function instead of manual .replace(/</g,
'<').replace(/>/g, '>') so all sanitization is consolidated through the
chosen library and the regex-based removals are eliminated.
Potential fix for https://github.com/otdoges/zapdev/security/code-scanning/120
The best way to fix this problem is to use a well-tested HTML sanitization or parsing library to remove
<script>and<style>tags and their contents, rather than relying on a fragile regular expression. In the context of a TypeScript/JavaScript project, thesanitize-htmlordompurifylibraries are popular choices. However, since we are only extracting text content (not sanitizing for safe HTML), we can use the browser's DOMParser to parse the HTML and extract the text content, which will automatically ignore script and style contents.To implement this, we should replace the current
extractTextContentmethod with one that uses the DOMParser API (available in browsers) to parse the HTML and extract the text content from the resulting document. If this code is intended to run in Node.js, we can use thejsdomlibrary to provide similar functionality.Since we are only allowed to edit the code shown, and not the rest of the codebase, and since the file appears to be client-side (uses '@sentry/react'), we can assume browser APIs are available. Therefore, we will use
DOMParserto parse the HTML and extract the text content.Edit the
extractTextContentmethod insrc/lib/search-service.tsto useDOMParserinstead of regexes.Suggested fixes powered by Copilot Autofix. Review carefully before merging.
Summary by CodeRabbit
Bug Fixes
Refactor