Potential fix for code scanning alert no. 186: Incomplete multi-character sanitization#62
Conversation
…cter sanitization Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
|
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
WalkthroughReplaced regex-based HTML text extraction in BraveSearchService.extractTextContent with a Cheerio-based DOM parsing approach that removes script/style nodes before extracting and normalizing text. Public APIs and other methods remain unchanged. Changes
Sequence Diagram(s)sequenceDiagram
participant Caller
participant BraveSearchService
participant Cheerio
Caller->>BraveSearchService: extractTextContent(html)
BraveSearchService->>Cheerio: load(html)
Cheerio-->>BraveSearchService: DOM
BraveSearchService->>Cheerio: remove(script, style)
BraveSearchService->>Cheerio: root().text()
Cheerio-->>BraveSearchService: textContent
BraveSearchService-->>Caller: collapsed/trimmed text
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes Possibly related PRs
Poem
✨ Finishing Touches
🧪 Generate unit tests
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
CodeRabbit Configuration File (
|
|
Claude finished @otdoges's task —— View job Code Review Complete ✅Todo List:
Overall Assessment: Excellent Security Fix 🔒This PR successfully addresses CodeQL security alert #186 by replacing a vulnerable regex-based approach with a robust HTML parser solution. The change is well-implemented and significantly improves security. Code Quality & Best Practices ✅Strengths:
Implementation Quality: private extractTextContent(html: string): string {
// Use cheerio to robustly remove <script> and <style> tags and extract text
const $ = cheerio.load(html);
$('script, style').remove();
return $.root().text().replace(/\s+/g, ' ').trim();
}Security Analysis 🛡️Major Security Improvement:
Security Context: Performance Considerations ⚡Performance Impact:
Memory efficiency: The new approach is actually more memory-efficient as it doesn't require multiple string copies during the loop. Potential Issues & Edge Cases 🔍None identified - The implementation handles edge cases well:
Test Coverage Assessment 📋Current state: No tests visible for this specific method in the codebase. Recommendations for follow-up: // Suggested test cases (for future PR):
describe('extractTextContent', () => {
it('should remove script tags and content', () => {
const html = '<div>Hello <script>alert("xss")</script> World</div>';
expect(result).toBe('Hello World');
});
it('should handle malformed script tags', () => {
const html = '<div>Text <script<!--comment-->type="text/javascript">evil()</script> More</div>';
expect(result).toBe('Text More');
});
it('should remove style tags and content', () => {
const html = '<div>Content <style>body{color:red}</style> Text</div>';
expect(result).toBe('Content Text');
});
});Final Verdict ✅Approve with confidence - This is a high-quality security fix that:
No blocking issues found. The implementation is ready for merge. |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/lib/search-service.ts (2)
319-322: Prefer DOMParser in browsers, fallback to Cheerio to cut bundle sizeThis module appears to run client-side; Cheerio is heavy for the browser. Consider using DOMParser when available and falling back to Cheerio (Node/SSR). Also exclude noscript/template and extract from body to avoid bringing in head/title duplicates, and normalize NBSP.
Apply this diff:
- // Use cheerio to robustly remove <script> and <style> tags and extract text - const $ = cheerio.load(html); - $('script, style').remove(); - return $.root().text().replace(/\s+/g, ' ').trim(); + // Prefer DOMParser in browsers; fallback to Cheerio (Node/SSR) + try { + if (typeof window !== 'undefined' && typeof DOMParser !== 'undefined') { + const parser = new DOMParser(); + const doc = parser.parseFromString(html, 'text/html'); + doc.querySelectorAll('script, style, noscript, template').forEach((n) => n.remove()); + const text = + (doc.body && doc.body.textContent) || + (doc.documentElement && doc.documentElement.textContent) || + ''; + return text.replace(/\u00A0/g, ' ').replace(/\s+/g, ' ').trim(); + } + } catch { + // Fall through to Cheerio + } + const $ = cheerio.load(html); + $('script, style, noscript, template').remove(); + const text = $('body').length ? $('body').text() : $.root().text(); + return text.replace(/\u00A0/g, ' ').replace(/\s+/g, ' ').trim();
319-322: If keeping Cheerio-only: remove noscript/template, use body text, and normalize NBSPMinor hardening without changing approach:
- Remove noscript/template (often contain non-UX text).
- Extract from body to avoid duplicating head/title text.
- Normalize NBSP before whitespace collapse.
- const $ = cheerio.load(html); - $('script, style').remove(); - return $.root().text().replace(/\s+/g, ' ').trim(); + const $ = cheerio.load(html); + $('script, style, noscript, template').remove(); + const text = $('body').length ? $('body').text() : $.root().text(); + return text.replace(/\u00A0/g, ' ').replace(/\s+/g, ' ').trim();
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/lib/search-service.ts(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{ts,tsx}
📄 CodeRabbit Inference Engine (.cursor/rules/convex-security.mdc)
**/*.{ts,tsx}: All Convex queries and mutations MUST use proper authentication. Never accept user IDs from client parameters.
Always verify user owns the data before allowing access.
Use the authenticated user's identity.subject for user references.
Implement proper error messages that don't leak information.
Authentication verification in every function.
Authorization checks for data ownership.
Input validation and sanitization.
Error handling without information leakage.
**/*.{ts,tsx}: Use Sonner for toast notifications to provide consistent user feedback, including success, error, and loading states.
Always handle errors gracefully using try-catch blocks in asynchronous functions, providing user feedback and logging errors.
Provide specific, actionable error messages for form validation errors using toast notifications.
Handle common network error scenarios in catch blocks, providing appropriate toast messages for network errors, authentication errors, and unexpected errors.If using TypeScript, use an enum to store flag names.
Strict TypeScript must be used with no 'any' types allowed
**/*.{ts,tsx}: NEVER useanytype - use proper TypeScript types
Useunknownfor truly unknown data types
Implement proper interface definitions
Do not use empty interfaces; use a type alias instead (e.g.,type InputProps = ...instead ofinterface InputProps {})
All function parameters must be typed
All return types should be explicit for public APIs
Use proper generic constraints
Implement discriminated unions for state management
Use proper interface definitions for error handling types (e.g.,interface ValidationResult { isValid: boolean; error?: string; })
**/*.{ts,tsx}: Always sanitize user input before storing or displaying using a sanitization function likesanitizeText.
Implement comprehensive input validation, including length checks and detection of malicious patterns, as shown in thevalidateInputfunction.
Define and use security constants suc...
Files:
src/lib/search-service.ts
**/*.{js,jsx,ts,tsx}
📄 CodeRabbit Inference Engine (.cursor/rules/posthog-integration.mdc)
**/*.{js,jsx,ts,tsx}: Use a consistent naming convention for this storage. enum/const object members should be written UPPERCASE_WITH_UNDERSCORE.
If a custom property for a person or event is at any point referenced in two or more files or two or more callsites in the same file, use an enum or const object, as above in feature flags.
Files:
src/lib/search-service.ts
🔇 Additional comments (1)
src/lib/search-service.ts (1)
319-322: Cheerio-based DOM parsing robustly removes script/style — LGTMSwitching from regex to Cheerio parsing eliminates the multi-character sanitization gap and handles malformed/obfuscated tags more reliably. This should satisfy alert #186.
Potential fix for https://github.com/otdoges/zapdev/security/code-scanning/186
The best way to fix this problem is to use a proper HTML parser to remove
<script>and<style>elements and their content, rather than relying on regular expressions. Sincecheeriois already imported in the file, we can use it to parse the HTML, remove all<script>and<style>elements, and then extract the text content. This approach is robust against malformed or obfuscated tags and does not rely on repeated regex replacements. The change should be made in theextractTextContentmethod (lines 318–332), replacing the current implementation with one that usescheerioto remove the unwanted elements and extract the text.Suggested fixes powered by Copilot Autofix. Review carefully before merging.
Summary by CodeRabbit