13 changes: 10 additions & 3 deletions src/lib/search-service.ts
@@ -316,9 +316,16 @@
   }
 
   private extractTextContent(html: string): string {
-    return html
-      .replace(/<script[^>]*>.*?<\/script>/gi, '')
-      .replace(/<style[^>]*>.*?<\/style>/gi, '')
+    // Repeatedly remove <script> and <style> tags and their content
+    let sanitized = html;
+    let previous;
⚠️ Potential issue

Implicit any: type previous to satisfy strict TS (noImplicitAny).

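let previous; has no type annotation, so its type is an implicit any that TypeScript refines only through control-flow analysis of later assignments. That goes against our strict TS guidelines. Type it explicitly.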

Apply this diff:

-    let previous;
+    let previous: string | null = null;
🤖 Prompt for AI Agents
In src/lib/search-service.ts around line 321, the variable declared as "let
previous;" currently has an implicit any; determine the correct type from its
subsequent usages (e.g., string, number, boolean, or a specific interface/union)
and change the declaration to an explicit typed form such as "let previous:
YourType | undefined" (or the exact type without | undefined if always
initialized). Ensure imports/types are adjusted if you reference a custom type.

+    do {
+      previous = sanitized;
+      sanitized = sanitized
+        .replace(/<script[^>]*>.*?<\/script>/gis, '')
Comment on lines +324 to +325

Check failure

Code scanning / CodeQL

Incomplete multi-character sanitization High

This string may still contain <script, which may cause an HTML element injection vulnerability.

Copilot Autofix

AI 6 months ago

To fix the incomplete multi-character sanitization, we should ensure that all script and style tags are fully removed, even if they are nested, malformed, or appear consecutively. The best way to do this without changing existing functionality is to repeatedly apply the regular expression replacements for <script> and <style> tags until no more matches are found. This can be done with a loop that continues replacing as long as the string changes. Alternatively, we could use a well-tested library like sanitize-html to remove all HTML tags, but since the code only extracts text content, a repeated replacement approach is sufficient and does not require new dependencies. The change should be made in the extractTextContent method in src/lib/search-service.ts, lines 318-325.
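The repeated-replacement idea described above can be sketched as follows. This is a minimal illustration, not the code in src/lib/search-service.ts: the helper names stripOnce and stripToFixedPoint and the sample input are hypothetical, and the pattern uses [\s\S] and \s* so it also covers multiline bodies and spaced closing tags.

```typescript
// Hypothetical helpers for illustration only; not part of the PR.
const SCRIPT_BLOCK = /<script\b[^>]*>[\s\S]*?<\/script\s*>/gi;

// A single pass removes every match found in the string as it is right now.
function stripOnce(input: string): string {
  return input.replace(SCRIPT_BLOCK, '');
}

// Repeat the pass until the string stops changing (a fixed point).
function stripToFixedPoint(input: string): string {
  let current = input;
  let previous: string;
  do {
    previous = current;
    current = stripOnce(current);
  } while (current !== previous);
  return current;
}

// Removing the inner block splices the outer fragments into a new <script> tag.
const tricky = '<scr<script>x</script>ipt>alert(1)</script>';

console.log(stripOnce(tricky));         // <script>alert(1)</script>
console.log(stripToFixedPoint(tricky)); // (empty string)
```

The single pass leaves a complete script element behind because the removal itself creates it, which is the situation the CodeQL alert describes. The loop ends once a pass makes no change.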


Suggested changeset 1
src/lib/search-service.ts

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/src/lib/search-service.ts b/src/lib/search-service.ts
--- a/src/lib/search-service.ts
+++ b/src/lib/search-service.ts
@@ -316,12 +316,22 @@
   }
 
   private extractTextContent(html: string): string {
-    return html
-      .replace(/<script[^>]*>.*?<\/script>/gi, '')
-      .replace(/<style[^>]*>.*?<\/style>/gi, '')
-      .replace(/<[^>]*>/g, ' ')
-      .replace(/\s+/g, ' ')
-      .trim();
+    let sanitized = html;
+    let previous;
+    // Remove all <script> tags and their content repeatedly
+    do {
+      previous = sanitized;
+      sanitized = sanitized.replace(/<script[^>]*>.*?<\/script>/gis, '');
+    } while (sanitized !== previous);
+    // Remove all <style> tags and their content repeatedly
+    do {
+      previous = sanitized;
+      sanitized = sanitized.replace(/<style[^>]*>.*?<\/style>/gis, '');
+    } while (sanitized !== previous);
+    // Remove all remaining HTML tags
+    sanitized = sanitized.replace(/<[^>]*>/g, ' ');
+    sanitized = sanitized.replace(/\s+/g, ' ').trim();
+    return sanitized;
   }
 }
 
EOF
Unable to commit as this autofix suggestion is now outdated
+        .replace(/<style[^>]*>.*?<\/style>/gis, '');
+    } while (sanitized !== previous);
+    return sanitized
Comment on lines +319 to +328

🛠️ Refactor suggestion

❓ Verification inconclusive

Prefer DOM parsing over regex for robust tag stripping (addresses CodeQL alert #122).

Regex-based HTML stripping is brittle (nested tags, malformed markup, edge whitespace). Since this runs in a browser context, use DOMParser to remove script/style safely and extract text. This eliminates the need for iterative regex and should silence both CodeQL alerts.

Proposed replacement within this method:

   private extractTextContent(html: string): string {
-    // Repeatedly remove <script> and <style> tags and their content
-    let sanitized = html;
-    let previous;
-    do {
-      previous = sanitized;
-      sanitized = sanitized
-        .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, '')
-        .replace(/<style\b[^>]*>[\s\S]*?<\/style\s*>/gi, '');
-    } while (sanitized !== previous);
-    return sanitized
-      .replace(/<[^>]*>/g, ' ')
-      .replace(/\s+/g, ' ')
-      .trim();
+    // Prefer DOM parsing for robust removal of script/style content
+    if (typeof DOMParser !== 'undefined') {
+      const parser = new DOMParser();
+      const doc = parser.parseFromString(html, 'text/html');
+      doc.querySelectorAll('script, style').forEach((n) => n.remove());
+      const text = doc.body?.textContent ?? '';
+      return text.replace(/\s+/g, ' ').trim();
+    }
+    // Fallback to regex if DOMParser is unavailable (e.g., non-browser env)
+    let sanitized = html
+      .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, '')
+      .replace(/<style\b[^>]*>[\s\S]*?<\/style\s*>/gi, '')
+      .replace(/<[^>]*>/g, ' ');
+    return sanitized.replace(/\s+/g, ' ').trim();
   }

Verification: after applying, please re-run CodeQL on this PR to confirm alerts 185 and 186 are resolved.


🏁 Script executed:

#!/bin/bash
echo "Checking for DOMParser references and runtime environment indicators..."
grep -R "DOMParser" -n src || echo "No existing DOMParser references in src/"
echo
echo "Checking for jsdom in dependencies..."
grep -R "\"jsdom\"" -n package.json || echo "No jsdom dependency in package.json"
echo
echo "Searching for browser globals usage..."
grep -R "document\." -n src || echo "No document references found in src/"
grep -R "window\." -n src || echo "No window references found in src/"
echo
echo "Locating extractTextContent usages..."
grep -R "extractTextContent" -n src || echo "No calls to extractTextContent found in src/"

Length of output: 6218


Use DOMParser for robust <script>/<style> stripping

Relying on iterative regex is brittle and can miss nested or malformed tags. Since this code runs in a browser (and under jsdom in tests), we can safely parse the HTML, remove unwanted nodes, and extract text. In non-browser environments, fall back to the original iterative regex logic.

In src/lib/search-service.ts, update extractTextContent as follows:

   private extractTextContent(html: string): string {
-    // Repeatedly remove <script> and <style> tags and their content
-    let sanitized = html;
-    let previous;
-    do {
-      previous = sanitized;
-      sanitized = sanitized
-        .replace(/<script[^>]*>.*?<\/script>/gis, '')
-        .replace(/<style[^>]*>.*?<\/style>/gis, '');
-    } while (sanitized !== previous);
-    return sanitized
-      .replace(/<[^>]*>/g, ' ')
-      .replace(/\s+/g, ' ')
-      .trim();
+    // Prefer DOM parsing for robust removal of script/style content
+    if (typeof DOMParser !== 'undefined') {
+      const doc = new DOMParser().parseFromString(html, 'text/html');
+      doc.querySelectorAll('script, style').forEach((n) => n.remove());
+      const text = doc.body?.textContent ?? '';
+      return text.replace(/\s+/g, ' ').trim();
+    }
+    // Fallback to iterative regex removal in non-browser environments
+    let sanitized = html;
+    let previous: string;
+    do {
+      previous = sanitized;
+      sanitized = sanitized
+        .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, '')
+        .replace(/<style\b[^>]*>[\s\S]*?<\/style\s*>/gi, '');
+    } while (sanitized !== previous);
+    sanitized = sanitized.replace(/<[^>]*>/g, ' ');
+    return sanitized.replace(/\s+/g, ' ').trim();
   }
  • Eliminates brittle regex loops in browsers/tests by using a real parser.
  • Preserves the original iterative stripping when DOMParser is unavailable.
  • No new dependencies required.

After merging, please re-run CodeQL to confirm alerts 185 & 186 are resolved.

🧰 Tools
🪛 GitHub Check: CodeQL

[failure] 324-325: Incomplete multi-character sanitization
This string may still contain <script, which may cause an HTML element injection vulnerability.


[failure] 325-325: Bad HTML filtering regexp
This regular expression does not match script end tags like </script >.

🤖 Prompt for AI Agents
In src/lib/search-service.ts around lines 319 to 328, the current iterative
regex loop for stripping <script> and <style> is brittle; change
extractTextContent to use DOMParser when available: parse the HTML into a
document, remove all script and style elements (querySelectorAll and remove),
then return document.body ? document.body.textContent || '' :
document.textContent || ''; if DOMParser is undefined (non-browser or older
env), fall back to the existing do/while regex loop exactly as before so
behavior remains unchanged.

       .replace(/<[^>]*>/g, ' ')
       .replace(/\s+/g, ' ')
       .trim();
Comment on lines +319 to 331

🛠️ Refactor suggestion

Add targeted tests for tricky cases (nested, spaced closing tags, multiline).

To prevent regressions and to validate the CodeQL concerns, add unit tests covering:

  • </script > and </style > (space before >)
  • Multiline blocks and nested <script> inside <script>
  • Upper/lower case tag names
  • Malformed tags where the closing tag appears inside strings

I can draft a small test suite (Jest/Vitest) for extractTextContent with these cases if helpful.
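As a starting point, the cases above could look roughly like the sketch below. It assumes a standalone copy of the regex fallback proposed in this review, because the real extractTextContent is a private method. The cases table and the failures check are hypothetical and framework-free; they would translate directly into Jest or Vitest it/expect calls.

```typescript
// Standalone copy of the regex fallback proposed in this review (assumption:
// inlined here because the real method is private on the service class).
function extractTextContent(html: string): string {
  let sanitized = html;
  let previous: string;
  do {
    previous = sanitized;
    sanitized = sanitized
      .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, '')
      .replace(/<style\b[^>]*>[\s\S]*?<\/style\s*>/gi, '');
  } while (sanitized !== previous);
  return sanitized.replace(/<[^>]*>/g, ' ').replace(/\s+/g, ' ').trim();
}

// Hypothetical test table covering the tricky inputs listed above.
const cases: Array<{ name: string; html: string; expected: string }> = [
  { name: 'space before > in closing tag',
    html: '<p>Hello</p><script>evil()</script >', expected: 'Hello' },
  { name: 'multiline style block',
    html: '<style>\n  body { color: red; }\n</style><p>Text</p>', expected: 'Text' },
  { name: 'mixed-case tag names',
    html: '<ScRiPt>x()</SCRIPT>Visible', expected: 'Visible' },
  { name: 'script tag reassembled after inner removal',
    html: '<scr<script>x</script>ipt>alert(1)</script>Done', expected: 'Done' },
  // The block ends at the first closing tag, so the rest of the string literal survives as text.
  { name: 'closing tag inside a string literal',
    html: '<script>var s = "</script>";</script>After', expected: '"; After' },
];

const failures = cases
  .filter((c) => extractTextContent(c.html) !== c.expected)
  .map((c) => c.name);

console.log(failures.length === 0 ? 'all cases pass' : `failing: ${failures.join(', ')}`);
```

The last case documents a limitation rather than a fix: with regex stripping, the block ends at the first closing tag, so a fragment of the script body leaks into the extracted text.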

🤖 Prompt for AI Agents
In src/lib/search-service.ts around lines 319 to 331, the HTML sanitization loop
needs targeted unit tests to prevent regressions: add a test suite for
extractTextContent that asserts correct stripping for closing tags with spaces
(e.g., </script >, </style    >), multiline script/style blocks, nested script
tags (e.g., a <script> containing another <script>), mixed case tags (e.g.,
<ScRiPt> and </SCRIPT>), and malformed cases where closing tag text appears
inside string literals; implement tests using the project test framework (Jest
or Vitest) that feed these HTML strings into extractTextContent and assert the
returned plain text is correctly stripped and normalized (single spaces,
trimmed).
