Merged
29 changes: 23 additions & 6 deletions src/lib/search-service.ts
@@ -316,12 +316,29 @@ export class BraveSearchService {
}

   private extractTextContent(html: string): string {
-    return html
-      .replace(/<script[^>]*>.*?<\/script>/gi, '')
-      .replace(/<style[^>]*>.*?<\/style>/gi, '')
-      .replace(/<[^>]*>/g, ' ')
-      .replace(/\s+/g, ' ')
-      .trim();
+    if (typeof window !== 'undefined' && typeof window.DOMParser !== 'undefined') {
+      const parser = new window.DOMParser();
+      const doc = parser.parseFromString(html, 'text/html');
Comment on lines +319 to +321

🛠️ Refactor suggestion

Broaden environment detection — prefer DOMParser presence over window checks

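Checking window is stricter than needed: the guard fails in any environment that exposes a global DOMParser without a window object (for example, some SSR shims). Web Workers expose neither window nor DOMParser, so they still take the fallback path either way. Gate purely on DOMParser existence; instantiate via new DOMParser().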

Apply this diff:

-    if (typeof window !== 'undefined' && typeof window.DOMParser !== 'undefined') {
-      const parser = new window.DOMParser();
+    if (typeof DOMParser !== 'undefined') {
+      const parser = new DOMParser();
🤖 Prompt for AI Agents
In src/lib/search-service.ts around lines 319 to 321, the code checks for both
window and window.DOMParser, which skips the DOM path in environments that
expose a global DOMParser without a window object (for example, some SSR shims);
change the guard to check only for DOMParser presence (e.g. if (typeof DOMParser
!== 'undefined')) and instantiate the parser with new DOMParser() (not
window.DOMParser), then call parser.parseFromString(html, 'text/html') as
before.
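
To illustrate the guard suggested in this comment, here is a minimal sketch of feature detection that keys on the DOMParser global alone. The helper name resolveDomParser and the ParserCtor type are hypothetical and not part of the PR; the lookup goes through globalThis only so that the snippet compiles without DOM typings, whereas the PR itself can test typeof DOMParser directly.

```typescript
// Hypothetical structural type: anything constructible that offers parseFromString.
type ParserCtor = new () => {
  parseFromString(html: string, type: string): unknown;
};

// Returns the global DOMParser constructor when one exists, otherwise undefined.
// The presence of `window` is deliberately not consulted.
function resolveDomParser(): ParserCtor | undefined {
  const candidate = (globalThis as unknown as { DOMParser?: ParserCtor }).DOMParser;
  return typeof candidate === 'function' ? candidate : undefined;
}
```

On a browser main thread this returns the native constructor. In plain Node.js, which defines no DOMParser global, it returns undefined and the caller can take the regex fallback.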

+      return doc.body.textContent?.replace(/\s+/g, ' ').trim() || '';
+    } else {
Comment on lines +319 to +323

⚠️ Potential issue

textContent still includes <script>/<style> contents — remove those nodes before extracting text

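Using textContent includes the raw contents of <script> and <style> elements, because textContent concatenates the text of every descendant node whether or not it is rendered. To truly ignore them, remove those nodes before reading the text. The diff below also removes <noscript>, whose fallback markup would otherwise contribute text. This also makes behavior deterministic across browsers.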

Apply this diff:

-    if (typeof window !== 'undefined' && typeof window.DOMParser !== 'undefined') {
-      const parser = new window.DOMParser();
-      const doc = parser.parseFromString(html, 'text/html');
-      return doc.body.textContent?.replace(/\s+/g, ' ').trim() || '';
+    if (typeof DOMParser !== 'undefined') {
+      const parser = new DOMParser();
+      const doc = parser.parseFromString(html, 'text/html');
+      if (doc.body) {
+        doc.body.querySelectorAll('script,style,noscript').forEach((el) => el.remove());
+      }
+      return doc.body?.textContent?.replace(/\s+/g, ' ').trim() || '';
🤖 Prompt for AI Agents
In src/lib/search-service.ts around lines 319 to 323, the current use of
body.textContent still captures <script> and <style> contents; before extracting
text, query and remove all script, style, and noscript elements from the parsed
document (e.g. doc.body.querySelectorAll('script,style,noscript') and call
remove() on each), then read doc.body.textContent, normalize whitespace with
replace(/\s+/g, ' ').trim(), and return that; keep the existing DOMParser branch
and the existing fallback behavior unchanged.
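
The node-removal step described in this comment can be sketched as a standalone function. This is an illustration under stated assumptions, not the PR's code: the name extractTextWithDom, the structural types, and the optional Parser argument (added so the function can be exercised without a real DOM) are all hypothetical.

```typescript
// Hypothetical minimal shapes, so the sketch compiles without DOM typings.
interface RemovableNode {
  remove(): void;
}
interface BodyLike {
  textContent: string | null;
  querySelectorAll(selectors: string): {
    forEach(cb: (el: RemovableNode) => void): void;
  };
}
interface DocLike {
  body: BodyLike | null;
}
type DomParserCtor = new () => {
  parseFromString(html: string, type: string): DocLike;
};

// Returns normalized text, or null when no parser is available so the
// caller can switch to its fallback path.
function extractTextWithDom(html: string, Parser?: DomParserCtor): string | null {
  const Ctor =
    Parser ?? (globalThis as unknown as { DOMParser?: DomParserCtor }).DOMParser;
  if (!Ctor) {
    return null;
  }
  const doc = new Ctor().parseFromString(html, 'text/html');
  // Drop non-content nodes first; textContent would otherwise include their text.
  doc.body?.querySelectorAll('script,style,noscript').forEach((el) => el.remove());
  return doc.body?.textContent?.replace(/\s+/g, ' ').trim() || '';
}
```

In a browser, calling extractTextWithDom(html) with no second argument picks up the native DOMParser. Returning null rather than an empty string when no parser exists keeps "nothing to extract" distinguishable from "could not parse".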

+      // Fallback: improved regex-based approach (repeat until no matches)
+      let sanitized = html;
+      let previous;
+      // Remove <script>...</script> blocks repeatedly
+      do {
+        previous = sanitized;
+        sanitized = sanitized.replace(/<script[\s\S]*?>[\s\S]*?<\/script[\s\S]*?>/gi, '');
+      } while (sanitized !== previous);
+      // Remove <style>...</style> blocks repeatedly
+      do {
+        previous = sanitized;
+        sanitized = sanitized.replace(/<style[\s\S]*?>[\s\S]*?<\/style[\s\S]*?>/gi, '');
+      } while (sanitized !== previous);
+      return sanitized
+        .replace(/<[^>]*>/g, ' ')
+        .replace(/\s+/g, ' ')
+        .trim();
+    }
   }
 }
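
For reference, the reason the fallback loops until the string stops changing can be shown with a small standalone sketch. The function names removeBlocksRepeatedly and extractTextFallback are hypothetical, and the tag-parameterized regex is a refactoring of the two loops in the diff rather than the PR's exact code. Regex stripping like this is a text-extraction aid, not an HTML sanitizer.

```typescript
// Removes <tag>...</tag> blocks until a pass makes no further change.
// A single pass can leave a new block behind when removing an inner match
// splices the surrounding fragments into a fresh opening tag.
function removeBlocksRepeatedly(input: string, tag: string): string {
  const pattern = new RegExp(
    `<${tag}[\\s\\S]*?>[\\s\\S]*?<\\/${tag}[\\s\\S]*?>`,
    'gi',
  );
  let current = input;
  let previous: string;
  do {
    previous = current;
    current = current.replace(pattern, '');
  } while (current !== previous);
  return current;
}

function extractTextFallback(html: string): string {
  const withoutScripts = removeBlocksRepeatedly(html, 'script');
  const withoutStyles = removeBlocksRepeatedly(withoutScripts, 'style');
  return withoutStyles
    .replace(/<[^>]*>/g, ' ') // replace remaining tags with a space
    .replace(/\s+/g, ' ') // collapse whitespace runs
    .trim();
}

// The nested input reassembles into a new <script> block after the first pass;
// the second pass removes it.
const nested = '<p>Hello</p><scr<script>x</script>ipt>alert(1)</script> world';
console.log(extractTextFallback(nested)); // prints "Hello world"
```

The [\s\S]*? pattern also matches across newlines, which the original .*? did not, so multi-line script and style blocks are removed as well.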
