Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parseHTML() unsuitable for XML/XHTML with self closing tags: textContent() returns next siblings' textContent #2644

Closed
gaabora opened this issue Aug 12, 2022 · 7 comments
Labels
bug docs evaluation needed proposal needs to be validated or tested before fully implementing it in k6

Comments

@gaabora
Copy link

gaabora commented Aug 12, 2022

Brief summary

see steps to reproduce

k6 version

0.39.0

OS

Windows 10

Docker version and image (if applicable)

No response

Steps to reproduce the problem

import { parseHTML } from 'k6/html'

export const options = {
  executor: 'shared-iterations',
}

export default function () {
  const html = '<!DOCTYPE html><html><body><div data-test="123"/><h1>My Heading</h1><p>My paragraph.</p></body></html>'
  const parsed = parseHTML(html)
  const value = parsed.find('div[data-test]').get(0).getAttribute('data-test')
  const text = parsed.find('div[data-test]').get(0).textContent()
  console.error({ value, text })
}

Expected behaviour

ERRO[0000] {"value":"123","text":""} source=console

Actual behaviour

ERRO[0000] {"value":"123","text":"My HeadingMy paragraph."} source=console

@gaabora gaabora added the bug label Aug 12, 2022
@na-- na-- added the evaluation needed proposal needs to be validated or tested before fully implementing it in k6 label Aug 12, 2022
@na--
Copy link
Member

na-- commented Aug 12, 2022

Thanks for opening this issue, though I am not sure this is a bug, exactly 🤔 It seems pretty clear from the code that this is actually the intended behavior:

func (e Element) TextContent() string {
return e.sel.sel.Text()
}

// Text gets the combined text contents of each element in the set of matched
// elements, including their descendants.
func (s *Selection) Text() string {
var buf bytes.Buffer
// Slightly optimized vs calling Each: no single selection object created
var f func(*html.Node)
f = func(n *html.Node) {
if n.Type == html.TextNode {
// Keep newlines and spaces, like jQuery
buf.WriteString(n.Data)
}
if n.FirstChild != nil {
for c := n.FirstChild; c != nil; c = c.NextSibling {
f(c)
}
}
}
for _, n := range s.Nodes {
f(n)
}
return buf.String()
}

And it seems that it roughly matches the Web API it is attempting to mimic, which also returns the text of the descendants: https://developer.mozilla.org/en-US/docs/Web/API/Node/textContent

So, yeah, this seems like it's working as intended. What is your use case and why is this behavior a problem for you?

@gaabora
Copy link
Author

gaabora commented Aug 12, 2022

<body>
  <div data-test="123"/>
  <h1>My Heading</h1>
  <p>My paragraph.</p>
</body>

h1 and p are not descendants of the div, div is self-closed as already mentioned, it does not have any descendants.

@gaabora
Copy link
Author

gaabora commented Aug 12, 2022

Weird, so browser according to web API also wraps them inside...
Actual use case is parsing xml document with some self-closed tags and getting text content of those tags.

@gaabora
Copy link
Author

gaabora commented Aug 12, 2022

Ok, sorry for disturbing, issue should be closed then,
For now, used this workaround to parse xml with self-closing tags with a regular expression, replacing them with normally closed ones:
'<!DOCTYPE html><html><body><div data-test="123"/><h1>My Heading</h1><p>My paragraph.</p></body></html>'.replace(/<(\w+)([^>]*)\/>/g, '<$1$2></$1>')

@gaabora gaabora closed this as completed Aug 12, 2022
@na--
Copy link
Member

na-- commented Aug 12, 2022

Ah, sorry, I apparently missed the "self closing" part of the title 🤦‍♂️ ☕

I guess this might also be reason enough to add explicit XML-parsing API in k6 as well 🤔 In the past, we've encouraged people to abuse parseHTML() as a quick and dirty alternative for also parsing XML content, which this shows might be a bad idea... 😞

@na--
Copy link
Member

na-- commented Aug 12, 2022

I'll reopen this and adjust the title a bit.

While k6 might follow browser conventions, I think this is something worth fixing, probably by having a proper XML parsing API. Which also might be needed for other purposes, e.g. #1539. Or, at the very least, we should document this limitation of parseHTML when it comes to XHTLM/XML content...

FWIW, parseHTML() uses https://github.com/PuerkitoBio/goquery, which in turn uses golang.org/x/net/html, which seems to have a bunch of FrozenDueToAge issues about self-closing tags 🤦‍♂️

@na-- na-- reopened this Aug 12, 2022
@na-- na-- changed the title bug: parseHTML() Selection textContent() returns next siblings textContent for self closing tags parseHTML() unsuitable for XML/XHTML with self closing tags: textContent() returns next siblings' textContent Aug 12, 2022
@na-- na-- changed the title parseHTML() unsuitable for XML/XHTML with self closing tags: textContent() returns next siblings' textContent parseHTML() unsuitable for XML/XHTML with self closing tags: textContent() returns next siblings' textContent Aug 12, 2022
@na-- na-- added the docs label Aug 12, 2022
@codebien
Copy link
Contributor

codebien commented Aug 12, 2024

As pointed out in the previous comment, this is a specific limit of the goquery library and its engine included in x/net/html.

Based on the lack of reactions, we don't plan to address this specific issue in k6. In the future, we might add a better html implementation if there is a need. Probably, in an extension more than a native module in k6 core.

These days the browser module exists and it might be an alternative for some of the cases originally covered by the HTML module.

@codebien codebien closed this as not planned Won't fix, can't repro, duplicate, stale Aug 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug docs evaluation needed proposal needs to be validated or tested before fully implementing it in k6
Projects
None yet
Development

No branches or pull requests

3 participants