Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Obsolete attributes regex doesn't account for valueless attribute use #3

Closed
mattbrundage opened this issue Aug 16, 2024 · 10 comments
Closed
Assignees

Comments

@mattbrundage
Copy link

Valueless attribute use is common with boolean attributes. Among your list of obsolete attributes, "noshade" and "nowrap" need special handling to account for scenarios such as <th nowrap>, but while also avoiding false positives such as <th title="The nowrap attribute is obsolete">.

Using regex on HTML is a minefield.

@j9t
Copy link
Owner

j9t commented Aug 16, 2024

Yes! This must be caught and dealt with appropriately. Will work on an update, thanks!

@j9t j9t self-assigned this Aug 16, 2024
@FabianBeiner
Copy link

const attributeRegex = new RegExp(`<[^>]*\\b${attribute}\\b(?=\\s*(=|\\s*[/]*>))`, 'i');

maybe?

@j9t
Copy link
Owner

j9t commented Aug 17, 2024

Thanks @FabianBeiner for the proactive feedback! Was just about to jump at this after preparing a (basic) test, but that regex seems to work!

Prepared PR #4 for this. Feel free to review and leave feedback—as the whole project is an AI test case, I’m working with and testing a number of tools to support, too.

@j9t
Copy link
Owner

j9t commented Aug 17, 2024

Wanted to make more time available for review, but accidentally merged the PR—still open to feedback if you have more thoughts! Thanks for the report, @mattbrundage, and the update, @FabianBeiner!

@j9t j9t closed this as completed Aug 17, 2024
@mattbrundage
Copy link
Author

@FabianBeiner your solution is an incremental improvement, but still matches sequences such as <th class=nowrap>, which happens to be a common convention in my own projects.

@FabianBeiner
Copy link

@FabianBeiner your solution is an incremental improvement, but still matches sequences such as <th class=nowrap>, which happens to be a common convention in my own projects.

Oh, the infamous unquoted attribute value syntax, of course I forgot about that. 🙈 Which brings us back to your original words:

Using regex on HTML is a minefield.

However, here is an update:

const attributeRegex = new RegExp(`<[^>]*\\s${attribute}\\b(\\s*=\\s*(?:"[^"]*"|'[^']*'|[^"'\\s>]+))?\\s*(?=/?>)`, 'i');

With attributes, we most likely will see a space before them, and I tried to consider that people might also use ' instead of " or nothing at all.

This should not work on

<th nowrap>
<th nowrap=nowrap>
<th nowrap="nowrap">
<th nowrap='nowrap'>
<th class="nowrap" nowrap>
<th class="nowrap" nowrap=nowrap>
<th class="nowrap" nowrap="nowrap">
<th class="nowrap" nowrap='nowrap'>

but not on

<th class="something nowrap">
<th class=nowrap>
<th title=nowrap>
<th title="The nowrap attribute is obsolete">

🤞🏻

@mattbrundage
Copy link
Author

@FabianBeiner LGTM

image

@j9t j9t reopened this Aug 19, 2024
@j9t
Copy link
Owner

j9t commented Aug 19, 2024

(Reopening, will review! Thanks for the updates…!)

@j9t
Copy link
Owner

j9t commented Aug 19, 2024

Prepared #9 for this, including a better test case (these may still be poor, but this one should finally catch what you described, @mattbrundage). The new regex seems to work well here, @FabianBeiner!

(Will let this sit for a moment and not merge right away.)

@j9t
Copy link
Owner

j9t commented Aug 20, 2024

Just pushed 1.6.2, which contains improvements.

Can imagine this needs more refining, but rely on issue reports right now to take care of it. Plan to review and extend the tests to cover more cases there.

@j9t j9t closed this as completed Aug 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants