khwilliamson
Contributor

@khwilliamson khwilliamson commented Sep 14, 2025

This function is described in its comments as 'terrifying', and by its original author, Larry Wall, as "truly awful". As a result, it has been mostly untouched since its introduction in 1993. That means it has not been updated as new language features have been added.

As an example, it does not know about lexical variables, so the code it has for globals does not work with the vast majority of modern coding practices.

Another example is that it knows nothing of UTF-8; as a result, simply changing the input encoding from Latin1 to UTF-8 can flip its outcome.

And it is buggy.

A few years ago, I set out to try to understand it. I added commentary and simplified some overly complicated expressions, but left its behavior unchanged.

Now I have set out to make some changes, and found many more issues than I had earlier. This commit adds commentary about those. Hopefully this will lead to some discussion and a consensus on the way forward.

  • This set of changes does not require a perldelta entry.

khwilliamson referenced this pull request Sep 15, 2025
That also avoids crashing on overrun.
@khwilliamson khwilliamson force-pushed the intuit_more_commentary branch 2 times, most recently from d1fffd6 to 3407c5f Compare September 22, 2025 22:03
This function is described in its comments as 'terrifying', and by its
original author, Larry Wall, as "truly awful".  As a result, it has been
mostly untouched since its introduction in 1993.  That means it has not
been updated as new language features have been added.

As an example, it does not know about lexical variables, so the code it
has for globals does not work with the vast majority of modern coding
practices.

Another example is that it knows nothing of UTF-8; as a result, simply
changing the input encoding from Latin1 to UTF-8 can flip its outcome.

And it is buggy.

An example of how hard this can be to get right is this fairly common
use in our test suite:

 [$A-Z]

That looks like a character class matching 27 characters.  But wait,
what if there exists a $A and a parameterless subroutine 'Z'?  Then this
could instead be an expression for a subscript.
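
The two readings can be written out unambiguously in ordinary code. This is only a sketch: the array @x, the values, and the body of Z are hypothetical, chosen for illustration. It does not show what the function itself decides for the ambiguous form, only that both interpretations are legitimate Perl.

```perl
use strict;
use warnings;

# Hypothetical names and values, for illustration only.
our $A = 3;
sub Z () { 1 }                  # a parameterless subroutine
my @x = qw(zero one two three);

# Reading 1: a subscript expression, $A minus the result of Z().
my $element = $x[$A-Z];         # index 3 - 1 = 2

# Reading 2: a character class.  Escaping the dollar sign rules out
# interpolation, leaving a literal '$' plus the range A-Z.
my $count = grep { chr($_) =~ /^[\$A-Z]$/ } 0 .. 127;

print "subscript reading selects: $element\n";
print "character class matches $count characters\n";
```

In an interpolating context the same text after a variable name has to be resolved to one of these by guesswork, which is the job of the function discussed here.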

A few years ago, I set out to try to understand it.  I added commentary
and simplified some overly complicated expressions, but left its
behavior unchanged.

Now I have set out to make some changes, and found many more issues
than I had earlier.  This commit adds commentary about those.  Hopefully
this will lead to some discussion and a consensus on the way forward.