@haileyok (Collaborator) commented Jan 10, 2026

Adding two new string UDFs:

  • StringTokenize, which converts the given text into a list of individual tokens (split at whitespace or punctuation). It isn't strictly necessary for this PR, but it's useful here and this feels like a good time to add it
  • CheckCensored, which builds a regex from a given input phrase and then checks a given input token against that regex
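
For reviewers who want a rough picture of the tokenizing step, here is a minimal sketch. It is not the code from this PR: the `tokenize` name and the `\w+` rule are assumptions chosen for illustration, and the real StringTokenize may treat apostrophes, underscores, or other punctuation differently.

```python
import re

# Hypothetical stand-in for StringTokenize (not the PR's implementation).
# Runs of word characters become tokens; whitespace and most punctuation
# act as delimiters. Note that \w includes the underscore, so "c___a___t"
# stays a single token under this rule.
_TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Return the list of word-like tokens found in text."""
    return _TOKEN_RE.findall(text)
```

Under this rule, `tokenize("Hello, world! It's me")` splits on the comma, the exclamation mark, the apostrophe, and the spaces.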

There's some existing functionality for lookalikes in string.py, specifically the StringClean UDF. The new UDF adds several things on top of that, all of which have come up quite a bit in the wild:

  • Still matches terms that are obfuscated with "separator" characters, e.g. "c___a___t" used to get around matching for "cat"
  • Handles zero-width spaces, which are often used for the same purpose
  • Handles a larger list of lookalike characters
  • Allows for matching only when a given term is obfuscated, e.g. it's fine to use "cat" but not okay to use "<4t"
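
The behaviors listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `LOOKALIKES`, `SEPARATORS`, `build_pattern`, `check_censored`, and the `obfuscated_only` flag are all hypothetical names, and the lookalike table here is deliberately tiny compared with the list the PR describes.

```python
import re

# Illustrative lookalike table (assumption). Each letter maps to the
# characters accepted in its place, including the letter itself.
LOOKALIKES = {
    "a": "a4@",
    "c": "c<(",
    "e": "e3",
    "i": "i1!",
    "o": "o0",
    "s": "s5$",
    "t": "t7+",
}

# Zero or more "separator" characters allowed between letters: whitespace,
# underscore, dot, hyphen, asterisk, and common zero-width characters.
SEPARATORS = r"[\s_.\-*\u200b\u200c\u200d\u2060\ufeff]*"


def build_pattern(phrase):
    """Build a regex matching phrase with lookalikes and separators."""
    parts = []
    for ch in phrase.lower():
        variants = LOOKALIKES.get(ch, ch)
        # One character class per letter, with every variant escaped.
        parts.append("[" + "".join(re.escape(v) for v in variants) + "]")
    return re.compile(SEPARATORS.join(parts), re.IGNORECASE)


def check_censored(token, phrase, obfuscated_only=False):
    """Return True if token matches phrase, plainly or in obfuscated form."""
    if build_pattern(phrase).fullmatch(token) is None:
        return False
    # With obfuscated_only set, the plain spelling is allowed through.
    if obfuscated_only and token.lower() == phrase.lower():
        return False
    return True
```

In this sketch, `check_censored("c___a___t", "cat")` and `check_censored("<4t", "cat")` both match, while `check_censored("cat", "cat", obfuscated_only=True)` does not, which mirrors the "only when obfuscated" mode described above.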

There's likely some additional overhead from using regex for this, but imo the additional flexibility/matching is worth it, and the overhead is likely not too extreme, particularly since we're only compiling once per token+config combo anyway.
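
One way to get the "compile once per token+config combo" behavior is to memoize the pattern builder. The sketch below uses `functools.lru_cache`; the `compiled_pattern` name, the `allow_separators` option, and the cache size are assumptions for illustration, and the PR may cache differently.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=4096)
def compiled_pattern(phrase, allow_separators=True):
    """Compile the regex for a (phrase, config) pair at most once.

    Hypothetical helper: the arguments form the cache key, so repeated
    checks against the same phrase and options reuse the compiled pattern.
    """
    sep = r"[\s_.\-]*" if allow_separators else ""
    body = sep.join(re.escape(ch) for ch in phrase)
    return re.compile(body, re.IGNORECASE)
```

Calling `compiled_pattern("cat", True)` twice returns the same pattern object, and `compiled_pattern.cache_info()` can be used to confirm the hit rate when profiling.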

@haileyok marked this pull request as ready for review January 14, 2026 03:40
@haileyok mentioned this pull request Jan 14, 2026