add unicode censored regex, string tokenizer udfs #108

haileyok · 2026-01-10T02:01:22Z

Adding two new string UDFs:

StringTokenize, which converts the given text into a list of individual tokens (split at whitespace or punctuation). Not strictly necessary for this PR, but useful alongside it, and this felt like a good time to add it.
CheckCensored (renamed to StringCheckCensored in a later commit), which builds a regex for a given input phrase, then checks a given input token against that regex.
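
For readers unfamiliar with this kind of tokenizer, here is a minimal sketch of splitting at whitespace or punctuation. It is an illustration only, not the code in this PR: the `tokenize` function and `_TOKEN_RE` name are hypothetical, and the actual StringTokenize UDF may treat edge cases (apostrophes, underscores, emoji) differently.

```python
import re
from typing import List

# Hypothetical helper, NOT the PR's StringTokenize implementation.
# \w+ matches runs of Unicode word characters (letters, digits, underscore),
# so any whitespace or punctuation between them acts as a separator.
_TOKEN_RE = re.compile(r"\w+")


def tokenize(text: str) -> List[str]:
    """Return the word-like tokens in text, dropping whitespace and punctuation."""
    return _TOKEN_RE.findall(text)


print(tokenize("Hello, world! it's 2026"))
# prints ['Hello', 'world', 'it', 's', '2026']
```

Note that `\w` includes the underscore, so a token like "c___a___t" survives as a single token in this sketch.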

string.py already has some lookalike handling, specifically the StringClean UDF. The new UDF adds several things on top of that, all of which have come up quite a bit in the wild:

Still matches terms obfuscated with "separator" characters, e.g. "c___a___t" used to get around a match for "cat"
Handles zero-width spaces, which are often used for the same purpose
Handles a larger list of lookalike characters
Allows matching only when a given term is obfuscated, e.g. "cat" is fine but "<4t" is not
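
The behaviors listed above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the PR's implementation: the `LOOKALIKES` table, the `SEPARATORS` class, and the `build_censored_pattern` / `check_censored` names and the `must_be_censored` parameter are all hypothetical, and the real character list is larger.

```python
import re

# Hypothetical lookalike table for illustration; the PR's real list is larger.
LOOKALIKES = {
    "a": "a4@",
    "c": "c<(",
    "t": "t7+",
}

# Assumed separator set: whitespace, dot, underscore, hyphen, and the
# zero-width characters U+200B, U+200C, U+200D, U+FEFF.
SEPARATORS = r"[\s._\-\u200b\u200c\u200d\ufeff]*"


def build_censored_pattern(phrase: str) -> re.Pattern:
    """Build a regex with one character class per letter of phrase,
    allowing any number of separator characters between letters."""
    parts = []
    for ch in phrase.lower():
        variants = LOOKALIKES.get(ch, ch)
        parts.append("[" + "".join(re.escape(v) for v in variants) + "]")
    return re.compile(SEPARATORS.join(parts), re.IGNORECASE)


def check_censored(phrase: str, token: str, must_be_censored: bool = False) -> bool:
    """Return True if token matches phrase, obfuscated or not.
    With must_be_censored=True, the plain spelling is not a match."""
    if build_censored_pattern(phrase).fullmatch(token) is None:
        return False
    if must_be_censored:
        return token.lower() != phrase.lower()
    return True
```

With these assumptions, `check_censored("cat", "c___a___t")` and `check_censored("cat", "<4t")` both match, while `check_censored("cat", "cat", must_be_censored=True)` does not.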

Using regex for this likely adds some overhead, but imo the additional flexibility/matching is worth it, and the overhead shouldn't be too extreme since we're only compiling once per token+config combo anyway.
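
To illustrate the compile-once idea, here is a small sketch that memoizes compiled patterns by their inputs. It uses `functools.lru_cache` as a stand-in; the PR itself reuses an existing singleton cache, which is not shown here. The `compiled_pattern` function and its `allow_separators` flag are hypothetical, and the flag merely stands in for whatever config the UDF accepts.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=1024)
def compiled_pattern(term: str, allow_separators: bool) -> re.Pattern:
    """Compile a pattern for (term, config) once; later calls with the
    same arguments return the cached compiled object."""
    # Assumed separator set, for illustration only.
    sep = r"[\s._\-]*" if allow_separators else ""
    return re.compile(sep.join(re.escape(ch) for ch in term), re.IGNORECASE)


first = compiled_pattern("cat", True)
second = compiled_pattern("cat", True)   # served from the cache
other = compiled_pattern("cat", False)   # different config, compiled separately
```

Keying the cache on both the term and the config means a config change never returns a stale pattern, at the cost of one compile per distinct combination.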

haileyok added 9 commits January 9, 2026 17:53

add unicode censored regex, string tokenizer udfs

171ffe6

register tokenize udf

f16cb55

add another test case

437fd11

function naming

92742f5

linter

26c3810

adjust must be censored tests/capture groups

7dd0062

rm unused assignment

d9640fb

use existing singleton pattern

ada19ea

update censor cache name

060f0fd

haileyok marked this pull request as ready for review January 14, 2026 03:40

haileyok requested review from a team, BinaryFiddler, EXBreder, ayubun, cmttt and jaredmiller13 as code owners January 14, 2026 03:40

haileyok mentioned this pull request Jan 14, 2026

add lists udf #112

Draft

haileyok added 2 commits January 16, 2026 16:50

rename to StringCheckCensored

0f01076

reorganize udf register

3ac4d81
