Fix locale bug in grep-find-unicode-wrapper BiDi character detection by assisted-by-ai · Pull Request #36 · Kicksecure/helper-scripts

assisted-by-ai · 2026-04-02T11:39:12Z

Summary

Fixed a critical locale-dependent bug in grep-find-unicode-wrapper: in non-UTF-8 locales (LANG=C, empty LANG), the Unicode BiDi character check did not match BiDi characters and instead produced false positives on normal files containing digits or hex characters.

Key Changes

- Fixed BiDi character detection pattern: Replaced bash $'\uXXXX' Unicode escape sequences with explicit \x byte sequences in the grep pattern. Bash expands $'\uXXXX' when it parses the command, using the caller's locale, not the LC_ALL=C that is set only for the grep command. In non-UTF-8 locales the escapes were therefore passed to grep as the literal characters \uXXXX, and the pattern became a bracket expression roughly equivalent to [0-9A-Fu\], which matches almost any file containing digits or hex values.
- Added comprehensive test suite: Created tests/test_grep_find_unicode_wrapper with 483 lines of tests covering:
  - Locale bug validation and documentation
  - Clean files (no false positives)
  - ASCII control characters (check 4)
  - BiDi/Trojan Source characters (CVE-2021-42574)
  - Invisible/zero-width Unicode characters
  - Homoglyph attacks
  - Unicode whitespace and separators
  - Tag characters
  - Malformed/overlong UTF-8 sequences
  - Sneaky embeddings in normal-looking files
  - Edge cases
- Added helper script: Created usr/libexec/helper-scripts/safe-rm-maybe.bsh, which provides a file removal function that uses safe-rm if available and otherwise falls back to rm.
- Updated test runner: Modified run-tests to execute the new comprehensive test suite.

Implementation Details

The fix converts the BiDi character detection from:

$'[\u061C\u200E\u200F\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069]'

To explicit UTF-8 byte sequences:

[\xD8\x9C\xE2\x80\x8E\xE2\x80\x8F\xE2\x80\xAA\xE2\x80\xAB\xE2\x80\xAC\xE2\x80\xAD\xE2\x80\xAE\xE2\x81\xA6\xE2\x81\xA7\xE2\x81\xA8\xE2\x81\xA9]

This ensures the pattern works correctly regardless of the caller's locale. BiDi characters are still caught by checks 1 and 2 (non-ASCII byte detection), so this is purely a false positive fix with no security bypass.
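The byte values in the replacement pattern can be cross-checked by hand. The following is a minimal sketch, not part of the PR: a hypothetical helper named utf8_escapes derives the \xNN form of a code point with plain shell arithmetic, so the result does not depend on the locale. It only handles code points up to U+FFFF, which covers the twelve BiDi characters listed above.

```sh
# Hypothetical helper (not from the PR): print the UTF-8 encoding of one
# code point, given as a number such as 0x202E, as a run of \xNN escapes.
# Only code points up to U+FFFF (1 to 3 bytes) are handled.
utf8_escapes() {
    cp=$(( $1 ))
    if [ "$cp" -lt 128 ]; then
        # 1 byte: 0xxxxxxx
        printf '\\x%02X' "$cp"
    elif [ "$cp" -lt 2048 ]; then
        # 2 bytes: 110xxxxx 10xxxxxx
        printf '\\x%02X\\x%02X' \
            $(( 0xC0 | (cp >> 6) )) \
            $(( 0x80 | (cp & 0x3F) ))
    else
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        printf '\\x%02X\\x%02X\\x%02X' \
            $(( 0xE0 | (cp >> 12) )) \
            $(( 0x80 | ((cp >> 6) & 0x3F) )) \
            $(( 0x80 | (cp & 0x3F) ))
    fi
}

# The twelve BiDi code points from the old pattern, in the same order.
for code_point in 0x061C 0x200E 0x200F 0x202A 0x202B 0x202C \
                  0x202D 0x202E 0x2066 0x2067 0x2068 0x2069; do
    utf8_escapes "$code_point"
done
printf '\n'
```

Run as is, the loop prints the same escape string that appears between the brackets of the new pattern. Whether grep interprets \xNN escapes depends on how the wrapper invokes it, which this description does not show.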

https://claude.ai/code/session_01726gqqGv3oaDV5jLbM6E6h

Tests 66 cases across 10 categories: clean files, ASCII control chars, BiDi/Trojan Source chars, invisible/zero-width chars, homoglyphs, Unicode spaces, tag characters, malformed UTF-8, sneaky embeddings, and edge cases. No actual bypass found: checks 1+2 catch all non-ASCII. Documents a locale bug: check 3's $'\uXXXX' expansion requires a UTF-8 locale. In non-UTF-8 locales the pattern degrades to literal chars, causing false positives, though BiDi detection is still covered by checks 1+2.

https://claude.ai/code/session_01726gqqGv3oaDV5jLbM6E6h
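Checks 1 and 2 themselves are not shown in this description. As an illustration only, byte-level detection of this kind could look like the sketch below. The function name has_suspicious_byte and the pattern are assumptions, not the wrapper's code. The sketch flags any byte that is neither printable ASCII nor ASCII whitespace, which includes every byte of a multi-byte UTF-8 sequence and also ASCII control characters.

```sh
# Hypothetical check (an assumption, not the wrapper's actual check 1 or 2):
# return 0 if stdin contains a byte that is neither printable ASCII nor
# ASCII whitespace. LC_ALL=C makes grep classify input byte by byte.
has_suspicious_byte() {
    LC_ALL=C grep -q '[^[:print:][:space:]]'
}

# U+202E (RIGHT-TO-LEFT OVERRIDE) is E2 80 AE in UTF-8; printf gets the
# three bytes as octal escapes so the script itself stays pure ASCII.
if printf 'access = "user\342\200\256"\n' | has_suspicious_byte; then
    printf 'flagged: input contains a non-ASCII or control byte\n'
fi
```

Because such a check looks at raw bytes, it does not need to know which code point the bytes encode, which is why a BiDi character would still be reported even when a more specific pattern fails.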

The wrapper's check 3 uses $'\uXXXX', which is expanded at bash parse time using the caller's locale, NOT the LC_ALL=C on the grep command. In non-UTF-8 locales, \u sequences pass through literally, creating a bracket expression of [0-9A-Fu\] that false-positives on files with digits, hex values, UUIDs, backslashes, or typical code. New tests verify this: 5/6 clean ASCII files trigger false positives in non-UTF-8 locales, 0/6 in UTF-8 locales. Fix: use \x byte sequences.

https://claude.ai/code/session_01726gqqGv3oaDV5jLbM6E6h
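The false-positive mechanism can be reproduced without changing the locale by handing grep the pattern exactly as it arrives when bash leaves the \u escapes unexpanded. This is a minimal sketch: the variable degraded_pattern, the function flags_clean_text, and the sample strings are made up for illustration and are not the files used by the PR's tests.

```sh
# The old pattern as grep receives it when $'\uXXXX' is not expanded.
# Single quotes keep it literal here, whatever the current locale is.
# Inside a bracket expression the backslash is an ordinary character, so
# the set is just: \ u 0 1 2 6 7 8 9 A B C D E F
degraded_pattern='[\u061C\u200E\u200F\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069]'

# Return 0 if the degraded pattern flags the given line of text.
flags_clean_text() {
    printf '%s\n' "$1" | LC_ALL=C grep -q -- "$degraded_pattern"
}

# Illustrative pure-ASCII samples: digits, a UUID-like string, a backslash,
# and one line that contains none of the characters in the set.
for sample in 'port=8080' 'uuid 3f2a-9c1d' 'C:\temp' 'hello world'; do
    if flags_clean_text "$sample"; then
        printf 'false positive: %s\n' "$sample"
    else
        printf 'not flagged:    %s\n' "$sample"
    fi
done
```

The first three samples are reported as false positives because they contain a digit, the letter u, or a backslash from the set. The last sample is not flagged.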

- grep-find-unicode-wrapper: check 3 used $'\uXXXX', which requires a UTF-8 locale at bash parse time. In non-UTF-8 locales the pattern degrades to literal ASCII chars, causing false positives. Fixed by using \x byte sequences. Old line kept commented out with explanation.
- test script: use get_colors.sh instead of custom color vars; handle old-pattern false positives as expected warns, not failures.
- run-tests: call tests/test_grep_find_unicode_wrapper.

https://claude.ai/code/session_01726gqqGv3oaDV5jLbM6E6h

…ventions

- Use real grep-find-unicode-wrapper binary instead of reimplementing the four grep checks locally. Tests exercise actual code paths.
- Source get_colors.sh for colors instead of custom color variables.
- Create safe-rm-maybe.bsh providing rm-safe-maybe function that uses safe-rm if installed, otherwise falls back to rm.
- Use long options (--recursive, --force, --directory, --delete, --address-radix, --format) instead of short flags.
- Use `| tee -- "$file_name"` instead of `> "$file_name"` for better xtrace output and error handling.
- Rename variable 'file' to 'file_name' to avoid collision with the standard unix 'file' utility.
- Remove assumption that tools might not be installed; tests require all tools available (installed from source or on disk).

https://claude.ai/code/session_01726gqqGv3oaDV5jLbM6E6h
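The contents of safe-rm-maybe.bsh are not included in this description. A fallback helper of the kind the commit message describes might look roughly like the sketch below. It is an assumption, not the PR's code: the function is spelled rm_safe_maybe so that it also runs in plain POSIX sh, and it uses command -v to stay self-contained, whereas a later commit in this PR switches to 'has' from has.sh.

```sh
# Hypothetical stand-in for the helper described in the commit message.
# All arguments are passed through unchanged to whichever tool is used.
rm_safe_maybe() {
    if command -v safe-rm >/dev/null 2>&1; then
        # safe-rm is installed: let it screen the paths before removal.
        safe-rm "$@"
    else
        # Fallback: plain rm with the same arguments.
        rm "$@"
    fi
}

# Usage: create a scratch file, then remove it through the helper.
scratch_file=$(mktemp)
printf 'temporary\n' > "$scratch_file"
rm_safe_maybe -- "$scratch_file"
```

Passing "$@" through untouched keeps the helper a drop-in replacement for rm, so callers can keep using whatever options they already pass.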

- Use 'has' from has.sh instead of 'command -v'.
- Use &>/dev/null instead of >/dev/null 2>&1.
- Replace offensive/risky test examples (rm -rf, root escalation) with safe alternatives (GOOD/BADX overwrite, harmless error messages).
- Move inline Python for long-line generation to separate script usr/libexec/helper-scripts/write-long-line-with-unicode.

https://claude.ai/code/session_01726gqqGv3oaDV5jLbM6E6h

ArrayBolt3 · 2026-04-08T03:16:45Z

Rejected, the test scripts are redundant with unicode-testscript and the bug in grep-find-unicode-wrapper was fixed by Patrick almost three months ago. We might integrate non-UTF-8 locale tests into unicode-testscript, but that needs more discussion.

claude added 5 commits April 1, 2026 16:40

assisted-by-ai wants to merge 5 commits into Kicksecure:master from assisted-by-ai:claude/test-unicode-wrapper-dfK4T
