Fix locale bug in grep-find-unicode-wrapper BiDi character detection #36

Open
assisted-by-ai wants to merge 5 commits into Kicksecure:master from assisted-by-ai:claude/test-unicode-wrapper-dfK4T

@assisted-by-ai

Summary

Fixes a critical locale-dependent bug in grep-find-unicode-wrapper: in non-UTF-8 locales (for example LANG=C or an empty LANG), the Unicode BiDi character check (check 3) stops matching BiDi characters and instead produces false positives on ordinary files that contain digits or hexadecimal characters.

Key Changes

  • Fixed BiDi character detection pattern: Replaced bash $'\uXXXX' Unicode escape sequences with explicit \x byte sequences in the grep pattern. Bash expands $'\uXXXX' when it parses the command line, using the caller's locale rather than the LC_ALL=C prefix, which applies only to the grep process. In non-UTF-8 locales the escapes are therefore passed to grep as literal text, and the bracket expression degrades to a set of ASCII characters (effectively [0-9A-Fu\]; strictly, a backslash, u, and the hex digits that occur in the listed code points) that matches almost any file containing digits or hex values.

  • Added comprehensive test suite: Created tests/test_grep_find_unicode_wrapper with 483 lines of tests covering:

    • Locale bug validation and documentation
    • Clean files (no false positives)
    • ASCII control characters (check 4)
    • BiDi/Trojan Source characters (CVE-2021-42574)
    • Invisible/zero-width Unicode characters
    • Homoglyph attacks
    • Unicode whitespace and separators
    • Tag characters
    • Malformed/overlong UTF-8 sequences
    • Sneaky embeddings in normal-looking files
    • Edge cases
  • Added helper script: Created usr/libexec/helper-scripts/safe-rm-maybe.bsh to provide a safe file removal function that uses safe-rm if available, otherwise falls back to rm.

  • Updated test runner: Modified run-tests to execute the new comprehensive test suite.
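
The false positive described in the first bullet can be reproduced without changing the locale by handing grep the pattern in the literal form it reportedly receives in a non-UTF-8 locale. This is a minimal sketch, not the wrapper's code: the helper name `degraded_pattern_matches`, the sample files, and the plain `grep` invocation are assumptions made for illustration.

```shell
#!/bin/sh
# The old pattern with its \uXXXX escapes left unexpanded, i.e. the literal
# text that (per the description) reaches grep in a non-UTF-8 locale.
degraded_pattern='[\u061C\u200E\u200F\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069]'

# Hypothetical helper: succeeds when the degraded pattern matches the file.
# Inside a bracket expression the backslash is an ordinary character, so the
# pattern is just a set of ASCII characters.
degraded_pattern_matches() {
  LC_ALL=C grep --quiet -- "$degraded_pattern" "$1"
}

demo_dir="$(mktemp -d)"
printf 'retries = 16\n' > "$demo_dir/digits.txt"  # pure ASCII, contains digits
printf 'hello world\n' > "$demo_dir/words.txt"    # pure ASCII, none of the set

for file_name in "$demo_dir/digits.txt" "$demo_dir/words.txt"; do
  if degraded_pattern_matches "$file_name"; then
    printf '%s: flagged (false positive)\n' "${file_name##*/}"
  else
    printf '%s: not flagged\n' "${file_name##*/}"
  fi
done

rm -r -- "$demo_dir"
```

Both sample files are clean ASCII, yet `digits.txt` is flagged because `1` and `6` are members of the degraded set, while `words.txt` is not flagged only because it happens to contain none of those characters.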

Implementation Details

The fix converts the BiDi character detection from:

$'[\u061C\u200E\u200F\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069]'

To explicit UTF-8 byte sequences:

[\xD8\x9C\xE2\x80\x8E\xE2\x80\x8F\xE2\x80\xAA\xE2\x80\xAB\xE2\x80\xAC\xE2\x80\xAD\xE2\x80\xAE\xE2\x81\xA6\xE2\x81\xA7\xE2\x81\xA8\xE2\x81\xA9]

This ensures the pattern works correctly regardless of the caller's locale. BiDi characters were already caught by checks 1 and 2 (non-ASCII byte detection) even when check 3 misbehaved, so this change only removes false positives; no detection bypass existed.
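
As a rough check of the byte-level pattern, the sketch below runs it against a file containing U+202E and against a clean ASCII file. It assumes GNU grep with `--perl-regexp` support and that the pattern reaches grep as the literal text shown above; how the wrapper actually invokes grep is not shown in this description, and `contains_bidi_bytes` is a hypothetical helper name.

```shell
#!/bin/sh
# Bracket expression copied from the description; grep (PCRE) interprets the
# \xHH escapes itself, so the shell's locale plays no part in the pattern.
bidi_byte_class='[\xD8\x9C\xE2\x80\x8E\xE2\x80\x8F\xE2\x80\xAA\xE2\x80\xAB\xE2\x80\xAC\xE2\x80\xAD\xE2\x80\xAE\xE2\x81\xA6\xE2\x81\xA7\xE2\x81\xA8\xE2\x81\xA9]'

# Hypothetical helper: succeeds when the file contains any byte in the class.
contains_bidi_bytes() {
  LC_ALL=C grep --quiet --perl-regexp -- "$bidi_byte_class" "$1"
}

demo_dir="$(mktemp -d)"
# U+202E (RIGHT-TO-LEFT OVERRIDE) is E2 80 AE in UTF-8, written here in octal
# so that printf behaves the same in any POSIX shell.
printf 'access\342\200\256level\n' > "$demo_dir/bidi.txt"
printf 'uuid 0F3A-61C2\n' > "$demo_dir/clean.txt"

for file_name in "$demo_dir/bidi.txt" "$demo_dir/clean.txt"; do
  if contains_bidi_bytes "$file_name"; then
    printf '%s: flagged\n' "${file_name##*/}"
  else
    printf '%s: not flagged\n' "${file_name##*/}"
  fi
done

rm -r -- "$demo_dir"
```

Under LC_ALL=C each `\xHH` inside the brackets is a single byte, so the class matches any one of the listed byte values rather than only the complete sequences. That is broader than the twelve code points, but every listed byte is non-ASCII, so clean ASCII files such as `clean.txt` are no longer flagged.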

https://claude.ai/code/session_01726gqqGv3oaDV5jLbM6E6h

claude added 5 commits April 1, 2026 16:40
Tests 66 cases across 10 categories: clean files, ASCII control chars,
BiDi/Trojan Source chars, invisible/zero-width chars, homoglyphs,
Unicode spaces, tag characters, malformed UTF-8, sneaky embeddings,
and edge cases. No actual bypass found - checks 1+2 catch all non-ASCII.

Documents a locale bug: check 3's $'\uXXXX' expansion requires a UTF-8
locale. In non-UTF-8 locales the pattern degrades to literal chars
causing false positives, though BiDi detection is still covered by
checks 1+2.

https://claude.ai/code/session_01726gqqGv3oaDV5jLbM6E6h

The wrapper's check 3 uses $'\uXXXX' which is expanded at bash parse
time using the caller's locale - NOT the LC_ALL=C on the grep command.
In non-UTF-8 locales, \u sequences pass through literally, creating a
bracket expression of [0-9A-Fu\] that false-positives on files with
digits, hex values, UUIDs, backslashes, or typical code.

New tests verify this: 5/6 clean ASCII files trigger false positives in
non-UTF-8 locales, 0/6 in UTF-8 locales. Fix: use \x byte sequences.

https://claude.ai/code/session_01726gqqGv3oaDV5jLbM6E6h

- grep-find-unicode-wrapper: check 3 used $'\uXXXX' which requires a
  UTF-8 locale at bash parse time. In non-UTF-8 locales the pattern
  degrades to literal ASCII chars causing false positives. Fixed by
  using \x byte sequences. Old line kept commented out with explanation.
- test script: use get_colors.sh instead of custom color vars, handle
  old-pattern false positives as expected warns not failures.
- run-tests: call tests/test_grep_find_unicode_wrapper.

https://claude.ai/code/session_01726gqqGv3oaDV5jLbM6E6h

…ventions

- Use real grep-find-unicode-wrapper binary instead of reimplementing
  the four grep checks locally. Tests exercise actual code paths.
- Source get_colors.sh for colors instead of custom color variables.
- Create safe-rm-maybe.bsh providing rm-safe-maybe function that uses
  safe-rm if installed, otherwise falls back to rm.
- Use long options (--recursive, --force, --directory, --delete,
  --address-radix, --format) instead of short flags.
- Use `| tee -- "$file_name"` instead of `> "$file_name"` for better
  xtrace output and error handling.
- Rename variable 'file' to 'file_name' to avoid collision with
  the standard unix 'file' utility.
- Remove assumption that tools might not be installed; tests require
  all tools available (installed from source or on disk).
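
A few of the conventions listed above (the safe-rm fallback, long options, `tee --`, and the `file_name` variable) can be sketched together. This is an illustration only: `rm_safe_maybe_sketch` is a hypothetical stand-in for the function in safe-rm-maybe.bsh, whose real body is not shown here, and the long options assume GNU coreutils. The availability check uses `command -v` purely to keep the sketch self-contained.

```shell
#!/bin/sh
# Hypothetical stand-in for rm-safe-maybe: prefer safe-rm when it is
# installed, otherwise fall back to plain rm with the same arguments.
rm_safe_maybe_sketch() {
  if command -v safe-rm >/dev/null 2>&1; then
    safe-rm "$@"
  else
    rm "$@"
  fi
}

work_dir="$(mktemp -d)"
file_name="$work_dir/sample.txt"

# Write the fixture through tee instead of a bare redirection.
printf 'sample\n' | tee -- "$file_name" >/dev/null

# Remove it again using the long option form.
rm_safe_maybe_sketch --force -- "$file_name"
rmdir -- "$work_dir"
```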

https://claude.ai/code/session_01726gqqGv3oaDV5jLbM6E6h

- Use 'has' from has.sh instead of 'command -v'.
- Use &>/dev/null instead of >/dev/null 2>&1.
- Replace offensive/risky test examples (rm -rf, root escalation) with
  safe alternatives (GOOD/BADX overwrite, harmless error messages).
- Move inline Python for long-line generation to separate script
  usr/libexec/helper-scripts/write-long-line-with-unicode.

https://claude.ai/code/session_01726gqqGv3oaDV5jLbM6E6h
@ArrayBolt3
Contributor

Rejected: the test scripts are redundant with unicode-testscript, and the bug in grep-find-unicode-wrapper was fixed by Patrick almost three months ago. We might integrate non-UTF-8 locale tests into unicode-testscript, but that needs more discussion.


3 participants