feat: --noise-filter preset + fix binary garbage in parser output #88
Open
Vasco0x4 wants to merge 4 commits into blacklanternsecurity:master from …
- is_text_file() now rejects files where >1% of decoded chars are Unicode replacement chars (U+FFFD), stopping charset-normalizer false positives on PE/DLL/binary files
- extract_text() now checks replacement char ratio after ANY extraction path (charset-normalizer or kreuzberg) and falls back to extract_strings_from_binary() when ratio exceeds 1%
- Removed grep -a flag to stop binary stdin being treated as text, which was causing massive single-line binary dumps even with -m 5

Fixes: large chunks of \xef\xbf\xbd garbage being logged as matches when binary files were misidentified as text or extracted with corrupt encoding.

https://claude.ai/code/session_01HhXFjA6jdctfoi1MTfG9jY
…stem noise

Adds two preset modes that auto-populate exclude_dirnames and exclude_extensions with well-known Windows system paths/extensions that clutter results without containing useful data:

- moderate: PolicyDefinitions (ADMX/ADML), WinSxS, Servicing
- aggressive: also System32, SysWOW64, Assembly, Fonts, Spool, Defender

Both modes also suppress: .adml .admx .mui .mof .cat .manifest

The presets feed directly into the existing dir/extension blacklist infrastructure, so they compose cleanly with --exclude-dirnames and --exclude-extensions.

https://claude.ai/code/session_01HhXFjA6jdctfoi1MTfG9jY
…FHCR Claude/review project structure dfhcr
I have read the CLA Document and I hereby sign the CLA
0 out of 2 committers have signed the CLA.
Summary
Two independent improvements targeting common pain points when running
MANSPIDER against Windows infrastructure.
1. --noise-filter {moderate,aggressive}: suppress Windows system noise
Running against domain controllers or file servers produces massive amounts
of results from Windows system paths (WinSxS, PolicyDefinitions, System32…)
that never contain useful data. This adds a --noise-filter flag with two presets:
- moderate: PolicyDefinitions, WinSxS, Servicing
- aggressive: everything in moderate, plus System32, SysWOW64, Assembly, Fonts, Spool, Defender

Both presets also exclude the extensions .adml .admx .mui .mof .cat .manifest

Presets feed directly into the existing exclude_dirnames/exclude_extensions infrastructure, so they compose cleanly with --exclude-dirnames and --exclude-extensions.

Usage:
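How a preset might merge into the existing exclusion lists can be sketched in Python. This is only an illustration under stated assumptions: the preset contents are copied from the commit message, while the names NOISE_PRESETS and apply_noise_filter, the lowercasing, and the sorted return values are hypothetical and not taken from MANSPIDER's code.

```python
# Hypothetical sketch of preset composition; names are illustrative,
# not MANSPIDER's actual implementation.
NOISE_EXTENSIONS = {".adml", ".admx", ".mui", ".mof", ".cat", ".manifest"}

# Directory names as written in the commit message.
MODERATE_DIRS = {"PolicyDefinitions", "WinSxS", "Servicing"}
AGGRESSIVE_DIRS = MODERATE_DIRS | {
    "System32", "SysWOW64", "Assembly", "Fonts", "Spool", "Defender",
}

NOISE_PRESETS = {
    "moderate": (MODERATE_DIRS, NOISE_EXTENSIONS),
    "aggressive": (AGGRESSIVE_DIRS, NOISE_EXTENSIONS),
}


def apply_noise_filter(preset, exclude_dirnames=(), exclude_extensions=()):
    """Merge a preset into user-supplied exclusions.

    Assumption: matching is case-insensitive, so everything is lowercased.
    Returns two sorted lists: (dirnames, extensions).
    """
    dirs = {d.lower() for d in exclude_dirnames}
    exts = {e.lower() for e in exclude_extensions}
    if preset is not None:
        preset_dirs, preset_exts = NOISE_PRESETS[preset]
        dirs |= {d.lower() for d in preset_dirs}
        exts |= {e.lower() for e in preset_exts}
    return sorted(dirs), sorted(exts)


# A preset combined with explicit exclusions yields the union of both.
dirs, exts = apply_noise_filter(
    "moderate", exclude_dirnames=["Temp"], exclude_extensions=[".log"]
)
print(dirs)
print(exts)
```

Because the merge is a set union, supplying a preset never removes an exclusion the user passed explicitly, which is one way to read "compose cleanly".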
2. Fix: binary garbage chunks in parser output
Binary files such as PE executables and DLLs were being misidentified as text by
charset-normalizer, producing massive \xef\xbf\xbd garbage dumps in match output. Fixed with:
- is_text_file() now rejects files where >1% of decoded chars are Unicode replacement chars (U+FFFD)
- extract_text() applies the same ratio check after any extraction path and falls back to extract_strings_from_binary() when the ratio is exceeded
- Removed the -a flag from grep to stop binary stdin being treated as text (was causing single-line binary dumps even with -m 5)

Testing
Tested against an internal AD environment with:
- … .dll, .exe) in accessible shares (parser fix)
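The replacement-character check and fallback described under the parser fix can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the 1% threshold and U+FFFD come from the description, while replacement_ratio, looks_like_text, printable_strings, and extract_text_sketch are hypothetical names, and a plain UTF-8 decode stands in for charset-normalizer or kreuzberg.

```python
import re

REPLACEMENT_CHAR = "\ufffd"
MAX_REPLACEMENT_RATIO = 0.01  # the 1% threshold named in the PR description


def replacement_ratio(text):
    """Fraction of characters in `text` that are U+FFFD."""
    if not text:
        return 0.0
    return text.count(REPLACEMENT_CHAR) / len(text)


def looks_like_text(text):
    """Accept decoded output only if at most 1% of it is U+FFFD."""
    return replacement_ratio(text) <= MAX_REPLACEMENT_RATIO


def printable_strings(data, min_len=4):
    """Hypothetical stand-in for a strings-style fallback:
    return runs of at least `min_len` printable ASCII bytes."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]


def extract_text_sketch(data):
    """Decode, then fall back to printable strings if the result is garbage.

    Assumption: a lossy UTF-8 decode replaces whichever extractor ran.
    """
    decoded = data.decode("utf-8", errors="replace")
    if looks_like_text(decoded):
        return decoded
    return "\n".join(printable_strings(data))


# Clean text passes through; binary input is reduced to its readable runs.
clean = "password=hunter2\n".encode("utf-8")
binary = (
    b"MZ\x90\x00" + b"\xff\xfe\x80" * 50
    + b"This program cannot be run in DOS mode" + b"\x81" * 20
)
print(extract_text_sketch(clean))
print(extract_text_sketch(binary))
```

Running the check after decoding, rather than only at file-type detection, means a misdetected encoding is caught on whichever extraction path produced the text.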