This document tracks planned improvements and future directions for Stringy. Items are organized by priority and timeframe.
Last updated: 2026-03-08
Priority: Medium
Encoding and BinaryFormat enums in src/types/mod.rs lack #[non_exhaustive], so adding a variant later would be a breaking change for downstream code that matches on them. Tag and public structs like ContainerInfo and FoundString already have it.
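A minimal sketch of what the attribute changes for callers. The enum below is a stand-in: its variant names are assumptions, not the real definition in src/types/mod.rs.

```rust
// Hypothetical stand-in for the real enum; variant names are assumed.
#[non_exhaustive]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Encoding {
    Ascii,
    Utf8,
    Utf16Le,
}

/// Maps an encoding to a display label.
// Justification: the wildcard arm is only required in downstream crates;
// inside the defining crate the compiler sees the match as already exhaustive.
#[allow(unreachable_patterns)]
pub fn label(encoding: Encoding) -> &'static str {
    match encoding {
        Encoding::Ascii => "ascii",
        Encoding::Utf8 => "utf-8",
        Encoding::Utf16Le => "utf-16le",
        // Downstream crates must write this arm, so a variant added in a
        // later release cannot break their build.
        _ => "unknown",
    }
}
```

The attribute has no effect inside the crate that defines the enum; it only constrains external matches and struct literals.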
Priority: Medium
ImportInfo, ExportInfo, and SectionInfo lack explicit new() constructors. Since other public structs use #[non_exhaustive] with constructors, these should follow the same pattern for API consistency.
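One possible shape for such a constructor, as a sketch only. The fields of ImportInfo shown here are assumptions, and the #[non_exhaustive] attribute is included to match the pattern described above rather than the current source.

```rust
/// Hypothetical shape of `ImportInfo`; the field set is assumed.
#[non_exhaustive]
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ImportInfo {
    pub name: String,
    pub library: Option<String>,
}

impl ImportInfo {
    /// Explicit constructor. External crates cannot build a
    /// `#[non_exhaustive]` struct with literal syntax, so `new()` is
    /// their only way to create a value.
    #[must_use]
    pub fn new(name: impl Into<String>, library: Option<String>) -> Self {
        Self {
            name: name.into(),
            library,
        }
    }
}
```

ExportInfo and SectionInfo would presumably follow the same pattern with their own fields.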
Priority: Medium
src/extraction/pe_resources/ is conceptually container analysis (parsing PE resource structures), not string extraction. Moving it to container/ would better reflect the data flow.
Priority: Medium
The extraction module imports from classification, creating a bidirectional dependency. Semantic enrichment should move to an orchestration layer that callers control.
Priority: Medium
JSON serialization failures currently use ConfigError, which is misleading. A dedicated SerializationError variant would improve error clarity.
Priority: Low
Replace generic ParseError(String) with InvalidPeError, InvalidElfError, InvalidMachOError for better diagnostics.
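A sketch of how the format-specific variants might look. Only the three variant names come from this item; the enum name, payload type, and message wording are assumptions, and the sketch uses std only rather than whatever error crate the project relies on.

```rust
use std::fmt;

/// Hypothetical error enum; the name and String payloads are assumed.
#[derive(Debug, PartialEq, Eq)]
pub enum StringyError {
    InvalidPeError(String),
    InvalidElfError(String),
    InvalidMachOError(String),
}

impl fmt::Display for StringyError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Each variant names the container format, so the caller no longer
        // has to infer it from a free-form message.
        match self {
            Self::InvalidPeError(msg) => write!(f, "invalid PE file: {msg}"),
            Self::InvalidElfError(msg) => write!(f, "invalid ELF file: {msg}"),
            Self::InvalidMachOError(msg) => write!(f, "invalid Mach-O file: {msg}"),
        }
    }
}

impl std::error::Error for StringyError {}
```

Callers can then branch on the variant instead of parsing the message text.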
Priority: Low
URL_REGEX runs twice on URLs (once in classify_url, again in classify_domain). The second pass could be eliminated by matching once and sharing the result.
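The general idea, match once and hand the result to both consumers, can be sketched as follows. Everything here is illustrative: `match_url` is a plain prefix check standing in for a single URL_REGEX pass, and the `Tag` variants and the domain rule are assumptions, not the real classifier logic.

```rust
/// Result of one URL match, shared by both classification steps.
pub struct UrlMatch<'a> {
    pub url: &'a str,
    pub host: &'a str,
}

/// Stand-in for a single `URL_REGEX` pass (prefix check, not the real regex).
pub fn match_url(text: &str) -> Option<UrlMatch<'_>> {
    let rest = text
        .strip_prefix("http://")
        .or_else(|| text.strip_prefix("https://"))?;
    // The host ends at the first port, path, query, or fragment delimiter.
    let end = rest
        .find(|c: char| matches!(c, '/' | '?' | '#' | ':'))
        .unwrap_or(rest.len());
    let host = &rest[..end];
    if host.is_empty() {
        return None;
    }
    Some(UrlMatch { url: text, host })
}

/// Hypothetical tags for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum Tag {
    Url,
    Domain,
}

/// Runs the URL match once and derives both tags from the same result.
pub fn classify(text: &str) -> Vec<Tag> {
    let mut tags = Vec::new();
    if let Some(m) = match_url(text) {
        tags.push(Tag::Url);
        // Assumed rule: a dotted host also counts as a domain.
        if m.host.contains('.') {
            tags.push(Tag::Domain);
        }
    }
    tags
}
```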
Priority: Medium
Some function signatures in docs/src/api.md may not match the current implementation.
Priority: Medium
Document the malware analysis use case, safe handling of untrusted binaries, and limitations when processing packed/obfuscated samples.
Priority: Medium
The deduplication feature is not covered in README.md or docs/src/string-extraction.md.
Priority: Medium
Use cargo-fuzz to fuzz container/*.rs parsers with malformed input. These are the primary attack surface for untrusted binaries.
The following files still exceed the 500-line project limit and should be split:
| File | Lines | Overage |
|---|---|---|
| `src/container/pe.rs` | 661 | +161 |
| `src/container/elf.rs` | 627 | +127 |
| `src/container/macho.rs` | 574 | +74 |
Priority: Medium
extract_load_command_strings() exists in src/extraction/macho_load_commands.rs and the StringSource::LoadCommand variant is defined, but load command extraction is not wired into BasicExtractor. It requires a separate manual call.
Priority: Low
Currently only the first architecture in a fat/universal binary is parsed. Multi-arch support would allow extracting strings from all slices.
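To illustrate what enumerating every slice involves, here is a hand-rolled sketch that walks a 32-bit fat header (big-endian magic 0xCAFEBABE, then 20-byte fat_arch entries). It is not how Stringy parses Mach-O today; the type and function names are hypothetical, and the 64-bit fat variant is not handled.

```rust
/// Magic number of a 32-bit fat (universal) header, stored big-endian.
const FAT_MAGIC: u32 = 0xCAFE_BABE;

/// One architecture slice described by a `fat_arch` entry (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct FatSlice {
    pub cpu_type: u32,
    pub offset: usize,
    pub size: usize,
}

/// Reads a big-endian u32 at `pos`, or `None` if out of bounds.
fn read_u32_be(data: &[u8], pos: usize) -> Option<u32> {
    let b = data.get(pos..pos.checked_add(4)?)?;
    Some(u32::from_be_bytes([b[0], b[1], b[2], b[3]]))
}

/// Lists every slice in the fat header instead of stopping at the first.
/// Returns `None` if the input is not a well-formed 32-bit fat binary.
pub fn fat_slices(data: &[u8]) -> Option<Vec<FatSlice>> {
    if read_u32_be(data, 0)? != FAT_MAGIC {
        return None;
    }
    let count = read_u32_be(data, 4)? as usize;
    let mut slices = Vec::new();
    for i in 0..count {
        // Entry layout: cputype, cpusubtype, offset, size, align (u32 each).
        let base = 8usize.checked_add(i.checked_mul(20)?)?;
        let cpu_type = read_u32_be(data, base)?;
        let offset = read_u32_be(data, base + 8)? as usize;
        let size = read_u32_be(data, base + 12)? as usize;
        // Reject slices that point outside the file.
        if offset.checked_add(size)? > data.len() {
            return None;
        }
        slices.push(FatSlice { cpu_type, offset, size });
    }
    Some(slices)
}
```

Each returned slice could then be passed to the existing single-architecture path, with results labelled by architecture.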
Priority: Low
All files in src/classification/patterns/ use once_cell::sync::Lazy. std::sync::LazyLock has been stable since Rust 1.80 and removes the external dependency.
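The migration is usually mechanical. The sketch below uses a hypothetical lookup table rather than one of the real pattern statics; with once_cell the same static would be declared as `Lazy<...>` and initialized with `Lazy::new(...)`.

```rust
use std::collections::HashSet;
use std::sync::LazyLock;

/// Hypothetical lazily built table, initialized on first access.
static SUSPICIOUS_EXTENSIONS: LazyLock<HashSet<&'static str>> =
    LazyLock::new(|| [".exe", ".dll", ".sys"].iter().copied().collect());

/// Returns true if `s` ends with one of the listed extensions.
pub fn has_suspicious_extension(s: &str) -> bool {
    SUSPICIOUS_EXTENSIONS.iter().any(|ext| s.ends_with(*ext))
}
```

LazyLock dereferences to the inner value just as Lazy does, so call sites should not need to change. The switch does raise the minimum supported Rust version to 1.80.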
Priority: Low
Section-by-section extraction is embarrassingly parallel. Using rayon could improve throughput on multi-core systems for large binaries, especially combined with memory mapping.
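A dependency-free sketch of the same idea, using std::thread::scope in place of rayon so it runs without extra crates. The function names are hypothetical, and counting printable-ASCII runs stands in for real extraction.

```rust
use std::thread;

/// Counts printable-ASCII runs of at least `min_len` bytes in one section.
fn count_ascii_strings(section: &[u8], min_len: usize) -> usize {
    section
        .split(|b| !matches!(*b, 0x20..=0x7e))
        .filter(|run| run.len() >= min_len)
        .count()
}

/// Processes each section on its own scoped thread and returns the
/// per-section counts in input order.
pub fn count_per_section(sections: &[&[u8]], min_len: usize) -> Vec<usize> {
    thread::scope(|scope| {
        // Spawn all workers first so they run concurrently.
        let handles: Vec<_> = sections
            .iter()
            .map(|section| scope.spawn(move || count_ascii_strings(section, min_len)))
            .collect();
        // Join in spawn order to keep results aligned with the input.
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}
```

One thread per section is fine for a handful of sections but scales poorly; a work-stealing pool such as rayon's would spread many small sections across a fixed number of threads.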
Priority: Low
FoundString fields currently clone strings. Using Cow<str> could avoid allocations when strings can be borrowed directly from mapped memory.
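A sketch of the borrowing behaviour, assuming a simplified struct. `FoundStr` and its fields are hypothetical and not the real FoundString; the point is that `String::from_utf8_lossy` already returns a `Cow<str>` that borrows when the bytes are valid UTF-8.

```rust
use std::borrow::Cow;

/// Hypothetical borrowed counterpart of `FoundString`; fields are assumed.
#[derive(Debug)]
pub struct FoundStr<'a> {
    pub text: Cow<'a, str>,
    pub offset: usize,
}

/// Borrows from the mapped bytes when they are valid UTF-8 and allocates
/// only when invalid sequences have to be replaced.
pub fn found_at(mapped: &[u8], offset: usize, len: usize) -> Option<FoundStr<'_>> {
    let bytes = mapped.get(offset..offset.checked_add(len)?)?;
    Some(FoundStr {
        text: String::from_utf8_lossy(bytes),
        offset,
    })
}
```

The trade-off is a lifetime parameter on the struct, which ties every result to the lifetime of the mapping and ripples through the public API.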
Priority: Low
Most strings have 0-3 tags. SmallVec<[Tag; 4]> would use stack allocation for the common case.
Priority: Low
Allow compile-time selection of output formats (json, yara, table) via Cargo features for smaller binaries.
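A rough sketch of feature gating, assuming feature names `json`, `yara`, and `table` that mirror the formats listed above. The function is hypothetical; real gating would apply `#[cfg]` to the formatter modules and their optional dependencies.

```rust
// Assumed Cargo.toml fragment (not taken from the project):
//   [features]
//   default = ["json", "yara", "table"]
//   json = []
//   yara = []
//   table = []

/// Names of the output formats compiled into this build.
// Justification: with no features enabled nothing is pushed, so the
// binding is never mutated in that configuration.
#[allow(unused_mut)]
pub fn compiled_formats() -> Vec<&'static str> {
    let mut formats = Vec::new();
    // Each push exists only when the matching feature is enabled.
    #[cfg(feature = "json")]
    formats.push("json");
    #[cfg(feature = "yara")]
    formats.push("yara");
    #[cfg(feature = "table")]
    formats.push("table");
    formats
}
```

A build with `--no-default-features --features json` would then compile only the JSON path.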
- Light XREF hinting: Check ELF relocations targeting `.rodata` addresses; strings with inbound relocs rank higher
- Capstone-lite pass: Scan for immediates in `.text` that point into string pools; mark as "referenced" (flag only, no CFG)
- DWARF skim: Extract function/file names with `gimli` to augment context
- PDB integration: Use `pdb` crate to enrich imports/function names (no symbol server fetch)
- Go build info: Detect Go binaries and extract build paths, module info
- .NET metadata: Surface .NET-specific strings and metadata
- UPX/packer detection: Detect common packers; offer `--expect-upx` mode to reduce false negatives
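As a small illustration of the packer-detection item, the sketch below scans for the `UPX!` marker that UPX normally embeds in packed files. The function name is hypothetical and this is a heuristic only: the marker can be stripped or altered, and other packers need their own signatures.

```rust
/// Returns true if the buffer contains the `UPX!` marker.
/// Heuristic: absence of the marker does not prove the file is unpacked.
pub fn looks_upx_packed(data: &[u8]) -> bool {
    const MARKER: &[u8] = b"UPX!";
    // Slide a marker-sized window over the buffer; buffers shorter than
    // the marker yield no windows, so the result is false.
    data.windows(MARKER.len()).any(|w| w == MARKER)
}
```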
- `--diff old.bin new.bin` to highlight string deltas between binary versions
- `--mask common` to drop common libc/CRT strings and reduce noise
- `--profile malware` to enhance tags with suspicious keywords, cloud endpoints, and telemetry beacons
- Stable NDJSON schema for pipeline integration with `jq` and similar tools
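The core of the proposed diff mode is a set difference over the extracted strings. A minimal sketch, with hypothetical type and function names and plain string slices standing in for full extraction results:

```rust
use std::collections::BTreeSet;

/// String-level delta between two extraction runs (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct StringDiff<'a> {
    pub added: Vec<&'a str>,
    pub removed: Vec<&'a str>,
}

/// Compares the unique strings of two binaries. Output is sorted because
/// the sets are ordered.
pub fn diff_strings<'a>(old: &[&'a str], new: &[&'a str]) -> StringDiff<'a> {
    let old_set: BTreeSet<&str> = old.iter().copied().collect();
    let new_set: BTreeSet<&str> = new.iter().copied().collect();
    StringDiff {
        added: new_set.difference(&old_set).copied().collect(),
        removed: old_set.difference(&new_set).copied().collect(),
    }
}
```

This ignores offsets and duplicate counts, so a string that only moved between versions would not appear in the delta.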
- Split `extraction/mod.rs` into focused submodules (PR #135)
- Split all oversized extraction files (`ascii`, `utf16`, `dedup`, `filters`, `pe_resources`, `table`)
- Implement working CLI with format selection, min-length filtering
- Add criterion benchmarks (`benches/`)
- Set up code coverage in CI (`cargo llvm-cov` + Codecov)
- Add `#[non_exhaustive]` to `Tag`, `OutputFormat`, `ContainerInfo`, `FoundString`
- Add `OutputFormatter` trait for extensibility
- Fix O(n^2) deduplication using HashSet
- Add `Hash` derive to `Encoding` and `StringSource` enums
- Fix failing doctests in extraction module
- Create CHANGELOG.md
- Add `#[allow]` justification comments to all directives
- Wire up CLI to full extraction pipeline
- Add memory-mapped file I/O via `mmap-guard` (replaces `std::fs::read`)