High-Ratio Archiver targeting Structured Binaries, Signals & Raw Data
🚀 The Hook: Built in ~48 hours as a "Human + AI" pair-programming experiment, PMC competes with (and often beats) state-of-the-art tools like `zpaq` and `xz` on specific workloads such as executables, medical imaging, and audio.
PMC (Predictive Mix Context) is a single-file lossless data compressor written in C. It uses a Context Mixing (PAQ-like) architecture heavily tuned for structured data detection. Unlike general-purpose compressors that treat everything as a stream of bytes, PMC attempts to "understand" the file structure—detecting tables in binaries, pixel grids in raw images, and wave patterns in audio—to apply specialized models.
- Cyborg Architecture: Combines 52 context models, including Sparse (for binaries), Order-N (for text), and Geometric (for images/audio).
- Smart Detection: Automatically identifies ELF/PE binaries, BMPs, WAVs, and even headerless Raw Images (like MRI/X-Ray scans) to switch compression strategies dynamically.
- Adaptive Gating: "Silences" models that are adding noise. If it detects text, it turns off binary models; if it detects code, it turns off text models.
- 0 Dependencies: Pure C99. Just `gcc -O2`.
We tested PMC against industry standards on the Silesia Corpus plus a custom set of specialized files (ELF binary, Raw X-Ray, WAV).
The Verdict:
- 🏆 PMC Wins: On structured binaries (`file`, `mozilla`), medical images (x-ray), and audio (wav), PMC outperforms `zpaq -5` and `xz -9`.
- ⚔️ Competitive: On mixed data (`ooffice`, `osdb`), it trades blows with `zpaq`.
- ⚠️ Limitations: On pure literature (`dickens`, `webster`), `zpaq`'s deep language models still reign supreme.
Percentages are compressed size as a share of the original (lower is better).

| File Type | File | Original | xz (LZMA2) | zpaq -5 | PMC v4.6 | vs zpaq |
|---|---|---|---|---|---|---|
| ELF Binary | file | 6.5 MB | 14.7% | 12.2% | 11.9% | -0.3% |
| Audio | WAV (48 kHz) | 2.8 MB | 53.7% | 42.2% | 27.8% | -14.4% |
| Bitmap | BMP (24-bit) | 3.2 MB | 64.3% | 53.1% | 49.1% | -4.0% |
| Medical | X-Ray (16-bit) | 8.4 MB | 53.0% | 43.3% | 45.4% | +2.1%* |
*Note: PMC v4.6 auto-detects the raw X-Ray geometry without headers, a feature most compressors lack.
| File | Original | gzip | zstd | xz | zpaq -5 | PMC |
|---|---|---|---|---|---|---|
| mozilla (Tar) | 51.2 MB | 37.2% | 36.1% | 26.4% | 23.5% | 25.1% |
| ooffice (Zip) | 6.1 MB | 50.3% | 51.1% | 39.4% | 28.7% | 30.6% |
| samba (Src) | 21.6 MB | 25.3% | 23.0% | 17.5% | 14.1% | 16.0% |
| dickens (Txt) | 10.2 MB | 38.0% | 36.0% | 27.8% | 20.6% | 23.2% |
| TOTAL | 224 MB | 33.1% | 32.3% | 24.0% | 19.1% | 20.5% |
PMC is a research prototype. It prioritizes ratio over speed. Decompression is symmetric (slow).
Building

    gcc -O2 -o compressor compressor.c -lm

Running

    # Compress
    ./compressor c input_file output.pmc

    # Decompress
    ./compressor d output.pmc recovered_file

    # Verify
    diff input_file recovered_file

PMC uses a 4-stage pipeline to maximize probability prediction accuracy.
Before compression, PMC analyzes the data block to apply reversible transforms:
- BCJ x86 Filter: Converts relative E8 (CALL) addresses to absolute for better compression of x86 binaries. Applied before compression, reversed after decompression (a sketch follows below).
- BMP Vertical Delta: For 24/32-bit BMP images, subtracts each pixel row from the row above.
- WAV Per-Channel Delta: For 16-bit PCM WAV files, subtracts the previous sample of the same channel. Uses 16-bit arithmetic to handle carry across byte boundaries.
- Raw Image Auto-Detection: For files without recognized headers (like x-ray), scans candidate row widths (128–16384) and measures vertical byte correlation. If detected, applies a 16-bit vertical delta filter (also sketched below).
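For illustration, here is a minimal sketch of the classic E8/CALL transform the BCJ item refers to. It is not lifted from compressor.c; PMC's actual filter may mask addresses or handle block boundaries differently.

```c
#include <stdint.h>
#include <stddef.h>

/* After an 0xE8 opcode, x86 stores a 32-bit displacement relative to the end
 * of the instruction.  Rewriting it as an absolute target makes repeated
 * calls to the same function byte-identical, which context models exploit. */
static void bcj_e8_encode(uint8_t *buf, size_t n) {
    for (size_t i = 0; i + 5 <= n; i++) {
        if (buf[i] == 0xE8) {
            uint32_t rel = (uint32_t)buf[i+1] | ((uint32_t)buf[i+2] << 8) |
                           ((uint32_t)buf[i+3] << 16) | ((uint32_t)buf[i+4] << 24);
            uint32_t abs = rel + (uint32_t)(i + 5);   /* relative -> absolute */
            buf[i+1] = (uint8_t)abs;
            buf[i+2] = (uint8_t)(abs >> 8);
            buf[i+3] = (uint8_t)(abs >> 16);
            buf[i+4] = (uint8_t)(abs >> 24);
            i += 4;   /* skip the displacement we just rewrote */
        }
    }
}

/* The decoder applies the exact inverse (absolute -> relative) after
 * decompression; because both sides skip the same bytes, the scan stays
 * synchronized and the transform is fully reversible. */
static void bcj_e8_decode(uint8_t *buf, size_t n) {
    for (size_t i = 0; i + 5 <= n; i++) {
        if (buf[i] == 0xE8) {
            uint32_t abs = (uint32_t)buf[i+1] | ((uint32_t)buf[i+2] << 8) |
                           ((uint32_t)buf[i+3] << 16) | ((uint32_t)buf[i+4] << 24);
            uint32_t rel = abs - (uint32_t)(i + 5);
            buf[i+1] = (uint8_t)rel;
            buf[i+2] = (uint8_t)(rel >> 8);
            buf[i+3] = (uint8_t)(rel >> 16);
            buf[i+4] = (uint8_t)(rel >> 24);
            i += 4;
        }
    }
}
```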
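The raw-image auto-detection and the 16-bit vertical delta can be pictured with the sketch below. The sampling stride, candidate stepping, and the missing acceptance threshold are placeholders; PMC's real heuristic is more selective.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>

/* Guess the row width of a headerless raster: for each candidate width,
 * measure how close each byte is to the byte one row above.  A strongly
 * periodic 2-D signal (X-ray, raw bitmap) produces a sharp minimum.
 * A real detector would normalize per-width sample counts, step candidates
 * more coarsely, and reject weak minima against a baseline. */
static size_t guess_row_width(const uint8_t *buf, size_t n) {
    size_t   best_w = 0;
    uint64_t best_score = UINT64_MAX;
    for (size_t w = 128; w <= 16384 && 2 * w < n; w++) {
        uint64_t score = 0;
        for (size_t i = w; i < n; i += 251)    /* sample, don't touch every byte */
            score += (uint64_t)abs((int)buf[i] - (int)buf[i - w]);
        if (score < best_score) { best_score = score; best_w = w; }
    }
    return best_w;
}

/* 16-bit vertical delta over the detected width (the WAV/BMP filters work in
 * the same spirit).  Walk backwards so every reference sample is still
 * original data; the decoder runs the same loop forwards and adds instead.
 * Assumes little-endian samples at even offsets and an even width. */
static void vertical_delta16(uint8_t *buf, size_t n, size_t width) {
    if (width < 2 || n < width + 2) return;
    for (size_t i = (n - 2) & ~(size_t)1; i >= width; i -= 2) {
        uint16_t cur = (uint16_t)(buf[i] | (buf[i + 1] << 8));
        uint16_t up  = (uint16_t)(buf[i - width] | (buf[i - width + 1] << 8));
        uint16_t d   = (uint16_t)(cur - up);   /* wraps mod 2^16: reversible */
        buf[i]     = (uint8_t)d;
        buf[i + 1] = (uint8_t)(d >> 8);
    }
}
```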
Instead of a single algorithm, PMC uses an ensemble of experts:
- 9 Order-N Context Models: Orders 0, 1, 2, 3, 4, 6, 8, 12, 16. Each maps a hashed byte context to a state byte via finite-state counters.
- 34 Sparse Context Models: The "secret sauce" for binaries. They look at non-adjacent bytes (e.g., `byte[i-4]` and `byte[i-8]`) to find table columns and struct fields. Each uses a 4M-entry tagged state table (~8 MB). A simplified sketch of this idea appears below.
- Linguistic Models:
  - Word Model: Rolling hash of the current alphanumeric word.
  - Word Trigram: Hash of the previous two words + current position.
  - Shadow Dictionary: Maintains unigram/bigram tables to predict entire next words.
- Geometric Models:
  - Stride/Record Model: Auto-detects data periodicity per block (e.g., stride=24 for ELF symbol tables) and predicts from `blk[i - stride]`.
- Correction Models:
  - Match Model: Hash-chain match finder (depth 512 for text, 16 for binary).
  - ICM (Indirect Context Model): Maps Order-6 hashes to adaptive probability counters.
  - LZP (Lempel-Ziv Prediction): Order-4 context hash maps to a predicted next byte (a sketch follows below).
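To make the hashed-context idea behind the order-N and sparse models concrete, here is a heavily simplified sketch. The names (`SparseModel`, `sparse_predict`), table size handling, and the plain adaptive probability are illustrative only; PMC's real models use finite-state counters and different hash mixing.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>

#define TBL_BITS 22                      /* 4M entries, as in the sparse models */
#define TBL_SIZE (1u << TBL_BITS)

typedef struct {
    uint8_t  tag;                        /* low hash bits, to detect collisions */
    uint16_t p;                          /* P(next bit = 1), 16-bit fixed point */
} Slot;

typedef struct {
    Slot    *t;                          /* TBL_SIZE slots, allocated at start-up */
    int      off1, off2;                 /* the two non-adjacent history bytes used */
    uint32_t idx;                        /* slot chosen for the current bit */
} SparseModel;

static int sparse_init(SparseModel *m, int off1, int off2) {
    m->t = calloc(TBL_SIZE, sizeof(Slot));
    m->off1 = off1; m->off2 = off2; m->idx = 0;
    return m->t != NULL;
}

/* Hash the two skipped-back bytes plus the bits of the current byte decoded
 * so far, then locate a tagged table slot. */
static uint16_t sparse_predict(SparseModel *m, const uint8_t *hist, size_t pos,
                               uint32_t partial) {
    uint32_t a = pos >= (size_t)m->off1 ? hist[pos - m->off1] : 0;
    uint32_t b = pos >= (size_t)m->off2 ? hist[pos - m->off2] : 0;
    uint32_t h = (a * 0x9E3779B1u) ^ (b * 0x85EBCA77u) ^ (partial * 0xC2B2AE3Du);
    m->idx = (h >> 8) & (TBL_SIZE - 1);
    Slot *s = &m->t[m->idx];
    if (s->tag != (uint8_t)h) {          /* collision: reset the slot */
        s->tag = (uint8_t)h;
        s->p   = 32768;                  /* start at p = 0.5 */
    }
    return s->p;
}

/* After coding the bit, pull the slot's probability toward the outcome. */
static void sparse_update(SparseModel *m, int bit) {
    Slot *s = &m->t[m->idx];
    int target = bit ? 65535 : 0;
    s->p = (uint16_t)(s->p + (target - s->p) / 32);   /* learning rate 1/32 */
}
```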
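The LZP model is even smaller in spirit: one table entry per order-4 context hash remembers the byte that followed that context last time. A hypothetical sketch (table size, hash, and names are not PMC's):

```c
#include <stdint.h>
#include <stddef.h>

#define LZP_BITS 20
#define LZP_SIZE (1u << LZP_BITS)

/* One byte per context hash: the byte that followed that context last time. */
static uint8_t lzp_next[LZP_SIZE];

/* FNV-1a over the previous four bytes, folded into a table index. */
static uint32_t lzp_hash(const uint8_t *hist, size_t pos) {
    uint32_t h = 2166136261u;
    for (int k = 1; k <= 4 && (size_t)k <= pos; k++)
        h = (h ^ hist[pos - k]) * 16777619u;
    return h & (LZP_SIZE - 1);
}

/* The byte LZP expects next; the caller converts "the bits seen so far still
 * agree with this byte" into a probability the mixer can weigh. */
static uint8_t lzp_predict(const uint8_t *hist, size_t pos) {
    return lzp_next[lzp_hash(hist, pos)];
}

/* Once the real byte is known, remember it for this context. */
static void lzp_update(const uint8_t *hist, size_t pos, uint8_t byte) {
    lzp_next[lzp_hash(hist, pos)] = byte;
}
```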
- Dual Adaptive Logistic Mixers: Mixer 1 uses bit-position and context; Mixer 2 uses bit context. Outputs are averaged in the logit domain (a mixing sketch follows below).
- Text Block Gating: Blocks detected as text have their 34 sparse models, delta model, and stride model zeroed. This eliminates noise from binary-focused models.
- 3-Stage APM (Adaptive Probability Map): A post-processing stage that "learns the mixer's errors" (sketched below).
- Word-Aware APM: (Text only) Learns per-word correction biases — e.g., if the mixer systematically mispredicts a silent letter, this stage learns the correction.
- SSE (Secondary Symbol Estimation): Final polish using quantized probability and bit history.
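The mixing stage is built on PAQ-style logistic mixing. Below is a floating-point sketch of that technique; PMC's fixed-point mixers and their selection contexts will differ, and the learning rate here is a placeholder.

```c
#include <math.h>

#define NUM_MODELS 52    /* PMC mixes 52 model outputs (count from this README) */

/* stretch/squash convert probabilities to and from the logit domain. */
static double stretch(double p) {
    if (p < 1e-6)       p = 1e-6;        /* keep the logit finite */
    if (p > 1.0 - 1e-6) p = 1.0 - 1e-6;
    return log(p / (1.0 - p));
}
static double squash(double t) { return 1.0 / (1.0 + exp(-t)); }

typedef struct { double w[NUM_MODELS]; } Mixer;   /* weights may start at 0 */

/* Weighted sum of the experts' logits, squashed back to a probability. */
static double mixer_predict(const Mixer *m, const double p[NUM_MODELS],
                            double st[NUM_MODELS]) {
    double dot = 0.0;
    for (int i = 0; i < NUM_MODELS; i++) {
        st[i] = stretch(p[i]);
        dot  += m->w[i] * st[i];
    }
    return squash(dot);
}

/* After the real bit is known, move each weight along the error gradient;
 * experts that predict well in this context gain influence. */
static void mixer_update(Mixer *m, const double st[NUM_MODELS],
                         double predicted, int bit, double lr) {
    double err = (double)bit - predicted;
    for (int i = 0; i < NUM_MODELS; i++)
        m->w[i] += lr * err * st[i];
}
```

PMC keeps two such mixers, selected by different contexts, and averages their outputs in the same logit domain.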
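The APM/SSE stages can be pictured as small adaptive lookup tables keyed by a context and the quantized incoming probability. A simplified floating-point sketch; the bin count, the context choice, and the learning rate are placeholders, not PMC's values.

```c
#include <math.h>

#define APM_BINS 33          /* bins spanning the stretched probability axis */
#define APM_CTX  256         /* e.g., the previous byte as context */

static double apm_stretch(double p) { return log(p / (1.0 - p)); }
static double apm_squash(double t)  { return 1.0 / (1.0 + exp(-t)); }

typedef struct { double t[APM_CTX][APM_BINS]; } Apm;

/* Start as (approximately) the identity map, so an untrained APM is harmless. */
static void apm_init(Apm *a) {
    for (int c = 0; c < APM_CTX; c++)
        for (int i = 0; i < APM_BINS; i++)
            a->t[c][i] = apm_squash((i - (APM_BINS - 1) / 2.0) * 0.5);
}

/* Map the mixer's probability into bin space and interpolate two entries. */
static double apm_refine(const Apm *a, int ctx, double p, int *lo, double *fr) {
    double x = apm_stretch(p) * 2.0 + (APM_BINS - 1) / 2.0;
    if (x < 0.0)              x = 0.0;
    if (x > APM_BINS - 1.001) x = APM_BINS - 1.001;
    *lo = (int)x;
    *fr = x - *lo;
    return a->t[ctx][*lo] * (1.0 - *fr) + a->t[ctx][*lo + 1] * (*fr);
}

/* Nudge the two touched entries toward the bit that actually occurred. */
static void apm_learn(Apm *a, int ctx, int lo, double fr, int bit, double rate) {
    a->t[ctx][lo]     += rate * (1.0 - fr) * ((double)bit - a->t[ctx][lo]);
    a->t[ctx][lo + 1] += rate * fr         * ((double)bit - a->t[ctx][lo + 1]);
}
```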
rANS (Asymmetric Numeral Systems): Uses a 4-way interleaved binary rANS coder for high throughput and compression density close to Arithmetic Coding.
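The four-way interleaved coder itself is beyond a short example, but the underlying binary rANS step is compact. The self-contained sketch below (not taken from compressor.c) encodes a few bits against 12-bit probabilities and decodes them back; note that rANS encodes symbols in reverse and writes its byte stream backwards.

```c
#include <stdint.h>
#include <stdio.h>

#define PROB_BITS  12
#define PROB_SCALE (1u << PROB_BITS)   /* probabilities are 12-bit: total = 4096 */
#define RANS_L     (1u << 23)          /* lower bound of the normalized state */

/* Encode one bit with P(bit=1) = p1/4096.  Bytes are written *backwards*. */
static uint32_t enc_bit(uint32_t x, int bit, uint32_t p1, uint8_t **out) {
    uint32_t freq  = bit ? p1 : PROB_SCALE - p1;
    uint32_t start = bit ? 0  : p1;
    uint32_t x_max = ((RANS_L >> PROB_BITS) << 8) * freq;
    while (x >= x_max) { *--(*out) = (uint8_t)x; x >>= 8; }   /* renormalize */
    return ((x / freq) << PROB_BITS) + (x % freq) + start;
}

/* Decode one bit; reads bytes forwards from the stream written above. */
static int dec_bit(uint32_t *x, uint32_t p1, const uint8_t **in) {
    uint32_t slot  = *x & (PROB_SCALE - 1);
    int      bit   = slot < p1;
    uint32_t freq  = bit ? p1 : PROB_SCALE - p1;
    uint32_t start = bit ? 0  : p1;
    *x = freq * (*x >> PROB_BITS) + slot - start;
    while (*x < RANS_L) *x = (*x << 8) | *(*in)++;            /* renormalize */
    return bit;
}

int main(void) {
    int      bits[8] = {1, 0, 1, 1, 0, 0, 1, 0};
    uint32_t p1[8]   = {3000, 500, 2048, 3900, 100, 1000, 2500, 200};
    uint8_t  buf[64], *wp = buf + sizeof buf;    /* write pointer moves down */

    uint32_t x = RANS_L;
    for (int i = 7; i >= 0; i--)                 /* rANS encodes in reverse */
        x = enc_bit(x, bits[i], p1[i], &wp);
    for (int k = 0; k < 4; k++) { *--wp = (uint8_t)x; x >>= 8; }  /* flush state */

    const uint8_t *rp = wp;
    uint32_t y = 0;
    for (int k = 0; k < 4; k++) y = (y << 8) | *rp++;             /* reload state */
    for (int i = 0; i < 8; i++)
        if (dec_bit(&y, p1[i], &rp) != bits[i]) { puts("mismatch"); return 1; }
    puts("round-trip OK");
    return 0;
}
```

Interleaving four such states (each owning every fourth bit) removes the serial dependency between consecutive bits, which is the usual motivation for interleaved rANS over a single arithmetic coder.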
| Version | Change | Dickens | ELF | BMP | WAV | X-ray |
|---|---|---|---|---|---|---|
| v3 | Baseline | — | 872 KB | — | — | — |
| v4 | +34 sparse models | — | 779 KB | 1,734 KB | — | — |
| v4.1-4.3 | +Filters (BCJ, BMP, WAV) | — | 778 KB | 1,607 KB | 792 KB | — |
| v4.5 | +Word-aware SSE | 2,359 KB | 778 KB | 1,608 KB | 791 KB | 3,868 KB |
| v4.6 | +Raw Image Auto-Detect | 2,359 KB | 778 KB | 1,608 KB | 791 KB | 3,846 KB |
Extending:
The codebase is designed to be hackable. The sparse context models are the easiest way to improve compression. To add more, simply increment `NUM_SPARSE` and add a new byte-skip pattern in `process_block()`, as in the snippet below.
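As a purely hypothetical illustration (the real pattern table in compressor.c may be named and shaped differently):

```c
/* Hypothetical sketch: each entry lists the history offsets a sparse model
 * hashes.  Only the table name and layout are invented here. */
#define NUM_SPARSE 35                        /* was 34: one new pattern added */

static const int sparse_offsets[NUM_SPARSE][2] = {
    /* ... the 34 existing byte-skip patterns ... */
    { 3, 7 },                                /* new model: hash bytes at i-3 and i-7 */
};
```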
Limitations:
- Input size limited to 4 GB.
- Memory usage is ~600 MB (dominated by sparse models).
- Single-threaded; symmetric architecture means decompression is slow.
MIT License. See LICENSE for details.
Created by André Zaiats & Gemini (Google DeepMind) & Claude Opus 4.6 - February 2026