Skip to content

High-ratio specialized compressor built in a 48h Human+AI experiment. Uses context mixing to beat zpaq/xz on structured binaries & signals.

License

Notifications You must be signed in to change notification settings

andrezaiats/pmc-compressor

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 

Repository files navigation

PMC: The AI-Augmented Compression Experiment

High-Ratio Archiver targeting Structured Binaries, Signals & Raw Data

🚀 The Hook: Built in ~48 hours as a "Human + AI" pair-programming experiment, PMC competes with (and often beats) state-of-the-art tools like zpaq and xz on specific workloads like executables, medical imaging, and audio.

PMC (Predictive Mix Context) is a single-file lossless data compressor written in C. It uses a Context Mixing (PAQ-like) architecture heavily tuned for structured data detection. Unlike general-purpose compressors that treat everything as a stream of bytes, PMC attempts to "understand" the file structure—detecting tables in binaries, pixel grids in raw images, and wave patterns in audio—to apply specialized models.

⚡ Highlights

  • Cyborg Architecture: Combines 52 context models, including Sparse (for binaries), Order-N (for text), and Geometric (for images/audio).
  • Smart Detection: Automatically identifies ELF/PE binaries, BMPs, WAVs, and even headerless Raw Images (like MRI/X-Ray scans) to switch compression strategies dynamically.
  • Adaptive Gating: "Silences" models that are adding noise. If it detects text, it turns off binary models; if it detects code, it turns off text models.
  • 0 Dependencies: Pure C99. Just gcc -O2.

📊 Benchmarks & Results

We tested PMC against industry standards on the Silesia Corpus plus a custom set of specialized files (ELF binary, Raw X-Ray, WAV).

The Verdict:

  • 🏆 PMC Wins: On structured binaries (file, mozilla), medical images (x-ray), and audio (wav), PMC outperforms zpaq -5 and xz -9.
  • ⚔️ Competitive: On mixed data (ooffice, osdb), it trades blows with zpaq.
  • ⚠️ Limitations: On pure literature (dickens, webster), zpaq's deep language models still reign supreme.

1. Specialized Workloads (Where PMC Shines)

File Type File Original xz (LZMA2) zpaq -5 PMC v4.6 vs zpaq
ELF Binary file 6.5 MB 14.7% 12.2% 11.9% -0.3%
Audio WAV (48kHz) 2.8 MB 53.7% 42.2% 27.8% -14.4%
Bitmap BMP (24-bit) 3.2 MB 64.3% 53.1% 49.1% -4.0%
Medical X-Ray (16-bit) 8.4 MB 53.0% 43.3% 45.4% +2.1%*

*Note: PMC v4.6 auto-detects the raw X-Ray geometry without headers, a feature most compressors lack.

2. General Purpose (Silesia Corpus)

File Original gzip zstd xz zpaq -5 PMC
mozilla (Tar) 51.2 MB 37.2% 36.1% 26.4% 23.5% 25.1%
ooffice (Zip) 6.1 MB 50.3% 51.1% 39.4% 28.7% 30.6%
samba (Src) 21.6 MB 25.3% 23.0% 17.5% 14.1% 16.0%
dickens (Txt) 10.2 MB 38.0% 36.0% 27.8% 20.6% 23.2%
TOTAL 224 MB 33.1% 32.3% 24.0% 19.1% 20.5%

🛠 Usage

PMC is a research prototype. It prioritizes ratio over speed. Decompression is symmetric (slow).

Building

gcc -O2 -o compressor compressor.c -lm

Running

# Compress
./compressor c input_file output.pmc

# Decompress
./compressor d output.pmc recovered_file

# Verify
diff input_file recovered_file

🧠 Under the Hood: Technical Architecture

PMC uses a 4-stage pipeline to maximize probability prediction accuracy.

1. Preprocessing & Auto-Detection

Before compression, PMC analyzes the data block to apply reversible transforms:

  • BCJ x86 Filter: Converts relative E8 (CALL) addresses to absolute for better compression of x86 binaries. Applied before compression, reversed after decompression.
  • BMP Vertical Delta: For 24/32-bit BMP images, subtracts each pixel row from the row above.
  • WAV Per-Channel Delta: For 16-bit PCM WAV files, subtracts the previous sample of the same channel. Uses 16-bit arithmetic to handle carry across byte boundaries.
  • Raw Image Auto-Detection: For files without recognized headers (like x-ray), scans candidate row widths (128–16384) and measures vertical byte correlation. If detected, applies a 16-bit vertical delta filter.

2. Prediction Models (52 Total)

Instead of a single algorithm, PMC uses an ensemble of experts:

  • 9 Order-N Context Models: Orders 0, 1, 2, 3, 4, 6, 8, 12, 16. Each maps a hashed byte context to a state byte via finite-state counters.
  • 34 Sparse Context Models: The "secret sauce" for binaries. They look at non-adjacent bytes (e.g., byte[i-4] and byte[i-8]) to find table columns and struct fields. Each uses a 4M-entry tagged state table (~8 MB).
  • Linguistic Models:
    • Word Model: Rolling hash of current alphanumeric word.
    • Word Trigram: Hash of previous two words + current position.
    • Shadow Dictionary: Maintains unigram/bigram tables to predict the next word entirely.
  • Geometric Models:
    • Stride/Record Model: Auto-detects data periodicity per block (e.g., stride=24 for ELF symbol tables) and predicts from blk[i - stride].
  • Correction Models:
    • Match Model: Hash-chain match finder (depth 512 for text, 16 for binary).
    • ICM (Indirect Context Model): Maps Order-6 hashes to adaptive probability counters.
    • LZP (Lempel-Ziv Prediction): Order-4 context hash maps to a predicted next byte.

3. Mixing & Correction

  • Dual Adaptive Logistic Mixers: Mixer 1 uses bit-position and context; Mixer 2 uses bit context. Outputs are averaged in logit domain.
  • Text Block Gating: Blocks detected as text have their 34 sparse models, delta model, and stride model zeroed. This eliminates noise from binary-focused models.
  • 3-Stage APM (Adaptive Probability Map): A post-processing stage that "learns the mixer's errors".
  • Word-Aware APM: (Text only) Learns per-word correction biases — e.g., if the mixer systematically mispredicts a silent letter, this stage learns the correction.
  • SSE (Secondary Symbol Estimation): Final polish using quantized probability and bit history.

4. Entropy Coding

rANS (Asymmetric Numeral Systems): Uses a 4-way interleaved binary rANS coder for high throughput and compression density close to Arithmetic Coding.


🔮 Version History & Extending

Version Change Dickens ELF BMP WAV X-ray
v3 Baseline 872 KB
v4 +34 sparse models 779 KB 1,734 KB
v4.1-4.3 +Filters (BCJ, BMP, WAV) 778 KB 1,607 KB 792 KB
v4.5 +Word-aware SSE 2,359 KB 778 KB 1,608 KB 791 KB 3,868 KB
v4.6 +Raw Image Auto-Detect 2,359 KB 778 KB 1,608 KB 791 KB 3,846 KB

Extending: The codebase is designed to be hackable. The sparse context models are the easiest way to improve compression. To add more, simply increment NUM_SPARSE and add a new byte-skip pattern in process_block().

Limitations:

  • Input size limited to 4 GB.
  • Memory usage is ~600 MB (dominated by sparse models).
  • Single-threaded; symmetric architecture means decompression is slow.

License

MIT License. See LICENSE for details.


Created by André Zaiats & Gemini (Google DeepMind) & Claude Opus 4.6 - February 2026

About

High-ratio specialized compressor built in a 48h Human+AI experiment. Uses context mixing to beat zpaq/xz on structured binaries & signals.

Topics

Resources

License

Stars

Watchers

Forks

Languages