PMC: The AI-Augmented Compression Experiment

High-Ratio Archiver targeting Structured Binaries, Signals & Raw Data

🚀 The Hook: Built in ~48 hours as a "Human + AI" pair-programming experiment, PMC competes with (and often beats) state-of-the-art tools like zpaq and xz on specific workloads like executables, medical imaging, and audio.

PMC (Predictive Mix Context) is a single-file lossless data compressor written in C. It uses a Context Mixing (PAQ-like) architecture heavily tuned for structured data detection. Unlike general-purpose compressors that treat everything as a stream of bytes, PMC attempts to "understand" the file structure—detecting tables in binaries, pixel grids in raw images, and wave patterns in audio—to apply specialized models.

⚡ Highlights

Cyborg Architecture: Combines 52 context models, including Sparse (for binaries), Order-N (for text), and Geometric (for images/audio).
Smart Detection: Automatically identifies ELF/PE binaries, BMPs, WAVs, and even headerless Raw Images (like MRI/X-Ray scans) to switch compression strategies dynamically.
Adaptive Gating: "Silences" models that are adding noise. If it detects text, it turns off binary models; if it detects code, it turns off text models.
0 Dependencies: Pure C99. Just gcc -O2.

📊 Benchmarks & Results

We tested PMC against industry standards on the Silesia Corpus plus a custom set of specialized files (ELF binary, Raw X-Ray, WAV).

The Verdict:

🏆 PMC Wins: On structured binaries (file, mozilla), medical images (x-ray), and audio (wav), PMC outperforms zpaq -5 and xz -9.
⚔️ Competitive: On mixed data (ooffice, osdb), it trades blows with zpaq.
⚠️ Limitations: On pure literature (dickens, webster), zpaq's deep language models still reign supreme.

1. Specialized Workloads (Where PMC Shines)

File Type	File	Original	xz (LZMA2)	zpaq -5	PMC v4.6	vs zpaq
ELF Binary	`file`	6.5 MB	14.7%	12.2%	11.9%	-0.3%
Audio	`WAV (48kHz)`	2.8 MB	53.7%	42.2%	27.8%	-14.4%
Bitmap	`BMP (24-bit)`	3.2 MB	64.3%	53.1%	49.1%	-4.0%
Medical	`X-Ray (16-bit)`	8.4 MB	53.0%	43.3%	45.4%	+2.1%*

*Note: PMC v4.6 auto-detects the raw X-Ray geometry without headers, a feature most compressors lack.

2. General Purpose (Silesia Corpus)

File	Original	gzip	zstd	xz	zpaq -5	PMC
`mozilla` (Tar)	51.2 MB	37.2%	36.1%	26.4%	23.5%	25.1%
`ooffice` (Zip)	6.1 MB	50.3%	51.1%	39.4%	28.7%	30.6%
`samba` (Src)	21.6 MB	25.3%	23.0%	17.5%	14.1%	16.0%
`dickens` (Txt)	10.2 MB	38.0%	36.0%	27.8%	20.6%	23.2%
TOTAL	224 MB	33.1%	32.3%	24.0%	19.1%	20.5%

🛠 Usage

PMC is a research prototype. It prioritizes ratio over speed. Decompression is symmetric (slow).

Building

gcc -O2 -o compressor compressor.c -lm

Running

# Compress
./compressor c input_file output.pmc

# Decompress
./compressor d output.pmc recovered_file

# Verify
diff input_file recovered_file

🧠 Under the Hood: Technical Architecture

PMC uses a 4-stage pipeline to maximize probability prediction accuracy.

1. Preprocessing & Auto-Detection

Before compression, PMC analyzes the data block to apply reversible transforms:

BCJ x86 Filter: Converts relative E8 (CALL) addresses to absolute for better compression of x86 binaries. Applied before compression, reversed after decompression.
BMP Vertical Delta: For 24/32-bit BMP images, subtracts each pixel row from the row above.
WAV Per-Channel Delta: For 16-bit PCM WAV files, subtracts the previous sample of the same channel. Uses 16-bit arithmetic to handle carry across byte boundaries.
Raw Image Auto-Detection: For files without recognized headers (like x-ray), scans candidate row widths (128–16384) and measures vertical byte correlation. If detected, applies a 16-bit vertical delta filter.

2. Prediction Models (52 Total)

Instead of a single algorithm, PMC uses an ensemble of experts:

9 Order-N Context Models: Orders 0, 1, 2, 3, 4, 6, 8, 12, 16. Each maps a hashed byte context to a state byte via finite-state counters.
34 Sparse Context Models: The "secret sauce" for binaries. They look at non-adjacent bytes (e.g., byte[i-4] and byte[i-8]) to find table columns and struct fields. Each uses a 4M-entry tagged state table (~8 MB).
Linguistic Models:
- Word Model: Rolling hash of current alphanumeric word.
- Word Trigram: Hash of previous two words + current position.
- Shadow Dictionary: Maintains unigram/bigram tables to predict the next word entirely.
Geometric Models:
- Stride/Record Model: Auto-detects data periodicity per block (e.g., stride=24 for ELF symbol tables) and predicts from blk[i - stride].
Correction Models:
- Match Model: Hash-chain match finder (depth 512 for text, 16 for binary).
- ICM (Indirect Context Model): Maps Order-6 hashes to adaptive probability counters.
- LZP (Lempel-Ziv Prediction): Order-4 context hash maps to a predicted next byte.

3. Mixing & Correction

Dual Adaptive Logistic Mixers: Mixer 1 uses bit-position and context; Mixer 2 uses bit context. Outputs are averaged in logit domain.
Text Block Gating: Blocks detected as text have their 34 sparse models, delta model, and stride model zeroed. This eliminates noise from binary-focused models.
3-Stage APM (Adaptive Probability Map): A post-processing stage that "learns the mixer's errors".
Word-Aware APM: (Text only) Learns per-word correction biases — e.g., if the mixer systematically mispredicts a silent letter, this stage learns the correction.
SSE (Secondary Symbol Estimation): Final polish using quantized probability and bit history.

4. Entropy Coding

rANS (Asymmetric Numeral Systems): Uses a 4-way interleaved binary rANS coder for high throughput and compression density close to Arithmetic Coding.

🔮 Version History & Extending

Version	Change	Dickens	ELF	BMP	WAV	X-ray
v3	Baseline	—	872 KB	—	—	—
v4	+34 sparse models	—	779 KB	1,734 KB	—	—
v4.1-4.3	+Filters (BCJ, BMP, WAV)	—	778 KB	1,607 KB	792 KB	—
v4.5	+Word-aware SSE	2,359 KB	778 KB	1,608 KB	791 KB	3,868 KB
v4.6	+Raw Image Auto-Detect	2,359 KB	778 KB	1,608 KB	791 KB	3,846 KB

Extending: The codebase is designed to be hackable. The sparse context models are the easiest way to improve compression. To add more, simply increment NUM_SPARSE and add a new byte-skip pattern in process_block().

Limitations:

Input size limited to 4 GB.
Memory usage is ~600 MB (dominated by sparse models).
Single-threaded; symmetric architecture means decompression is slow.

License

MIT License. See LICENSE for details.

Created by André Zaiats & Gemini (Google DeepMind) & Claude Opus 4.6 - February 2026

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
LICENSE		LICENSE
README.md		README.md
benchmark-out.txt		benchmark-out.txt
compressor.c		compressor.c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PMC: The AI-Augmented Compression Experiment

⚡ Highlights

📊 Benchmarks & Results

1. Specialized Workloads (Where PMC Shines)

2. General Purpose (Silesia Corpus)

🛠 Usage

🧠 Under the Hood: Technical Architecture

1. Preprocessing & Auto-Detection

2. Prediction Models (52 Total)

3. Mixing & Correction

4. Entropy Coding

🔮 Version History & Extending

License

About

Uh oh!

Languages

License

andrezaiats/pmc-compressor

Folders and files

Latest commit

History

Repository files navigation

PMC: The AI-Augmented Compression Experiment

⚡ Highlights

📊 Benchmarks & Results

1. Specialized Workloads (Where PMC Shines)

2. General Purpose (Silesia Corpus)

🛠 Usage

🧠 Under the Hood: Technical Architecture

1. Preprocessing & Auto-Detection

2. Prediction Models (52 Total)

3. Mixing & Correction

4. Entropy Coding

🔮 Version History & Extending

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Languages