
Metadata Compression System

📌 Navigation: Home | Quality System | ONNX Deep Dive | Features

Version: 8.48.0+ | Status: ✅ Production Ready | Last Updated: December 7, 2025


Overview

The Metadata Compression System (v8.48.0) works around Cloudflare D1's 10KB metadata limit through CSV-based field encoding, achieving a 78% size reduction while maintaining a 100% sync success rate.

Problem Solved

Before v8.48.0:

  • Quality/consolidation metadata exceeded Cloudflare D1 10KB limit
  • 278 sync failures out of 4,478 operations (≈6.2% failure rate)
  • 400 Bad Request errors from Cloudflare API
  • Operations stuck in retry queue indefinitely

After v8.48.0:

  • 78% metadata size reduction (732B → 159B typical)
  • 100% sync success rate (0 failures)
  • Transparent compression/decompression
  • <1ms overhead per operation

Architecture

3-Phase Roadmap

| Phase | Technology | Compression | Status | Target Use Case |
|-------|------------|-------------|--------|-----------------|
| 1 | CSV Encoding | 78% | ✅ Complete | All users (default) |
| 2 | Binary (struct/msgpack) | 85-90% | 🔄 Available | High-volume users |
| 3 | Reference Deduplication | 90-95% | 🔄 Available | Multi-device sync |

Current: Phase 1 (CSV) handles 99%+ of use cases.

CSV Encoding Strategy

Targeted Fields (quality metadata):

// Before compression (732B)
{
  "quality_score": 0.85,
  "quality_provider": "onnx_local",
  "ai_scores": [
    {"provider": "onnx_local", "score": 0.87, "timestamp": "2025-12-06T10:30:00Z"},
    {"provider": "onnx_local", "score": 0.84, "timestamp": "2025-12-05T14:20:00Z"},
    {"provider": "onnx_local", "score": 0.83, "timestamp": "2025-12-04T09:15:00Z"}
  ],
  "quality_components": {
    "specificity": 0.9,
    "actionability": 0.8,
    "recency_bonus": 0.1,
    "context_relevance": 0.85
  },
  "relevance_score": 0.72,
  "relevance_calculated_at": "2025-12-06T11:00:00Z",
  "decay_factor": 0.95,
  "connection_boost": 1.2,
  "association_boost": true,
  "quality_boost_applied": true,
  "quality_boost_date": "2025-12-06T11:00:00Z",
  "quality_boost_reason": "association_connections",
  "quality_boost_connection_count": 8
}

// After CSV compression (159B)
{
  "_q": "0.85,ox,0.87|0.84|0.83,0.72,2025-12-06T11:00:00Z,0.95,1.2,1,1,2025-12-06T11:00:00Z,ac,8"
}

// 78.3% reduction!

CSV Schema (_q field):

qs,qp,as_scores,rs,rca,df,cb,ab,qba,qbd,qbr,qbcc,oqbb

Legend:

  • qs: quality_score
  • qp: quality_provider (ox=onnx_local, gp=groq, gm=gemini)
  • as_scores: ai_scores history (pipe-separated, max 3 recent)
  • rs: relevance_score
  • rca: relevance_calculated_at
  • df: decay_factor
  • cb: connection_boost
  • ab: association_boost (0/1)
  • qba: quality_boost_applied (0/1)
  • qbd: quality_boost_date
  • qbr: quality_boost_reason (ac=association_connections)
  • qbcc: quality_boost_connection_count
  • oqbb: original_quality_before_boost
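
To see how the positions line up, the 13 schema codes can be zipped against the compressed example above. This is purely an illustration of the layout, not code from the service:

# Illustration only: align each schema code with its value in the example _q string.
SCHEMA = "qs,qp,as_scores,rs,rca,df,cb,ab,qba,qbd,qbr,qbcc,oqbb".split(",")
example = "0.85,ox,0.87|0.84|0.83,0.72,2025-12-06T11:00:00Z,0.95,1.2,1,1,2025-12-06T11:00:00Z,ac,8,"

for code, value in zip(SCHEMA, example.split(",")):
    print(code, "=", value or "(empty)")
# qs = 0.85
# qp = ox
# ...
# qbcc = 8
# oqbb = (empty)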

Implementation

Location: src/mcp_memory_service/quality/metadata_codec.py

from typing import Any, Dict

# PROVIDER_MAP and REASON_MAP are module-level code tables (see Provider Code Mapping below).

def encode_quality_metadata(metadata: Dict[str, Any]) -> str:
    """Encode quality metadata to CSV format."""
    
    # Extract fields
    qs = metadata.get('quality_score', '')
    qp = PROVIDER_MAP.get(metadata.get('quality_provider'), '')  # ox, gp, gm
    
    # ai_scores: Take 3 most recent, join with pipe
    ai_scores = metadata.get('ai_scores', [])[-3:]
    as_scores = '|'.join(str(s.get('score', '')) for s in ai_scores)
    
    # Consolidation metadata
    rs = metadata.get('relevance_score', '')
    rca = metadata.get('relevance_calculated_at', '')
    df = metadata.get('decay_factor', '')
    cb = metadata.get('connection_boost', '')
    
    # Association boost metadata
    ab = '1' if metadata.get('association_boost') else '0'
    qba = '1' if metadata.get('quality_boost_applied') else '0'
    qbd = metadata.get('quality_boost_date', '')
    qbr = REASON_MAP.get(metadata.get('quality_boost_reason'), '')  # ac
    qbcc = metadata.get('quality_boost_connection_count', '')
    oqbb = metadata.get('original_quality_before_boost', '')
    
    # CSV encode
    return f"{qs},{qp},{as_scores},{rs},{rca},{df},{cb},{ab},{qba},{qbd},{qbr},{qbcc},{oqbb}"


def decode_quality_metadata(csv_string: str) -> Dict[str, Any]:
    """Decode CSV format back to full metadata dict."""
    
    parts = csv_string.split(',')
    
    # Reverse PROVIDER_MAP (ox → onnx_local)
    provider_code = parts[1]
    provider = next((k for k, v in PROVIDER_MAP.items() if v == provider_code), 'implicit')
    
    # ai_scores: Split pipe, reconstruct history
    as_scores = parts[2].split('|') if parts[2] else []
    ai_scores = [{'score': float(s)} for s in as_scores if s]
    
    # Reconstruct full metadata
    return {
        'quality_score': float(parts[0]) if parts[0] else None,
        'quality_provider': provider,
        'ai_scores': ai_scores,
        'relevance_score': float(parts[3]) if parts[3] else None,
        'relevance_calculated_at': parts[4] if parts[4] else None,
        'decay_factor': float(parts[5]) if parts[5] else None,
        'connection_boost': float(parts[6]) if parts[6] else None,
        'association_boost': parts[7] == '1',
        'quality_boost_applied': parts[8] == '1',
        'quality_boost_date': parts[9] if parts[9] else None,
        'quality_boost_reason': REASON_REVERSE.get(parts[10]),
        'quality_boost_connection_count': int(parts[11]) if parts[11] else None,
        'original_quality_before_boost': float(parts[12]) if parts[12] else None
    }
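
The two functions above rely on the module-level PROVIDER_MAP and REASON_MAP tables (shown under Provider Code Mapping below) plus a reverse table for boost reasons. A minimal round-trip check, with REASON_REVERSE assumed to be a plain dict inversion since it is not shown on this page, could look like:

# REASON_REVERSE is assumed here; only PROVIDER_MAP / REASON_MAP appear on this page.
REASON_REVERSE = {code: reason for reason, code in REASON_MAP.items()}

original = {
    "quality_score": 0.85,
    "quality_provider": "onnx_local",
    "ai_scores": [{"provider": "onnx_local", "score": 0.87}],
    "relevance_score": 0.72,
    "association_boost": True,
    "quality_boost_applied": True,
    "quality_boost_reason": "association_connections",
    "quality_boost_connection_count": 8,
}

encoded = encode_quality_metadata(original)
# -> "0.85,ox,0.87,0.72,,,,1,1,,ac,8,"

decoded = decode_quality_metadata(encoded)
assert decoded["quality_provider"] == "onnx_local"
assert decoded["quality_boost_reason"] == "association_connections"
assert decoded["quality_boost_connection_count"] == 8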

Integration Points

1. Hybrid Backend Compression (src/mcp_memory_service/storage/hybrid.py:77-119):

from ..quality.metadata_codec import compress_metadata_for_sync

def _normalize_metadata_for_cloudflare(self, updates: Dict) -> Dict:
    compressed = {}
    
    # Compress quality/consolidation metadata
    if "metadata" in updates:
        compressed["metadata"] = compress_metadata_for_sync(updates["metadata"])
    
    return compressed

2. Cloudflare Decompression (src/mcp_memory_service/storage/cloudflare.py - 4 locations):

from ..quality.metadata_codec import decompress_metadata_from_sync

# Memory construction (lines 606-612, 741-747, 830-836, 1474-1480)
metadata = {}
if row.get("metadata_json"):
    metadata = json.loads(row["metadata_json"])
    metadata = decompress_metadata_from_sync(metadata)  # Automatic decompression
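
Decompression runs while rows are rebuilt into Memory objects, so code above the storage layer only ever sees the expanded field names; this is what makes the compression transparent to callers.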

3. Metadata Size Validation (src/mcp_memory_service/storage/hybrid.py:547-559):

if 'metadata' in operation.updates:
    import json
    metadata_json = json.dumps(operation.updates['metadata'])
    metadata_size_kb = len(metadata_json.encode('utf-8')) / 1024
    
    if metadata_size_kb > 9.5:  # 9.5KB safety margin (10KB D1 limit)
        logger.warning(f"Skipping Cloudflare sync for {operation.content_hash[:16]}: "
                      f"metadata too large ({metadata_size_kb:.2f}KB > 9.5KB limit)")
        self.sync_stats['operations_failed'] += 1
        return  # Skip permanently (don't retry)
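
Oversized operations are skipped permanently rather than re-queued because they would fail on every retry; this avoids the pre-v8.48.0 failure mode in which such operations sat in the retry queue indefinitely.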

Provider Code Mapping

Compression Strategy

Problem: Provider names are verbose ("onnx_local" = 10 chars, "groq_llama3_70b" = 15 chars)
Solution: 2-character codes

Mapping:

PROVIDER_MAP = {
    'onnx_local': 'ox',         # 10 → 2 chars (80% reduction)
    'groq_llama3_70b': 'gp',    # 15 → 2 chars (86.7% reduction)
    'gemini_flash': 'gm',       # 12 → 2 chars (83.3% reduction)
    'implicit': 'im',           # 8 → 2 chars (75% reduction)
}

REASON_MAP = {
    'association_connections': 'ac'  # 23 → 2 chars (91.3% reduction)
}

Impact: 75-91% reduction in the provider and boost-reason fields, depending on the value
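
For decoding, the reverse lookups can be precomputed once instead of scanning PROVIDER_MAP per call (as decode_quality_metadata does above); a small sketch, with names assumed for illustration:

# Assumed one-time inversions for the decode path; the names are illustrative.
PROVIDER_REVERSE = {code: name for name, code in PROVIDER_MAP.items()}

assert PROVIDER_REVERSE["ox"] == "onnx_local"
assert PROVIDER_REVERSE.get("zz", "implicit") == "implicit"  # unknown codes fall back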


Quality Metadata Optimizations

ai_scores History Limit

Before: Unlimited history (could grow to 10+ entries)
After: Max 3 most recent entries

Rationale:

  • Quality score is current value (single float)
  • Historical ai_scores used for trend analysis only
  • 3 samples sufficient for detecting score drift
  • Older entries don't affect current quality decisions

Example:

# Before (180 bytes)
"ai_scores": [
    {"provider": "onnx_local", "score": 0.87, "timestamp": "2025-12-06T10:30:00Z"},
    {"provider": "onnx_local", "score": 0.84, "timestamp": "2025-12-05T14:20:00Z"},
    {"provider": "onnx_local", "score": 0.83, "timestamp": "2025-12-04T09:15:00Z"},
    {"provider": "onnx_local", "score": 0.85, "timestamp": "2025-12-03T16:45:00Z"},
    ... (6 more entries)
]

# After (60 bytes)
"ai_scores": [
    {"provider": "onnx_local", "score": 0.87, "timestamp": "2025-12-06T10:30:00Z"},
    {"provider": "onnx_local", "score": 0.84, "timestamp": "2025-12-05T14:20:00Z"},
    {"provider": "onnx_local", "score": 0.83, "timestamp": "2025-12-04T09:15:00Z"}
]
# 67% reduction
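
For illustration, a drift check over the retained samples might look like the sketch below; the helper name and the 0.05 threshold are made-up examples, not values from the codebase.

# Illustrative only: flag drift across the (up to) three retained samples.
# Assumes entries are appended oldest-to-newest, matching the [-3:] slice
# used by encode_quality_metadata above.
def has_score_drift(ai_scores, threshold=0.05):
    scores = [entry["score"] for entry in ai_scores[-3:]]
    return len(scores) >= 2 and abs(scores[-1] - scores[0]) > threshold

# With the retained scores above (0.87, 0.84, 0.83): |0.83 - 0.87| = 0.04 -> no drift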

quality_components Removal

Field: quality_components (debug-only metadata with 4 subfields)
Removed from sync: Yes (local-only, not synced to Cloudflare)
Rationale: Reconstructible from quality_score + formula

Before:

"quality_components": {
    "specificity": 0.9,
    "actionability": 0.8,
    "recency_bonus": 0.1,
    "context_relevance": 0.85
}
// 120 bytes

After: Omitted from Cloudflare sync (0 bytes)

Cloudflare-Specific Field Suppression

Fields removed during sync (local-only):

  • metadata_source: "mcp_server" (debug tracking)
  • last_quality_check: ISO timestamp (ephemeral)
  • quality_components: Dict (debug-only, as above)

Impact: ~150 bytes saved per memory
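
Taken together, the hybrid-backend helper compress_metadata_for_sync plausibly combines these steps: strip the local-only fields listed above and fold the quality fields into _q via encode_quality_metadata. The sketch below illustrates that documented behaviour; it is not the exact code from hybrid.py.

# Sketch of the documented sync-time behaviour; field lists mirror this page.
LOCAL_ONLY_FIELDS = ("quality_components", "metadata_source", "last_quality_check")
QUALITY_FIELDS = (
    "quality_score", "quality_provider", "ai_scores", "relevance_score",
    "relevance_calculated_at", "decay_factor", "connection_boost",
    "association_boost", "quality_boost_applied", "quality_boost_date",
    "quality_boost_reason", "quality_boost_connection_count",
    "original_quality_before_boost",
)

def compress_metadata_for_sync(metadata: dict) -> dict:
    """Drop local-only fields and fold quality fields into the _q CSV string."""
    compressed = {k: v for k, v in metadata.items()
                  if k not in LOCAL_ONLY_FIELDS and k not in QUALITY_FIELDS}
    if any(field in metadata for field in QUALITY_FIELDS):
        compressed["_q"] = encode_quality_metadata(metadata)
    return compressed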


Verification

Script: verify_compression.sh

#!/bin/bash
echo "=== Metadata Compression Verification ==="
echo

echo "1. Sync Status:"
curl -s http://127.0.0.1:8000/api/sync/status | python3 -c "import sys, json; d=json.load(sys.stdin); print(f\"  Failed: {d['operations_failed']} (should be 0)\")"

echo
echo "2. Quality Distribution:"
curl -s http://127.0.0.1:8000/api/quality/distribution | python3 -c "import sys, json; d=json.load(sys.stdin); print(f\"  ONNX scored: {d['provider_breakdown'].get('onnx_local', 0)}\")"

echo
echo "3. Recent Logs (compression activity):"
tail -20 /tmp/mcp-http-server.log | grep -i "compress\|too large" || echo "  No compression warnings (good!)"

echo
echo "✅ Verification complete!"

Expected Output:

=== Metadata Compression Verification ===

1. Sync Status:
  Failed: 0 (should be 0)

2. Quality Distribution:
  ONNX scored: 3750

3. Recent Logs (compression activity):
  No compression warnings (good!)

✅ Verification complete!

Performance Impact

Compression Overhead

| Operation | Latency | Notes |
|-----------|---------|-------|
| CSV Encoding | <1ms | Per metadata dict |
| CSV Decoding | <1ms | Per metadata dict |
| Validation Check | <1ms | JSON size check |
| Total Overhead | <3ms | Negligible |

Conclusion: Compression adds <0.3% overhead to sync operations.

Sync Success Rate

Before v8.48.0:

  • Operations: 4,478
  • Failed: 278 (≈6.2%)
  • Root cause: Metadata >10KB (Cloudflare D1 limit)

After v8.48.0:

  • Operations: 4,478
  • Failed: 0 (0%)
  • All metadata: <9.5KB (safety margin)

Result: 100% sync success rate ✅


Phase 2 & 3 Roadmap

Phase 2: Binary Encoding (85-90% reduction)

Technology: struct (Python) or msgpack
Target: High-volume users (>10K memories)
Compression:

  • Float64 → Float32 (4 bytes vs 8 bytes)
  • ISO timestamps → Unix epoch int32 (4 bytes vs 24 bytes)
  • Provider codes → uint8 (1 byte vs 2 chars)

Expected Reduction: 159B → 24B typical (85% reduction)
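
A minimal Phase 2 sketch using Python's struct module is shown below. The field layout, format string, and PROVIDER_IDS table are illustrative assumptions about what a binary encoding could look like, not the shipped Phase 2 implementation:

import struct
from datetime import datetime

# Illustrative provider IDs; the real Phase 2 assignment is not documented here.
PROVIDER_IDS = {"onnx_local": 1, "groq_llama3_70b": 2, "gemini_flash": 3, "implicit": 4}

def pack_quality(meta: dict) -> bytes:
    """Pack core quality fields into a fixed-width binary record (sketch)."""
    iso = meta["relevance_calculated_at"].replace("Z", "+00:00")
    epoch = int(datetime.fromisoformat(iso).timestamp())
    flags = (1 if meta.get("association_boost") else 0) | \
            ((1 if meta.get("quality_boost_applied") else 0) << 1)
    # < little-endian: 4 x float32, 32-bit epoch, provider id, flags,
    # connection count -> 23 bytes, in the ballpark of the ~24B Phase 2 target.
    return struct.pack(
        "<ffffIBBB",
        meta["quality_score"], meta["relevance_score"],
        meta["decay_factor"], meta["connection_boost"],
        epoch, PROVIDER_IDS.get(meta.get("quality_provider"), 0),
        flags, meta.get("quality_boost_connection_count", 0),
    )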

Phase 3: Reference Deduplication (90-95% reduction)

Strategy: Shared value dictionary
Target: Multi-device sync scenarios
Example:

# Shared dictionary (sent once per sync batch)
dict_id = 42
shared_values = {
    1: "association_connections",
    2: "onnx_local",
    3: 0.85  # Common quality score
}

# Memory metadata (references dict)
"_q": "3,2,..."  # Reference shared values by ID

Expected Reduction: 24B → 12B typical (50% further reduction from Phase 2)
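
A compact sketch of how a per-batch shared dictionary could be built and applied; the function names and the restriction to scalar values are assumptions for illustration:

# Illustrative Phase 3 sketch: assign integer IDs to repeated scalar values in a
# sync batch, then store references instead of the values themselves.
def build_shared_dict(batch):
    id_by_value = {}
    for meta in batch:
        for value in meta.values():
            if isinstance(value, (str, float)) and value not in id_by_value:
                id_by_value[value] = len(id_by_value) + 1
    return id_by_value

def deduplicate(meta, id_by_value):
    return {key: id_by_value.get(value, value) for key, value in meta.items()}

batch = [{"quality_provider": "onnx_local", "quality_score": 0.85},
         {"quality_provider": "onnx_local", "quality_score": 0.85}]
shared = build_shared_dict(batch)       # {"onnx_local": 1, 0.85: 2}
print(deduplicate(batch[0], shared))    # {"quality_provider": 1, "quality_score": 2}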


Changelog

v8.48.0 (2025-12-07):

  • CSV-based metadata compression (Phase 1)
  • 78% size reduction (732B → 159B typical)
  • 100% sync success rate (0 failures)
  • Metadata size validation (<9.5KB threshold)
  • Provider code mapping (70% reduction in provider field)
  • ai_scores history limit (10 → 3 entries)
  • quality_components removal from sync
  • Cloudflare-specific field suppression
  • Transparent compression/decompression
  • verify_compression.sh validation script

Need help? Open an issue at https://github.com/doobidoo/mcp-memory-service/issues
