The core library compiles, all 37 tests pass, and the three-phase selection pipeline (score → order → budget) is fully operational. The document model, cache builder, and selection engine match the normative specs. Dead code and duplicates have been cleaned up. All spec compliance gaps have been resolved (either by code fix or spec clarification). What remains is duplicate test consolidation, edge case coverage, and a standalone cache verification function.
- `Document` struct with `id`, `version`, `source`, `content`, `metadata`
- Single constructor `Document::ingest()` enforcing all invariants
- UTF-8 validation at ingestion (rejects invalid bytes)
- Content-addressed versioning: `sha256:<hex>` from content bytes only
- Metadata excluded from version computation
- No newline normalization (CRLF vs LF produce different versions)
- `DocumentId::from_path()` with normalization: lowercase, forward slashes, no `./` prefix
- `DocumentVersion::from_content()` using SHA-256
- `Metadata` backed by `BTreeMap` (sorted iteration for determinism)
- `MetadataValue` supports `String` and `Number(i64)` only (flat, no nesting)
- `Metadata::merge()` with caller-provided precedence
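To make the `DocumentId::from_path()` rules above concrete, here is a minimal std-only sketch of the normalization steps (lowercase, forward slashes, strip a leading `./`). The function name is illustrative; the real `from_path()` also returns a `DocumentIdError` for paths outside the root.

```rust
// Illustrative sketch of DocumentId path normalization (assumed helper name):
// backslashes -> forward slashes, lowercase, strip a leading "./".
fn normalize_path(path: &str) -> String {
    let mut s = path.replace('\\', "/").to_lowercase();
    if let Some(stripped) = s.strip_prefix("./") {
        s = stripped.to_string();
    }
    s
}

fn main() {
    // Mixed separators and case collapse to one canonical ID.
    assert_eq!(normalize_path("./Docs\\Guide.MD"), "docs/guide.md");
}
```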
- `CacheBuilder::build()` — single-threaded, non-reentrant cache construction
- Documents sorted by ID before processing (determinism)
- Duplicate document ID detection after sorting (adjacent-pair check, fatal error)
- Filename: first 12 chars of SHA-256 hash (without `sha256:` prefix)
- Filename collision detection (fatal error)
- Cache version: `sha256(config_json + sorted("doc_id:doc_version"))` — `created_at` excluded
- Atomic writes: temp dir → rename (all-or-nothing)
- Stale temp dir cleanup from previous crashed runs
- `manifest.json` (pretty-printed, sorted documents)
- `index.json` (pretty-printed, `BTreeMap` ensures sorted keys, `#[serde(transparent)]` for flat map format)
- `documents/{hash}.json` per document
- `ContextCache` — thin read-only runtime wrapper
- `load_documents()` — loads from manifest entries, verifies each ID matches, verifies each version (recomputes content hash against manifest)
- Rejects build if output directory already exists
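The sort-then-adjacent-pair duplicate check above relies on a simple property: after sorting, equal IDs must be neighbors, so one linear pass finds any duplicate. A std-only sketch (the function shape is an assumption, not the real `CacheBuilder` API):

```rust
// Sort IDs for deterministic processing order, then scan adjacent pairs.
// Returns the first duplicate ID, if any — the builder treats this as fatal.
fn find_duplicate_id(ids: &mut Vec<String>) -> Option<String> {
    ids.sort();
    ids.windows(2)
        .find(|pair| pair[0] == pair[1])
        .map(|pair| pair[0].clone())
}

fn main() {
    let mut ids = vec!["b.md".to_string(), "a.md".to_string(), "b.md".to_string()];
    assert_eq!(find_duplicate_id(&mut ids), Some("b.md".to_string()));
}
```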
- Three-phase pipeline: score → order → budget
- `TermFrequencyScorer` — naive term frequency: `term_matches / total_words`
- Query normalization: lowercase + whitespace split
- Scoring is pure (no side effects, no randomness)
- Sort: score descending, document ID ascending (deterministic tie-break)
- `debug_assert!` verifying the sorted order invariant
- `ApproxTokenCounter` — `ceil(len / 4)` approximation
- Greedy budget filling: documents added in order, never truncated
- Zero budget → empty selection
- Score 0.0 documents MAY be selected (no score-based exclusion in v0)
- `Scorer` and `TokenCounter` traits for future extensibility
- `SelectionResult` output with `documents` + `selection` metadata
- `SelectionWhy` explainability: `query_terms`, `term_matches`, `total_words`
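The behaviors listed above can be condensed into one self-contained sketch of the three phases. All names here are illustrative stand-ins (the real entry point is `ContextSelector` with the `Scorer`/`TokenCounter` traits); the logic mirrors the stated rules: naive term frequency, score-descending/ID-ascending ordering, `ceil(len / 4)` tokens, greedy filling with no truncation.

```rust
// Phase 1 helper: naive term frequency = matching words / total words.
fn score(query_terms: &[&str], content: &str) -> f64 {
    let words: Vec<String> = content.to_lowercase().split_whitespace().map(String::from).collect();
    if words.is_empty() {
        return 0.0;
    }
    let matches = words.iter().filter(|w| query_terms.contains(&w.as_str())).count();
    matches as f64 / words.len() as f64
}

// Token approximation: ceil(len / 4), as ApproxTokenCounter does.
fn approx_tokens(content: &str) -> usize {
    (content.len() + 3) / 4
}

// score -> order -> budget over (id, content) pairs.
fn select<'a>(docs: &[(&'a str, &'a str)], query: &str, budget: usize) -> Vec<&'a str> {
    let terms: Vec<String> = query.to_lowercase().split_whitespace().map(String::from).collect();
    let term_refs: Vec<&str> = terms.iter().map(String::as_str).collect();
    // Phase 1: score every document (pure, no side effects).
    let mut scored: Vec<(&str, f64, usize)> = docs
        .iter()
        .map(|(id, content)| (*id, score(&term_refs, content), approx_tokens(content)))
        .collect();
    // Phase 2: score descending, document ID ascending as the tie-break.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap().then(a.0.cmp(b.0)));
    // Phase 3: greedy fill — skip documents that don't fit, never truncate.
    let mut used = 0;
    let mut selected = Vec::new();
    for (id, _score, tokens) in scored {
        if used + tokens <= budget {
            used += tokens;
            selected.push(id);
        }
    }
    selected
}

fn main() {
    let docs = [("a.md", "rust cache"), ("b.md", "rust rust cache cache")];
    // Zero budget -> empty selection, per the rule above.
    assert!(select(&docs, "rust", 0).is_empty());
    // Equal scores (0.5 each) fall back to ID-ascending order.
    assert_eq!(select(&docs, "rust", 9), vec!["a.md", "b.md"]);
}
```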
- `Query` — normalized query with `raw` + `terms`
- `SelectedDocument` — final output document with score, tokens, why
- `SelectionMetadata` — query, budget, tokens_used, counts
- `SelectionResult` — top-level result container
- `ScoredDocument` — internal reference-based scored document (avoids premature cloning)
- `ScoreDetails` — internal scoring components
- `SelectionError` — `InvalidBudget`, `CacheError`
- `DocumentId`, `DocumentVersion` — identity and versioning types
- `DocumentError` — `InvalidUtf8`
- `DocumentIdError` — `OutsideRoot`, `InvalidUtf8`
- `CacheBuildError` — `Io`, `Serialization`, `OutputExists`, `FilenameCollision`, `InvalidVersionFormat`, `DuplicateDocumentId`
- `SelectionError` — `InvalidBudget`, `CacheError`
- Removed duplicate `selection/selector.rs` (inlined copy of `selection/mod.rs` logic)
- Removed duplicate `selection/types.rs` (copy of `types/context_bundle.rs`)
- Removed duplicate `document/id.rs` and `document/version.rs` (copies of `types/identifiers.rs`)
- Removed 6 legacy re-export shims (`cache/builder.rs`, `cache/manifest.rs`, `cache/config.rs`, `cache/index.rs`, `selection/scorer.rs`, `selection/tokenizer.rs`)
- Removed migrated `mcp/` module (MCP error types live in `mcp-context-server`)
- Removed unused `errors.rs` (`CoreError` wrapper had no consumers)
- Removed empty `tests/mcp_error_schema.rs`
- `CacheIndex` serialization: added `#[serde(transparent)]` so `index.json` serializes as a flat map (was wrapped in `{"entries": {...}}`)
- Duplicate document ID detection: `CacheBuilder::build()` now rejects duplicate IDs after sorting
- Document version verification on load: `load_documents()` recomputes the content hash and compares it against the manifest entry
- `document_model.md`: metadata extraction and frontmatter parsing scoped to post-v0
- `context_selection.md`: output schema aligned to the normative `context.resolve.md`
- `context_selection.md`: removed `documents_excluded_by_score` (v0 doesn't exclude by score)
- `milestone_zero.md`: output contract fixed (`metadata` → `selection`, added missing fields, referenced the normative spec)
- `milestone_zero.md`: changed the false "provenance" claim to "version and scoring explanation"
- `cache_invariants.rs` (2) — index key sorting, filename collision detection
- `cache_lifecycle.rs` (10) — determinism, config changes, UTF-8, manifest bytes, corruption, ID normalization, metadata isolation, newline handling, metadata precedence
- `cache_manifest.rs` (2) — version determinism, config change effects
- `context_selection.rs` (3) — zero budget, sorting order, tie-breaking
- `determinism.rs` (4) — manifest serialization, document serialization, selection output, end-to-end determinism
- `document_model.rs` (6) — document invariants
- `end_to_end_golden.rs` (2) — manifest byte comparison, corruption detection
- `golden_selection_contract.rs` (1) — selection output structure validation
- `golden_selection_logic.rs` (1) — end-to-end selection determinism
- `golden_serialization.rs` (2) — manifest and document serialization snapshots
- `selection_invariants.rs` (1) — token bounds, scores, content accuracy
- `selection_logic.rs` (3) — budget constraints, sorting, tie-breaking
- Cache verification function — `context_cache.md` specifies a verification operation that checks:
  1. Manifest exists and is valid JSON
  2. Cache version matches the recomputed hash
  3. Every document file exists
  4. Every document file hash matches its filename
  5. No orphan files in `documents/`

  No standalone `verify_cache()` function exists. The individual checks are partially covered by `load_documents()` (checks 1, 3, 4 via version verification), but no single function runs all five checks and reports the results. Needed by both the CLI `inspect --verify` and the MCP `inspect_cache` tool.
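Two of the five checks (document files exist; no orphans in `documents/`) reduce to a set comparison between what the manifest lists and what is on disk. A hypothetical, pure sketch of that comparison — the real `verify_cache()` (which does not exist yet) would also parse `manifest.json`, recompute the cache version, and re-hash each file:

```rust
use std::collections::BTreeSet;

// Compare manifest-expected filenames against files actually present.
// Returns (missing document files, orphan files) in sorted order.
fn diff_files(
    expected: &BTreeSet<String>, // filenames the manifest says must exist
    present: &BTreeSet<String>,  // filenames actually found in documents/
) -> (Vec<String>, Vec<String>) {
    let missing = expected.difference(present).cloned().collect();
    let orphans = present.difference(expected).cloned().collect();
    (missing, orphans)
}

fn main() {
    let expected: BTreeSet<String> = BTreeSet::from(["aaa111.json".to_string()]);
    let present: BTreeSet<String> = BTreeSet::from(["bbb222.json".to_string()]);
    let (missing, orphans) = diff_files(&expected, &present);
    assert_eq!(missing, vec!["aaa111.json"]);
    assert_eq!(orphans, vec!["bbb222.json"]);
}
```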
- `DocumentSource` trait + `RawDocument` type — define the connector interface in a `document::source` module. All enterprise connectors implement this trait. `RawDocument` carries pre-ingestion content + metadata.
- `ConnectorError` type — error variants: `AuthenticationFailed`, `FetchFailed`, `InvalidContent`, `PartialFetch`.
- Canonicalization utilities — `document::canonicalize` module: line ending normalization, trailing whitespace trimming, trailing empty line removal, Unicode NFC normalization. Deterministic ordering of all transforms.
- `FilesystemSource` reference connector — migrate the existing walkdir-based ingestion to the `DocumentSource` trait. Must produce byte-identical caches to the current `build` path.
- `ingest_from_source()` pipeline — orchestrates `source.fetch_documents()` → UTF-8 validation → `Document::ingest()`. Configurable error policy (skip-and-warn vs abort-all).
- `unicode-normalization` dependency — add with `default-features = false` for NFC normalization.
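The planned canonicalization transforms can be sketched in a fixed order with plain std (the function name is an assumption — the `document::canonicalize` module does not exist yet). Unicode NFC is omitted here because it requires the `unicode-normalization` crate:

```rust
// Assumed transform order: line endings, then per-line trailing whitespace,
// then trailing empty lines. A fixed order keeps the output deterministic.
fn canonicalize(input: &str) -> String {
    let unified = input.replace("\r\n", "\n"); // 1. CRLF -> LF
    let mut lines: Vec<String> = unified
        .split('\n')
        .map(|line| line.trim_end().to_string()) // 2. trailing whitespace
        .collect();
    while lines.last().map_or(false, |line| line.is_empty()) {
        lines.pop(); // 3. trailing empty lines
    }
    lines.join("\n")
}

fn main() {
    assert_eq!(canonicalize("a  \r\nb\r\n\r\n"), "a\nb");
}
```

Note that running this before `Document::ingest()` would change versions for CRLF sources, since versioning today deliberately does no newline normalization; that is exactly why the transforms are scoped to the connector layer.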
- Duplicate test consolidation — several test files contain identical or near-identical tests:
  - `cache_lifecycle.rs` and `document_model.rs` share 5+ identical tests
  - `cache_manifest.rs` duplicates 2 tests from `cache_lifecycle.rs`
  - `context_selection.rs` and `selection_logic.rs` contain the same 3 tests
  - `determinism.rs` and `golden_serialization.rs` share tests
  - `end_to_end_golden.rs` duplicates tests from `cache_lifecycle.rs`

  Consolidate to avoid maintenance burden and test confusion.
- Cache rebuild determinism — no test verifies that building a cache twice from the same documents produces byte-identical `manifest.json` (the `created_at` timestamp will differ). The `cache_version` field will match, but the full file will not. This is spec-correct (`created_at` is informational) but should be tested explicitly.
- Duplicate document ID test — no test exercises the new `DuplicateDocumentId` error path.
- Version verification test — no test exercises the version mismatch detection in `load_documents()` (e.g., corrupt a document file after build, verify the load fails).
- Edge cases not covered:
  - Empty document set (build a cache with 0 documents)
  - Single-document cache
  - Very large document (multi-MB content)
  - Document with empty content (`""`)
  - Query with special characters, punctuation
  - Budget of 1 (smaller than any document)
  - All documents have score 0.0
- `context inspect` support — expose an `inspect_cache()` function returning cache metadata (document count, total size, cache version, validity). Needed by the MCP `inspect_cache` tool and the CLI.
- Cache rebuild (force) — `CacheBuilder` rejects existing output dirs. A `rebuild()` method or `--force` equivalent that removes and rebuilds would match the spec's rebuild command.
- `Deserialize` for `Query` — `Query` derives `Clone` and `Debug` but not `Deserialize`. Adding it would allow JSON deserialization of queries (useful for test fixtures).
- Document field ordering guarantee — the spec says documents are serialized with a fixed field order (`id`, `version`, `source`, `content`, `metadata`). Serde's default struct serialization preserves declaration order, which matches the spec, but this is implicit: a `#[serde(rename_all)]` or a field reorder would silently break it. Consider adding a golden test that asserts the field order explicitly.
| # | Issue | Resolution |
|---|-------|------------|
| 1 | `documents_excluded_by_score` in selection output | Removed from `context_selection.md`; `context.resolve.md` is normative and doesn't include it |
| 2 | `metadata` vs `selection` key in output | Updated `milestone_zero.md` to use `selection` |
| 3 | `cache_version` in output | Updated `milestone_zero.md` to match the normative spec (no `cache_version`) |
| 4 | Automatic metadata extraction scope | Deferred to post-v0 in `document_model.md` |
| 5 | MCP error types — single source of truth | Deleted from `context-core`; MCP types live in `mcp-context-server` |
```
context-core/
├── Cargo.toml
├── progress.md                  ← this file
├── spec_refs.md
├── src/
│   ├── lib.rs                   module declarations
│   │
│   ├── types/
│   │   ├── mod.rs               re-exports
│   │   ├── identifiers.rs       DocumentId, DocumentVersion
│   │   └── context_bundle.rs    Query, SelectionResult, etc.
│   │
│   ├── document/
│   │   ├── mod.rs               re-exports
│   │   ├── document.rs          Document struct + ingest()
│   │   ├── metadata.rs          Metadata, MetadataValue
│   │   └── parser.rs            placeholder (future parsing hooks)
│   │
│   ├── cache/
│   │   ├── mod.rs               re-exports
│   │   ├── cache.rs             ContextCache (runtime read-only wrapper)
│   │   ├── versioning.rs        CacheManifest, CacheBuildConfig, CacheIndex
│   │   └── invalidation.rs      CacheBuilder (build logic)
│   │
│   ├── selection/
│   │   ├── mod.rs               ContextSelector + three-phase pipeline
│   │   ├── ranking.rs           Scorer, TermFrequencyScorer, TokenCounter
│   │   ├── budgeting.rs         apply_budget (greedy selection)
│   │   └── filters.rs           placeholder (future filtering)
│   │
│   └── compression/
│       ├── mod.rs               module declaration
│       └── summarizer.rs        placeholder (future compression)
│
└── tests/
    ├── cache_invariants.rs           2 tests — index sorting, collision
    ├── cache_lifecycle.rs            10 tests — determinism, config, corruption
    ├── cache_manifest.rs             2 tests — version determinism, config changes
    ├── context_selection.rs          3 tests — budget, sorting, ties
    ├── determinism.rs                4 tests — serialization + e2e determinism
    ├── document_model.rs             6 tests — document invariants
    ├── end_to_end_golden.rs          2 tests — manifest bytes, corruption
    ├── golden_selection_contract.rs  1 test — output structure
    ├── golden_selection_logic.rs     1 test — e2e selection determinism
    ├── golden_serialization.rs       2 tests — serialization snapshots
    ├── selection_invariants.rs       1 test — bounds + explainability
    └── selection_logic.rs            3 tests — budget, sorting, ties
```
```toml
[dependencies]
sha2 = "0.10"        # SHA-256 hashing
hex = "0.4"          # Hex encoding
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
thiserror = "1.0"    # Error derive macros
chrono = { version = "0.4", features = ["serde", "clock"], default-features = false } # created_at timestamps

[dev-dependencies]
tempfile = "3.24.0"
```