feat(server): canonicalize transcripts at ingest v1 #618
Merged: wileland merged 3 commits into develop from codex/implement-transcript-canonicalization-at-ingest on Feb 20, 2026
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,45 +1,55 @@ | ||
| { | ||
| "task_id": "phase0-spine-lockdown-2026-02-19", | ||
| "title": "Phase 0 Spine Lockdown: freeze contract vocab, kill ambiguous receipt offsets, harden emission + narrative policy", | ||
| "task_id": "phase1_ingest_canonicalization_2026_02_20", | ||
| "title": "Phase 1: Canonicalize transcript at ingest + stable transcriptHash", | ||
| "summary": "Implement ingest-time transcript canonicalization (NFKC + punctuation folding + line-ending normalization + BOM/null stripping) with versioning. Store rawTranscript + canonicalTranscript + transcriptHash + canonicalizationVersion on Entry for all write paths (upload route + GraphQL addEntry/updateEntry + any other transcript writers). Add deterministic tests for the canonicalization corpus. Do not bulk-migrate existing entries; freeze legacy entries at canonicalizationVersion=0/null and only apply v1 on new/updated transcripts going forward.", | ||
| "base_branch": "develop", | ||
| "branch_name": "codex/phase0-spine-lockdown-exec-2026-02-19", | ||
| "summary": "Seal the Meaning Spine by freezing contract reason codes, enforcing unique-match offset inference (ambiguity=poison), hardening validateReceipt (strict V1 never falls through), ensuring ENTRY_ANALYZED emits contract+sanitized cards only (no raw reflection text), and locking narrative toggle behind a shared policy utility that callers cannot override. Add/adjust regression tests to prevent drift.", | ||
| "branch_name": "codex/implement-transcript-canonicalization-at-ingest", | ||
| "repo_scope": [ | ||
| "codex/tasks/latest.json", | ||
| "server/models/Entry.js", | ||
| "server/routes/upload.js", | ||
| "server/graphql/resolvers/index.js", | ||
| "server/src/workers/scribe.worker.js", | ||
| "server/src/workers/reflection.worker.js", | ||
| "server/src/utils/truthValidator.js", | ||
| "server/src/utils/**", | ||
| "server/src/workers/__tests__/**", | ||
| "server/utils/**", | ||
| "server/models/__tests__/**", | ||
| "server/src/**/__tests__/**", | ||
| "server/tests/**", | ||
| "docs/testing-doctrine.md" | ||
| "server/__tests__/**", | ||
| "server/routes/__tests__/**", | ||
| "scripts/codex_preflight.mjs", | ||
| "codex/tasks/latest.json" | ||
| ], | ||
| "agents_involved": ["codex_web"], | ||
| "risk_level": "low", | ||
| "agents_involved": ["codex-web"], | ||
| "risk_level": "medium", | ||
| "tests_to_run": [ | ||
| "node -e \"JSON.parse(require('fs').readFileSync('codex/tasks/latest.json','utf8')); console.log('latest.json ok')\"", | ||
| "node scripts/codex_preflight.mjs --ci", | ||
| "pnpm -C server test" | ||
| ], | ||
| "constraints": [ | ||
| "CODEX_WEB: Do NOT run git network commands (no git fetch/pull/push/clone). Use the UI “Create PR” button if a PR is needed.", | ||
| "CODEX_WEB_HEAD: In Codex Web, the checked-out branch name may be 'work'. Do NOT treat HEAD name mismatch as stale. Locks+canary are the source of truth.", | ||
| "ANTI-COP-OUT: No diff => no PR. If no actionable work exists, stop and report evidence.", | ||
| "SCOPE: Do not modify files outside repo_scope. If out-of-scope issues are found, produce a Repair Manifest instead of changing them.", | ||
| "ALIGNMENT: Print task_id/base_branch/branch_name/canary from latest.json before doing any work.", | ||
| "EVIDENCE_BUNDLE: Provide evidence in 4 phases: Alignment, Work-Exists Gate, Change Proof, Tests.", | ||
| "PR_BASE: Ensure PR base branch is develop (not another codex/* branch). Do not create draft PRs.", | ||
| "NO_PLACEHOLDERS: Do not create empty directories or placeholder files. Only create files with real content and tests.", | ||
| "NO_NETWORK: Tests must not touch real external network services." | ||
| "Codex Web environment: do NOT run git push; use the Create PR button.", | ||
| "Do NOT create placeholder files or empty directories. If no diff is needed, stop and report; do not create a PR.", | ||
| "All changes must remain within repo_scope. If a necessary fix is out-of-scope, produce a Repair Manifest instead of changing it.", | ||
| "Canonicalization happens at ingest/write time only (identity). Do not re-canonicalize during validation except legacy v0 fallback.", | ||
| "Do NOT bulk-migrate existing stored transcripts. Implement freeze+version: legacy entries are v0/null; new writes become v1.", | ||
| "Hashing must be based on canonicalTranscript and must NOT use locale-sensitive casefolding (no toLowerCase/toUpperCase on hash inputs).", | ||
| "No raw user transcript content may be logged or emitted into events as part of this change." | ||
| ], | ||
| "acceptance_checks": [ | ||
| "Alignment Evidence: show codex/tasks/latest.json values for task_id, base_branch, branch_name, and canary.", | ||
| "Alignment Evidence: print `git rev-parse --abbrev-ref HEAD` and `git rev-parse HEAD` for evidence; do NOT stop on SHA mismatch.", | ||
| "Work-Exists Gate: prove target symbols exist via grep or file navigation; if not found, stop and report: findReceiptOffsets (or equivalent), emitEntryAnalyzed callsite/payload, sanitizeBloomCardsWithContract boundary, validateReceipt in server/src/utils/truthValidator.js (or its imported helpers).", | ||
| "Freeze contract reason codes: add a shared constants module and replace raw string comparisons/assignments in Meaning Spine paths touched by this task.", | ||
| "Unique Match Rule: any transcript-search offset inference must return null on ambiguous multi-occurrence matches (firstIndex !== lastIndex). Ambiguity must drop the receipt/card safely and be reflected in contract/dropped reasons.", | ||
| "validateReceipt hardening: strict V1 path must not fall through to weaker matching if offsets fail; invalid shapes return explicit failure reasons and do not throw.", | ||
| "Emission hardening: ENTRY_ANALYZED payload must contain sanitized cards AND the Meaning Contract ledger; payload must not include raw reflection text anywhere.", | ||
| "Tests: add/adjust regression tests that fail if raw model output leaks into emission serialization; add/adjust tests verifying ambiguous quote matches are dropped.", | ||
| "Proof: include git status -sb and git diff --stat after changes; run tests_to_run and report results. (Run `pnpm -w test` locally after PR if desired.)" | ||
| "Alignment Evidence: print task_id, base_branch, branch_name, repo_scope, tests_to_run at start of run.", | ||
| "Work-Exists Gate: identify all transcript write paths (upload.js, GraphQL addEntry/updateEntry, scribe worker transcript persistence) and show exact files/lines to be changed.", | ||
| "Implement a single ingest canonicalization function (v1) using NFKC + punctuation folding + newline normalization + BOM/null stripping + internal whitespace folding (preserve newlines) + trim; store canonicalizationVersion='1'.", | ||
| "Entry stores rawTranscript (untouched) and canonicalTranscript (canonicalized). transcriptHash is sha256(canonicalTranscript).", | ||
| "All transcript-writing paths set/update canonical fields consistently when transcript changes.", | ||
| "Add/extend deterministic tests covering: smart quotes folding, dash folding, ellipsis folding, CRLF/CR normalization, BOM/null stripping, internal whitespace folding (tabs/multi-spaces without breaking newlines), and idempotency (canon(canon(x))==canon(x)).", | ||
| "Run tests_to_run and show outputs. If any test is skipped, explain why and provide a safe alternative.", | ||
| "Change Proof: show git status -sb and git diff --stat at end. No diff => no PR." | ||
| ], | ||
| "canary": "CANARY_PHASE0_SPINE_LOCKDOWN_2026_02_19" | ||
| "locks": { | ||
| "task_id": "phase1_ingest_canonicalization_2026_02_20", | ||
| "base_branch": "develop", | ||
| "branch_name": "codex/implement-transcript-canonicalization-at-ingest", | ||
| "canary": "PHASE1_INGEST_CANON_V1_CANARY_2026_02_20" | ||
| } | ||
| } |
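
The acceptance checks in the task file describe the v1 canonicalization steps and the stored fields, but the diff shown here contains only `latest.json`. As a rough illustration, the steps could be sketched as below. This is not the merged implementation: the function names `canonicalizeTranscriptV1` and `buildTranscriptFields`, the exact set of folded code points, and the order of the steps are all assumptions made for this example. Only the field names (`rawTranscript`, `canonicalTranscript`, `transcriptHash`, `canonicalizationVersion`) and the list of steps come from the task file.

```javascript
// Hypothetical sketch of ingest-time canonicalization (v1); not the merged code.
const crypto = require('node:crypto');

const CANONICALIZATION_VERSION = '1';

// Assumed step order: strip BOM/nulls, normalize newlines, NFKC,
// fold punctuation, fold internal whitespace per line, trim.
function canonicalizeTranscriptV1(raw) {
  let text = String(raw ?? '');

  // Strip byte-order marks and null characters.
  text = text.replace(/[\uFEFF\u0000]/g, '');

  // Normalize CRLF and lone CR to LF.
  text = text.replace(/\r\n?/g, '\n');

  // Unicode compatibility normalization.
  text = text.normalize('NFKC');

  // Fold smart quotes, dashes, and ellipsis to ASCII (assumed code-point set).
  text = text
    .replace(/[\u2018\u2019\u201A\u201B]/g, "'")
    .replace(/[\u201C\u201D\u201E\u201F]/g, '"')
    .replace(/[\u2010-\u2015\u2212]/g, '-')
    .replace(/\u2026/g, '...');

  // Collapse runs of non-newline whitespace to one space and trim each line,
  // so newlines are preserved.
  text = text
    .split('\n')
    .map((line) => line.replace(/[^\S\n]+/g, ' ').trim())
    .join('\n');

  return text.trim();
}

// Builds the fields the task file says an Entry should store.
// The hash input is the canonical text as-is: no case folding is applied.
function buildTranscriptFields(rawTranscript) {
  const canonicalTranscript = canonicalizeTranscriptV1(rawTranscript);
  const transcriptHash = crypto
    .createHash('sha256')
    .update(canonicalTranscript, 'utf8')
    .digest('hex');
  return {
    rawTranscript, // stored untouched
    canonicalTranscript,
    transcriptHash,
    canonicalizationVersion: CANONICALIZATION_VERSION,
  };
}
```

Under these assumptions, two uploads that differ only in line endings or smart punctuation would produce the same `transcriptHash`, while `rawTranscript` keeps the original bytes for audit. Legacy entries would simply never pass through this function, which matches the freeze-at-v0/null constraint.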