
refactor(web): split tokenization realignment from evaluateTransition 🚂#15191

Merged
jahorton merged 2 commits into epic/autocorrect from
refactor/web/realign-tokenization
Mar 16, 2026

Contributor

@jahorton jahorton commented Nov 19, 2025

With the various ways that tokenizations can transition depending upon which potential inputs are applied, it's possible for multiple different tokenizations to transition into the same one. As such, there will no longer be "just one" way that a tokenization is reached. Accordingly, it's best to perform word-boundary realignment operations (splits, merges) separately from text-editing operations (inserts, deletes).

Fortunately, it's possible to enact this before multi-tokenization. It may even be advantageous to do so for clarity's sake - this makes clear which portions of the operations are for context word-boundary realignment and which are for actual context transition.

Build-bot: skip build:web
Test-bot: skip
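
As a rough illustration of the separation described above, here is a minimal sketch in TypeScript. The types and function names (`Realignment`, `TextEdit`, `realign`, `applyEdits`) are hypothetical and are not taken from the Keyman codebase; tokens are modeled as plain strings purely for the example. Phase 1 only moves word boundaries, leaving the concatenated text unchanged, and phase 2 only changes text.

```typescript
// Hypothetical, simplified model; none of these names come from the actual code.
type Realignment =
  | { kind: 'split'; index: number; at: number } // split token `index` at char offset `at`
  | { kind: 'merge'; index: number };            // merge token `index` with token `index + 1`

type TextEdit =
  | { kind: 'insert'; index: number; text: string }   // append `text` to token `index`
  | { kind: 'delete'; index: number; count: number }; // drop `count` trailing chars from token `index`

// Phase 1: word-boundary realignment. The joined text stays the same.
function realign(tokens: readonly string[], ops: readonly Realignment[]): string[] {
  const result = [...tokens];
  for (const op of ops) {
    if (op.kind === 'split') {
      const token = result[op.index];
      result.splice(op.index, 1, token.slice(0, op.at), token.slice(op.at));
    } else {
      result.splice(op.index, 2, result[op.index] + result[op.index + 1]);
    }
  }
  return result;
}

// Phase 2: text editing. Assumes boundaries were already realigned.
function applyEdits(tokens: readonly string[], edits: readonly TextEdit[]): string[] {
  const result = [...tokens];
  for (const edit of edits) {
    const token = result[edit.index] ?? '';
    result[edit.index] = edit.kind === 'insert'
      ? token + edit.text
      : token.slice(0, Math.max(0, token.length - edit.count));
  }
  return result;
}

// Usage: merge "can" + "'" first, then apply the incoming "t" as a pure text edit.
const start = ['I', ' ', 'can', "'"];
const aligned = realign(start, [{ kind: 'merge', index: 2 }]);
const edited = applyEdits(aligned, [{ kind: 'insert', index: 2, text: 't' }]);
console.log(aligned, edited);
```

Because the realignment step is independent of the edit, two different source tokenizations could in principle be realigned to the same boundaries before the same edit is applied, which is the situation the description anticipates.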


keymanapp-test-bot bot commented Nov 19, 2025

User Test Results

Test specification and instructions

User tests are not required

@keymanapp-test-bot keymanapp-test-bot bot changed the title from "refactor(web): split tokenization realignment from evaluateTransition" to "refactor(web): split tokenization realignment from evaluateTransition 🚂" Nov 19, 2025
@keymanapp-test-bot keymanapp-test-bot bot added this to the A19S16 milestone Nov 19, 2025
@keyman-server keyman-server modified the milestones: A19S16, A19S17 Nov 22, 2025
@keyman-server keyman-server modified the milestones: A19S17, A19S18 Dec 6, 2025
@keyman-server keyman-server modified the milestones: A19S18, A19S19 Dec 21, 2025
@keyman-server keyman-server modified the milestones: A19S19, A19S20 Jan 3, 2026
@keyman-server keyman-server modified the milestones: A19S20, A19S21 Jan 16, 2026
@jahorton jahorton force-pushed the feat/web/search-space-node-propagation branch from b31bcad to c303355 Compare January 21, 2026 21:52
@jahorton jahorton force-pushed the feat/web/search-space-node-propagation branch 3 times, most recently from beafeb6 to 36df714 Compare January 30, 2026 21:06
@keyman-server keyman-server modified the milestones: A19S21, A19S22 Jan 31, 2026
@jahorton jahorton force-pushed the refactor/web/realign-tokenization branch from 49391d5 to 3473c6f Compare February 3, 2026 14:48
@jahorton jahorton force-pushed the feat/web/search-space-node-propagation branch from 36df714 to c2e0427 Compare February 5, 2026 19:43
@jahorton jahorton force-pushed the refactor/web/realign-tokenization branch from 3473c6f to 4f257f5 Compare February 5, 2026 19:44
@keyman-server keyman-server modified the milestones: A19S22, A19S23 Feb 13, 2026
@jahorton jahorton force-pushed the feat/web/search-space-node-propagation branch from c2e0427 to 3713e6a Compare March 4, 2026 18:20
@jahorton jahorton force-pushed the refactor/web/realign-tokenization branch from 4f257f5 to 3ef2d2a Compare March 4, 2026 18:25
@jahorton jahorton force-pushed the feat/web/search-space-node-propagation branch from 3713e6a to 0fa7e6b Compare March 5, 2026 12:54
@jahorton jahorton force-pushed the refactor/web/realign-tokenization branch from 3ef2d2a to 4450e14 Compare March 5, 2026 12:55
@jahorton jahorton force-pushed the feat/web/search-space-node-propagation branch 3 times, most recently from c5f9b66 to 91cf42e Compare March 10, 2026 16:41
@jahorton jahorton changed the base branch from feat/web/search-space-node-propagation to feat/web/test-quotient-specialized-splits March 11, 2026 16:04
@jahorton jahorton force-pushed the refactor/web/realign-tokenization branch 2 times, most recently from a4843aa to 7e07e77 Compare March 11, 2026 20:22
readonly transitionEdits?: {
addedNewTokens: boolean,
removedOldTokens: boolean,
// NOTE: slated for removal in an upcoming PR. Exists in this form to
Contributor Author

Currently, #15727.

@jahorton jahorton force-pushed the refactor/web/realign-tokenization branch from 7e07e77 to 41beca3 Compare March 12, 2026 17:36
@jahorton jahorton changed the base branch from feat/web/test-quotient-specialized-splits to refactor/web/root-and-legacy-spur-tests March 12, 2026 17:36
@jahorton jahorton requested review from ermshiperete and mcdurdin and removed request for ermshiperete March 12, 2026 17:36
@jahorton jahorton marked this pull request as ready for review March 12, 2026 17:38
@jahorton jahorton requested a review from ermshiperete March 12, 2026 17:44

// Assumption: inputs.length > 0. (There is at least one input transform.)
const inputTransformKeys = [...inputs[0].sample.keys()];
const baseTailIndex = (tailTokenization.length - 1);
Contributor

Shouldn't this be done after removing the tokens from tailTokenization? Otherwise baseTailIndex might point to an index that is no longer valid.

Contributor Author

@jahorton jahorton Mar 13, 2026

No, this is still correct. Inputs to be applied are tokenized elsewhere, and those tokens are indexed relative to this specific index: the location of the last pre-edit context token. When tokens are removed from tailTokenization, the 'first' such token index (and possibly more!), as obtained from inputs[0].sample.keys(), will be negative.

This is enforced in ContextTokenization.mapWhitespacedTokenization and .assembleTransforms, which together produce the key-values obtained by the block of code reviewed here.
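
To make the relative indexing concrete, here is a small TypeScript sketch. The helper name `toAbsoluteIndices` is hypothetical, and the rule "absolute index = baseTailIndex + key" is my reading of the comment above, not code from ContextTokenization.

```typescript
// Hypothetical helper; the name and signature are illustrative only.
function toAbsoluteIndices(tokenCount: number, relativeKeys: readonly number[]): number[] {
  // Captured before any tokens are removed: the index of the last pre-edit token.
  const baseTailIndex = tokenCount - 1;
  // Assumption: each key is an offset from baseTailIndex, so negative keys
  // point at tokens that precede the pre-edit tail token.
  return relativeKeys.map((key) => baseTailIndex + key);
}

// Three pre-edit tokens, so baseTailIndex is 2.
// Key -1 reaches back one token; key 1 lands one past the old tail.
console.log(toAbsoluteIndices(3, [-1, 0, 1])); // [ 1, 2, 3 ]
```

Under this reading, computing baseTailIndex after removing tokens would shift every key's meaning, which is why it is captured first.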

…on/context-tokenization.ts

Co-authored-by: Eberhard Beilharz <ermshiperete@users.noreply.github.com>
@keyman-server keyman-server modified the milestones: A19S24, A19S25 Mar 14, 2026
Base automatically changed from refactor/web/root-and-legacy-spur-tests to epic/autocorrect March 16, 2026 13:12
@jahorton jahorton merged commit fa0c7ae into epic/autocorrect Mar 16, 2026
7 of 8 checks passed
@jahorton jahorton deleted the refactor/web/realign-tokenization branch March 16, 2026 13:12
@github-project-automation github-project-automation bot moved this from Todo to Done in Keyman Mar 16, 2026
3 participants