refactor(web): split tokenization realignment from evaluateTransition #15191
jahorton merged 2 commits into epic/autocorrect from
Conversation
User Test Results (test specification and instructions): User tests are not required
readonly transitionEdits?: {
  addedNewTokens: boolean,
  removedOldTokens: boolean,
  // NOTE: slated for removal in an upcoming PR. Exists in this form to
With the various ways that tokenizations can transition depending upon which potential inputs are applied, it's possible for multiple different tokenizations to transition into the same one. As such, there will no longer be "just one" way that a tokenization is reached. Accordingly, it's best to perform word-boundary realignment operations (splits, merges) separately from text-editing operations (inserts, deletes).

Build-bot: skip build:web
Test-bot: skip
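The separation described above can be sketched as two independent phases: one that only moves word boundaries (splits and merges, leaving the underlying text unchanged) and one that only applies text edits. This is a minimal illustrative sketch; the type and function names here are hypothetical and not the actual Keyman predictive-text API.

```typescript
type Token = { text: string };

// Phase 1: word-boundary realignment only. Re-segments the tokens at the
// given character boundaries without inserting or deleting any text.
function realign(tokens: Token[], boundaries: number[]): Token[] {
  const text = tokens.map(t => t.text).join('');
  const out: Token[] = [];
  let start = 0;
  for (const b of [...boundaries, text.length]) {
    if (b > start) {
      out.push({ text: text.slice(start, b) });
      start = b;
    }
  }
  return out;
}

// Phase 2: text editing only. Applies a single insert/delete transform to
// the tail token, leaving all word boundaries where they were.
function applyEdit(tokens: Token[], deleteLeft: number, insert: string): Token[] {
  const tail = tokens[tokens.length - 1];
  const edited = tail.text.slice(0, tail.text.length - deleteLeft) + insert;
  return [...tokens.slice(0, -1), { text: edited }];
}
```

Because the phases are independent, any two tokenizations that converge on the same realigned state pass through identical edit logic afterward, regardless of which input produced them.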
web/src/engine/predictive-text/worker-thread/src/main/correction/context-tokenization.ts
// Assumption: inputs.length > 0. (There is at least one input transform.)
const inputTransformKeys = [...inputs[0].sample.keys()];
const baseTailIndex = (tailTokenization.length - 1);
Shouldn't this be done after removing the tokens from tailTokenization? Otherwise baseTailIndex might point to an index that is no longer valid.
No, this is still correct. Inputs to be applied are tokenized elsewhere, and those tokens are indexed relative to this specific index - the location of the last pre-edit context token. For such cases, the 'first' (and possibly more!) such token index (as obtained from `inputs[0].sample.keys()`) will be negative.
This is enforced in ContextTokenization.mapWhitespacedTokenization and .assembleTransforms, which together produce the key-values obtained by the block of code reviewed here.
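The indexing convention discussed in this thread can be illustrated with a small sketch. The names below are hypothetical (this is not the actual `ContextTokenization` code): per-token transform keys are offsets relative to `baseTailIndex`, the index of the last pre-edit context token, so a key of 0 targets the tail token itself while a negative key reaches back into tokens that already existed before the edit.

```typescript
type Transform = { insert: string; deleteLeft: number };

const contextTokens = ['the', 'quick', 'brow'];  // pre-edit context tokens
const baseTailIndex = contextTokens.length - 1;  // index of 'brow'

// Keys are offsets from baseTailIndex, not absolute indices: 0 edits the
// tail token; -1 would edit 'quick', and so on.
const perTokenTransforms = new Map<number, Transform>([
  [0, { insert: 'n', deleteLeft: 0 }],           // 'brow' -> 'brown'
]);

for (const [offset, t] of perTokenTransforms) {
  const i = baseTailIndex + offset;              // resolve to absolute index
  const old = contextTokens[i];
  contextTokens[i] = old.slice(0, old.length - t.deleteLeft) + t.insert;
}
```

This is why capturing `baseTailIndex` before removing tokens is safe: the transform keys were generated relative to that same pre-edit index, so both sides of the addition use the same frame of reference.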
…on/context-tokenization.ts
Co-authored-by: Eberhard Beilharz <ermshiperete@users.noreply.github.com>
With the various ways that tokenizations can transition depending upon which potential inputs are applied, it's possible for multiple different tokenizations to transition into the same one. As such, there will no longer be "just one" way that a tokenization is reached. Accordingly, it's best to perform word-boundary realignment operations (splits, merges) separately from text-editing operations (inserts, deletes).
Fortunately, it's possible to enact this before multi-tokenization. It may even be advantageous to do so for clarity's sake - this makes clear which portions of the operations are for context word-boundary realignment and which are for actual context transition.
Build-bot: skip build:web
Test-bot: skip