wip: accelerate tokenizator with simd by dkharms · Pull Request #365 · ozontech/seq-db

dkharms · 2026-02-28T10:13:34Z

Description

Something I have wanted to try for quite a long time.

For now, this implementation does not handle case sensitivity (at least it passes the TestTokenizeSimple test) and does not handle non-ASCII sequences.
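Since the vectorized path is ASCII-only for now, a caller would need some guard before using it. The sketch below is an assumption, not code from this PR: `isASCII` is a hypothetical helper that tests eight bytes per step by checking the high bit of each byte, so non-ASCII input can be routed to the existing pure Go tokenizer.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// isASCII reports whether every byte of s is below 0x80.
// Hypothetical helper: it loads eight bytes at a time as one uint64 and
// tests the high bit of each byte lane, then checks the tail byte by byte.
func isASCII(s []byte) bool {
	const highBits = 0x8080808080808080
	for len(s) >= 8 {
		if binary.LittleEndian.Uint64(s)&highBits != 0 {
			return false
		}
		s = s[8:]
	}
	for _, c := range s {
		if c >= 0x80 {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isASCII([]byte("GET /api/v1/search 200"))) // ASCII only
	fmt.Println(isASCII([]byte("запрос отклонён")))        // multi-byte UTF-8
}
```

An input that fails this check would simply take the scalar path, so correctness would not depend on the SIMD code understanding UTF-8.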

The most annoying part is the CGO call overhead: for small strings it adds a noticeable cost (around 20ns to 40ns per call).

However, for long strings we get roughly a 20% speedup (about 18% less time per op, or 22% more throughput):

tokenizer · 0-simd-tokenizer± ⟩ benchstat pure.txt simd.txt
goos: linux
goarch: amd64
pkg: github.com/ozontech/seq-db/tokenizer
cpu: 12th Gen Intel(R) Core(TM) i5-12600K
                     │   pure.txt   │               simd.txt               │
                     │    sec/op    │    sec/op     vs base                │
Tokenize/short_24B      75.32n ± 2%   103.35n ± 2%  +37.21% (p=0.000 n=10)
Tokenize/medium_150B    265.4n ± 2%    270.9n ± 1%   +2.05% (p=0.019 n=10)
Tokenize/long_900B     1200.5n ± 8%    982.5n ± 1%  -18.15% (p=0.000 n=10)
geomean                 288.4n         301.9n        +4.65%

                     │   pure.txt   │               simd.txt               │
                     │     B/s      │     B/s       vs base                │
Tokenize/short_24B     303.9Mi ± 2%   221.5Mi ± 2%  -27.10% (p=0.000 n=10)
Tokenize/medium_150B   495.9Mi ± 2%   485.9Mi ± 1%   -2.01% (p=0.019 n=10)
Tokenize/long_900B     699.0Mi ± 8%   854.1Mi ± 1%  +22.19% (p=0.000 n=10)
geomean                472.3Mi        451.4Mi        -4.43%

So I guess the right move is to wait until the SIMD API is stabilized (hopefully in Go 1.27) and then rewrite the logic using the intrinsics that Go provides.
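For reference, one common way to contain a fixed per-call cost like the one measured above is to dispatch on input length, so short strings never pay it. This is a sketch under stated assumptions and not what this PR does: `simdThreshold`, `newDispatcher`, and `tokenizeScalar` are hypothetical names, the 256-byte cutoff is only a guess placed between the 150B and 900B benchmark cases, and the scalar stand-in splits on anything that is not an ASCII letter or digit rather than following the real tokenizer rules.

```go
package main

import "fmt"

// simdThreshold is a hypothetical cutoff; the real value would have to be
// found by benchmarking on the target hardware.
const simdThreshold = 256

type tokenizeFunc func(text []byte) [][]byte

// tokenizeScalar is a simplified stand-in for the pure Go tokenizer:
// a token is a maximal run of ASCII letters and digits.
func tokenizeScalar(text []byte) [][]byte {
	var tokens [][]byte
	start := -1
	for i, c := range text {
		isTokenByte := c >= '0' && c <= '9' || c >= 'a' && c <= 'z' || c >= 'A' && c <= 'Z'
		switch {
		case isTokenByte && start < 0:
			start = i // a new token begins here
		case !isTokenByte && start >= 0:
			tokens = append(tokens, text[start:i])
			start = -1
		}
	}
	if start >= 0 {
		tokens = append(tokens, text[start:])
	}
	return tokens
}

// newDispatcher returns a tokenizer that calls fast only for inputs of at
// least threshold bytes and falls back to scalar for everything shorter.
func newDispatcher(scalar, fast tokenizeFunc, threshold int) tokenizeFunc {
	return func(text []byte) [][]byte {
		if len(text) < threshold {
			return scalar(text)
		}
		return fast(text)
	}
}

func main() {
	// The fast path reuses the scalar code here; a real build would plug in
	// the vectorized routine instead.
	tokenize := newDispatcher(tokenizeScalar, tokenizeScalar, simdThreshold)
	for _, tok := range tokenize([]byte("level=error msg=timeout")) {
		fmt.Printf("%s\n", tok)
	}
}
```

The trade-off is an extra branch and two code paths to keep behaviorally identical, which is part of why waiting for native intrinsics may still be the simpler route.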

I have read and followed all requirements in CONTRIBUTING.md;
I used LLM/AI assistance to make this pull request;

If you have used LLM/AI assistance please provide model name and full prompt:

Model: {{model-name}}
Prompt: {{prompt}}

github-actions · 2026-02-28T10:13:52Z

❌ PR Title Validation Failed
Please refer to CONTRIBUTING.md

github-actions · 2026-02-28T10:14:06Z

❌ PR Title Validation Failed
Please refer to CONTRIBUTING.md

dkharms · 2026-02-28T10:15:37Z

@seqbenchbot list

seqbenchbot · 2026-02-28T10:15:40Z

Nice, @dkharms <(-^,^-)=b!

Here is the list of currently running benchmarks:

ID	Author	Scenario	Baseline	Comparison	Created
`74abf34c`	@eguguchkin	`bulk`	`main`	`47-active2`	2026-02-27 18:48:10

Have a great time!

dkharms · 2026-02-28T10:15:49Z

@seqbenchbot down 74abf34c

seqbenchbot · 2026-02-28T10:15:52Z

Nice, @dkharms <(-^,^-)=b!

The benchmark with identificator 74abf34c was finished.
I've prepared a summary for you. Click on Show summary button to see it:

Show summary

Query	Type	`mean (ms)`			`stddev (ms)`			`p(50) (ms)`			`p(95) (ms)`			`p(99) (ms)`			`iterations`
Query	Type	base	comp	diff	base	comp	diff	base	comp	diff	base	comp	diff	base	comp	diff	base	comp	diff
`bulk`	warm	`21.93`	`20.14`	`-8.19%`	`5.20`	`3.63`	`-30.15%`	`20.00`	`19.00`	`-5.00%`	`32.00`	`28.00`	`-12.50%`	`43.00`	`33.00`	`-23.26%`	`1336536.00`	`1335966.00`	`-0.04%`

Have a great time!

dkharms · 2026-02-28T10:16:15Z

@seqbenchbot up main bulk

seqbenchbot · 2026-02-28T10:16:18Z

Nice, @dkharms <(-^,^-)=b!

Your request was successfully served.
Identificator for your ongoing benchmark - fd66c3b4.

Here is a list of helpful links:

Take a look at Grafana dashboard;
Live-tailing logs are also available;

Have a great time!

dkharms · 2026-02-28T11:33:14Z

@seqbenchbot down fd66c3b4

seqbenchbot · 2026-02-28T11:33:17Z

Nice, @dkharms <(-^,^-)=b!

The benchmark with identificator fd66c3b4 was finished.
I've prepared a summary for you. Click on Show summary button to see it:

Show summary

Query	Type	`mean (ms)`			`stddev (ms)`			`p(50) (ms)`			`p(95) (ms)`			`p(99) (ms)`			`iterations`
Query	Type	base	comp	diff	base	comp	diff	base	comp	diff	base	comp	diff	base	comp	diff	base	comp	diff
`bulk`	warm	`44.23`	`44.39`	`+0.35%`	`17.05`	`16.82`	`-1.38%`	`40.00`	`40.00`	`0.00%`	`77.00`	`76.00`	`-1.30%`	`112.00`	`109.50`	`-2.23%`	`106431.00`	`106415.00`	`-0.02%`

Have a great time!

github-actions · 2026-02-28T11:35:03Z

❌ PR Title Validation Failed
Please refer to CONTRIBUTING.md

dkharms added 2 commits February 27, 2026 20:57

feat: initial impementation of vectorized processing

187a6d2

perf: reduce allocations

e1d18b5

dkharms wants to merge 2 commits into main from 0-simd-tokenizer
