Fix BPE merge parsing to handle HuggingFace tokenizer.json format #159

mergennachin · 2025-12-08T21:34:27Z

The BPE merge parsing code incorrectly assumed merges were arrays of
two elements (["a", "b"]), but HuggingFace tokenizer.json uses
space-separated strings ("a b") as the standard format.

This fix:

- Adds support for legacy string format: "token1 token2" (standard HF format)
- Keeps support for tuple array format: ["token1", "token2"] (for tokens with spaces)
- Skips #version header lines (matching HuggingFace Rust tokenizers behavior)

The implementation follows the HuggingFace Rust tokenizers library
(huggingface/tokenizers) which handles both formats in
tokenizers/src/models/bpe/serialization.rs.
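
For reviewers who want the three cases side by side, here is a minimal Python sketch of the parsing rules described above. It is an illustration only, not the code in this PR: `parse_merges` is a hypothetical name, and raising `ValueError` on a malformed entry is an assumption about error handling rather than a statement of what the patched code does.

```python
import json


def parse_merges(merges):
    """Turn a tokenizer.json "merges" array into a list of (left, right) pairs.

    Hypothetical helper for illustration; not the function changed in this PR.
    """
    pairs = []
    for entry in merges:
        if isinstance(entry, str):
            # Legacy format: a single string such as "a b".
            if entry.startswith("#version"):
                # Header line, not a merge rule.
                continue
            parts = entry.split(" ")
            if len(parts) != 2:
                # Assumption: anything other than exactly two tokens is an error.
                raise ValueError(f"malformed merge entry: {entry!r}")
            pairs.append((parts[0], parts[1]))
        elif (
            isinstance(entry, list)
            and len(entry) == 2
            and all(isinstance(tok, str) for tok in entry)
        ):
            # Tuple format: ["a", "b"]; tokens may themselves contain spaces.
            pairs.append((entry[0], entry[1]))
        else:
            raise ValueError(f"malformed merge entry: {entry!r}")
    return pairs


# Two small inputs, one per format, shaped like the "model" section of tokenizer.json.
legacy = json.loads('{"model": {"merges": ["#version: 0.2", "a b", "ab c"]}}')
tuples = json.loads('{"model": {"merges": [["a", "b"], ["a b", "c"]]}}')

legacy_pairs = parse_merges(legacy["model"]["merges"])
tuple_pairs = parse_merges(tuples["model"]["merges"])
```

The second input shows why the tuple format is kept: a token such as "a b" contains a space, so it cannot be expressed unambiguously in the space-separated string form.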

Added tests for both merge formats to verify correct parsing.

Test Plan:

mkdir build-test
cd build-test
cmake ../test
cmake --build . --target test_hf_tokenizer
cd ..
RESOURCES_PATH="test/resources" ./build-test/test_hf_tokenizer

meta-codesync · 2025-12-08T21:57:20Z

@mergennachin has imported this pull request. If you are a Meta employee, you can view this in D88675986.

Summary:
The BPE merge parsing code incorrectly assumed merges were arrays of two elements (["a", "b"]), but HuggingFace tokenizer.json uses space-separated strings ("a b") as the standard format.

This fix:

- Adds support for legacy string format: "token1 token2" (standard HF format)
- Keeps support for tuple array format: ["token1", "token2"] (for tokens with spaces)
- Skips #version header lines (matching HuggingFace Rust tokenizers behavior)

The implementation follows the HuggingFace Rust tokenizers library (huggingface/tokenizers) which handles both formats in tokenizers/src/models/bpe/serialization.rs.

Added tests for both merge formats to verify correct parsing.

Test Plan:

```
mkdir build-test
cd build-test
cmake ../test
cmake --build . --target test_hf_tokenizer
cd ..
RESOURCES_PATH="test/resources" ./build-test/test_hf_tokenizer
```

Reviewed By: larryliu0820

Differential Revision: D88675986

Pulled By: mergennachin

meta-codesync · 2025-12-08T22:17:05Z

@mergennachin has exported this pull request. If you are a Meta employee, you can view the originating Diff in D88675986.

meta-cla bot added the CLA Signed label Dec 8, 2025

mergennachin requested a review from larryliu0820 December 8, 2025 21:34

mergennachin force-pushed the fix-bpe-merge-parsing branch from e2e854e to 20b0ba9 Compare December 8, 2025 21:37

mergennachin marked this pull request as ready for review December 8, 2025 21:45

larryliu0820 approved these changes Dec 8, 2025

facebook-github-bot force-pushed the fix-bpe-merge-parsing branch from 20b0ba9 to 6c37dc8 Compare December 8, 2025 22:16

meta-codesync bot added fb-exported meta-exported labels Dec 8, 2025

meta-codesync bot merged commit 92cf202 into main Dec 10, 2025
7 of 9 checks passed
