Skip to content

Conversation

@mergennachin
Copy link
Contributor

@mergennachin mergennachin commented Dec 8, 2025

The BPE merge parsing code incorrectly assumed merges were arrays of
two elements (["a", "b"]), but HuggingFace tokenizer.json uses
space-separated strings ("a b") as the standard format.

This fix:

  • Adds support for legacy string format: "token1 token2" (standard HF format)
  • Keeps support for tuple array format: ["token1", "token2"] (for tokens with spaces)
  • Skips #version header lines (matching HuggingFace Rust tokenizers behavior)

The implementation follows the HuggingFace Rust tokenizers library
(huggingface/tokenizers) which handles both formats in
tokenizers/src/models/bpe/serialization.rs.

Added tests for both merge formats to verify correct parsing.

Test Plan:

mkdir build-test
cd build-test
cmake ../test
cmake --build . --target test_hf_tokenizer
cd ..
RESOURCES_PATH="test/resources" ./build-test/test_hf_tokenizer

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Dec 8, 2025
@mergennachin mergennachin force-pushed the fix-bpe-merge-parsing branch from e2e854e to 20b0ba9 Compare December 8, 2025 21:37
@mergennachin mergennachin marked this pull request as ready for review December 8, 2025 21:45
@meta-codesync
Copy link

meta-codesync bot commented Dec 8, 2025

@mergennachin has imported this pull request. If you are a Meta employee, you can view this in D88675986.

Summary:
The BPE merge parsing code incorrectly assumed merges were arrays of
two elements (["a", "b"]), but HuggingFace tokenizer.json uses
space-separated strings ("a b") as the standard format.

This fix:
- Adds support for legacy string format: "token1 token2" (standard HF format)
- Keeps support for tuple array format: ["token1", "token2"] (for tokens with spaces)
- Skips #version header lines (matching HuggingFace Rust tokenizers behavior)

The implementation follows the HuggingFace Rust tokenizers library
(huggingface/tokenizers) which handles both formats in
tokenizers/src/models/bpe/serialization.rs.

Added tests for both merge formats to verify correct parsing.


Test Plan:
```
mkdir build-test
cd build-test
cmake ../test
cmake --build . --target test_hf_tokenizer
cd ..
RESOURCES_PATH="test/resources" ./build-test/test_hf_tokenizer
```

Reviewed By: larryliu0820

Differential Revision: D88675986

Pulled By: mergennachin
@meta-codesync
Copy link

meta-codesync bot commented Dec 8, 2025

@mergennachin has exported this pull request. If you are a Meta employee, you can view the originating Diff in D88675986.

@meta-codesync meta-codesync bot merged commit 92cf202 into main Dec 10, 2025
7 of 9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot. fb-exported meta-exported

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants