Conversation

@CloseChoice
Contributor

@CloseChoice CloseChoice commented Oct 21, 2025

closes #1819

We already support `add_prefix_space` for `processors.ByteLevel`; it is just not documented because this class is defined via pyo3's pyclass, see here. The signature of the `new` function, which mirrors Python's `__new__` method, is further down: https://github.com/huggingface/tokenizers/blob/main/bindings/python/src/processors.rs#L492.

@CloseChoice CloseChoice changed the title Docs add prefix space DOCS: add add_prefix_space to processors.ByteLevel Oct 21, 2025
Collaborator

@ArthurZucker ArthurZucker left a comment

LGTM, but let's update to a proper doc

Comment on lines 487 to 489
///
/// add_prefix_space (:obj:`bool`, `optional`, defaults to :obj:`True`):
/// Whether the add_prefix_space option was enabled during pre-tokenization. This
Collaborator

The doc is wrong in the sense that this neither removes the space in the post-processor nor adds it. It just accounts for it in the offset computation if `trim_offsets` is enabled.

                    // If we are processing the first pair of offsets, with `add_prefix_space`,
                    // then we shouldn't remove anything we added. If there are more than one
                    // leading spaces though, it means we didn't add them, and they should be
                    // removed.

try to incorporate this please!
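
To make the point above concrete, here is a minimal pure-Python sketch of the offset trimming that the quoted code comment describes. It is an illustration, not the library's implementation: the function name `trim_offsets_sketch`, the helper `_count_leading`, and the sample tokens and raw offsets are all assumptions chosen for the example. In particular, it assumes that for the input `hello world` with a prefix space added during pre-tokenization, the first token `Ġhello` carries the raw offsets `(0, 5)`.

```python
SPACE_MARKER = "Ġ"  # byte-level stand-in for a space character


def _count_leading(token):
    """Count leading characters that are the space marker or whitespace."""
    n = 0
    for ch in token:
        if ch == SPACE_MARKER or ch.isspace():
            n += 1
        else:
            break
    return n


def trim_offsets_sketch(tokens, offsets, add_prefix_space=True):
    """Hypothetical helper: shrink offsets so they exclude surrounding spaces.

    `add_prefix_space` does not add or remove any space here. It only tells
    the trimming step that a single leading space on the first token was
    inserted by the pre-tokenizer and so has no counterpart in the input.
    """
    trimmed = []
    for i, (token, (start, end)) in enumerate(zip(tokens, offsets)):
        leading = _count_leading(token)
        trailing = _count_leading(token[::-1])
        if leading > 0:
            is_first = i == 0 or start == 0
            if is_first and add_prefix_space and leading == 1:
                # The single leading space was added by us: nothing to trim.
                leading = 0
            start = min(start + leading, end)
        if trailing > 0 and end >= trailing:
            end = max(end - trailing, start)
        trimmed.append((start, end))
    return trimmed


# Assumed sample data for the text "hello world".
tokens = ["Ġhello", "Ġworld"]
offsets = [(0, 5), (5, 11)]

with_flag = trim_offsets_sketch(tokens, offsets, add_prefix_space=True)
without_flag = trim_offsets_sketch(tokens, offsets, add_prefix_space=False)
print(with_flag)
print(without_flag)
```

Under these assumptions, setting the flag to match the pre-tokenizer keeps the first span on `hello`, while leaving it off shifts the first start offset past a space that was never in the input.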

Contributor Author

Thanks for the review, I updated the docs. I somehow misunderstood the meaning of add_prefix_space for the postprocessor, but it simply refers to the preprocessor and has more of a technical meaning here. To actually understand the meaning I added some code to the tests and thought that might be helpful to keep. Let me know if you want further changes.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker
Collaborator

just missing make style

@CloseChoice
Contributor Author

> just missing make style

done

@ArthurZucker
Collaborator

AssertionError: The content of py_src/tokenizers/processors/__init__.pyi seems outdated, please run `python stub.py`

😉

@CloseChoice
Contributor Author

That was weird: nothing happened when I ran `python stub.py`; it only had an effect once I merged main. I hope I didn't miss something again, sorry for the inconvenience.

@ArthurZucker
Collaborator

You need to run `cargo build` or `maturin dev` first, then run `python stub.py`

@ArthurZucker ArthurZucker merged commit b83d7c9 into huggingface:main Nov 28, 2025
29 checks passed
@ArthurZucker
Collaborator

Ty!

Development

Successfully merging this pull request may close these issues.

Bug with add_prefix_space parameter for ByteLevel post-processor

3 participants