-
Notifications
You must be signed in to change notification settings - Fork 833
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add possessive quantifiers to avoid catastrophic backtracking (#258)
Fixes the crash in #245 by prohibiting the regex engine from backtracking catastrophically via [possessive quantifiers](https://www.regular-expressions.info/possessive.html). <img width="400" alt="image" src="https://github.com/openai/tiktoken/assets/1841944/ed341153-4cf4-4c1c-93d6-3f5e32133569"> Interestingly these possesives make the encoding a lot faster again in `fancy-regex`. Before this change (but with large byte pair merge PR cherry-picked): ``` num_threads: 1, num_bytes: 98379553 tiktoken 11,946,036 bytes / s tiktoken 11,961,343 bytes / s tiktoken 11,995,846 bytes / s tiktoken 11,951,263 bytes / s tiktoken 11,983,405 bytes / s ``` Same, with these changes applied: ``` num_threads: 1, num_bytes: 98379553 tiktoken 14,511,827 bytes / s tiktoken 14,638,134 bytes / s tiktoken 14,644,029 bytes / s tiktoken 14,729,030 bytes / s tiktoken 14,666,903 bytes / s ``` Updating the regex libs makes it a tiny bit faster still: ``` num_threads: 1, num_bytes: 98379553 tiktoken 14,485,590 bytes / s tiktoken 14,854,049 bytes / s tiktoken 14,891,086 bytes / s tiktoken 14,843,007 bytes / s tiktoken 14,874,520 bytes / s ``` This is almost 2x faster than [before any of the optimizations](#234). ------- Opened an issue for increasing the [default backtrack limit](https://github.com/fancy-regex/fancy-regex/blob/bf2c807447f72ee20ae839e0f8cb3a06fc79982c/src/lib.rs#L407), see: fancy-regex/fancy-regex#134, but it shouldn't be necessary here anymore. --------- Co-authored-by: Lőrinc <lorinc.pap@gmail.com>
- Loading branch information
Showing
4 changed files
with
43 additions
and
11 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters