Optimise regular expressions by precompiling them #140
Conversation
I've replaced most regexps with precompiled versions. My theory is that some of these expressions are impressively long, since we import Perl character groups into them. These then really slow down the regexp cache lookups: each call ends up doing a massive string comparison to check whether the key matches the compiled version in the regexp compilation cache (which is just a dict). I've skipped some of the smaller ones for now, since their compile-cache lookup time will be minuscule.
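A minimal sketch of the approach (illustrative: `make_search` is the helper used in the diff below, but this is just one plausible definition of it, and the pattern is a stand-in):

```python
import re

def make_search(pattern, flags=0):
    # Compile once at import time and return the bound .search method,
    # so hot paths skip re's per-call compile-cache lookup entirely.
    return re.compile(pattern, flags).search

# Stand-in for a long Perl-derived character class.
IS_ALNUM = make_search(r"^[0-9A-Za-z]+$")

assert IS_ALNUM("abc123")
assert not IS_ALNUM("abc 123")
```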
```diff
             and re.search(r"^[-–]$", tokens[i + 1])
             and re.search(r"^li$|^mail.*", tokens[i + 2], re.IGNORECASE)
         ):  # In Perl, ($words[$i+2] =~ /^li$|^mail.*/i)
             # In Czech, right-shift "-li" and a few Czech dashed words (e.g. e-mail)
             detokenized_text += prepend_space + token + tokens[i + 1]
-            next(tokens, None)  # Advance over the dash
+            next(tokens, None)  # Advance over the dash  # TODO this is a bug, tokens is a list, not the iterator!
```
Also, I noticed this usage of `next(tokens, None)`, which shouldn't be allowed, since `tokens` is the list. We're iterating over `enumerate(iter(tokens))` (that `iter` there is a bit superfluous, but allowed). This code would work if we had something like this:
```python
token_it = iter(tokens)
for i, token in enumerate(token_it):
    if ...:
        next(token_it, None)
```
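A quick illustration of why the original line fails (the default argument to `next` only suppresses `StopIteration`, not the type check):

```python
tokens = ["e", "-", "mail"]

it = iter(tokens)
next(it, None)      # OK: advances the iterator past "e"

next(tokens, None)  # TypeError: 'list' object is not an iterator
```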
```python
IS_SC = make_search(r"^[" + IsSc + r"\(\[\{\¿\¡]+$")
```
Why are these `\¿\¡` escaped?
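(Illustrative check, not from the thread: the escapes appear redundant, since none of these characters are metacharacters inside a character class, and Python's re treats a backslash before punctuation as the literal character.)

```python
import re

# Both classes match the same strings; the backslashes before ¿ and ¡
# (and before ( [ { inside the class) are redundant but harmless.
assert re.search(r"^[\(\[\{\¿\¡]+$", "¿¡([{")
assert re.search(r"^[([{¿¡]+$", "¿¡([{")
```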
I checked the … And this fits well with my hypothesis! It is only in the detokenizer that patterns are dynamically constructed, e.g. lines such as …. With that in mind, I'm seeing whether just moving the …
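An illustrative micro-benchmark of the dynamic-construction cost (names and sizes are made up, and numbers will vary): a freshly concatenated pattern string forces re's compile cache to hash and compare the full key on every call, whereas a precompiled regex skips the lookup entirely.

```python
import re
import timeit

IsSc = "0-9A-Za-z" * 200  # stand-in for a long Perl-derived character class

def dynamic(text):
    # The pattern string is rebuilt on every call, so re must look it up
    # in its compile cache (a dict keyed on the whole pattern string).
    return re.search(r"^[" + IsSc + r"]+$", text)

IS_SC = re.compile(r"^[" + IsSc + r"]+$").search  # built once

print(timeit.timeit(lambda: dynamic("abc123"), number=100_000))
print(timeit.timeit(lambda: IS_SC("abc123"), number=100_000))
```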
When you say "little performance benefit", do you mean less than what is reported in #133? For me, those numbers seem enough to merge ahead-of-time compilation.
If I measure tokenisation, #133 is working better than what I attempted:
Detokenisation, not so much. But I can fix that retroactively after merging #133:
Also tested, because @ZJaume mentioned it, but currently not in this pull request:

```sh
pip install pyre2
sed 's/import re/import re2 as re/g' tokenizer.py
```
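A guarded variant of the same swap (a sketch: pyre2 exposes its module as `re2`), falling back to the stdlib when pyre2 isn't installed:

```python
# Prefer the RE2 engine when available, otherwise use the stdlib re.
try:
    import re2 as re
except ImportError:
    import re
```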
Command:
User times (averaged over 3 runs, but they're very consistent):