
Optimise regular expressions by precompiling them #140

Closed
jelmervdl wants to merge 1 commit

Conversation


jelmervdl commented Sep 14, 2023

I've replaced most regexps with precompiled versions.

My theory is that some of these expressions are impressively long because we're importing Perl character groups into them. These long pattern strings really slow down the regexp cache lookups: each call ends up doing a massive string comparison to check whether the key matches the compiled version in the regexp compilation cache (which is just a dict).

I've skipped some of the smaller ones for now, since their compile-cache lookup time will be minuscule.
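
For illustration, the change is roughly of this shape (a simplified sketch with an example pattern from the diff; the helper names are made up):

import re

# Before: the pattern string is passed to re.search on every call, so re has to
# look it up in its internal compilation cache (a dict keyed by the pattern) each time.
def is_dash(token):
    return re.search(r"^[-–]$", token)

# After: compile once at import time and reuse the compiled object.
DASH_RE = re.compile(r"^[-–]$")

def is_dash_precompiled(token):
    return DASH_RE.search(token)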

Also tested, because @ZJaume mentioned it, but currently not part of this pull request:

pip install pyre2
sed -i 's/import re/import re2 as re/g' tokenize.py

Command:

time cat ../empty-trainer/contrib/test-data/clean.zhen.10k | cut -f2 | python -m sacremoses -l en detokenize > /dev/null

User times (averaged over 3 runs, but they're very consistent):

branch            time
Main              9.04s
This              3.28s
This with pyre2   2.71s

and re.search(r"^[-–]$", tokens[i + 1])
and re.search(r"^li$|^mail.*", tokens[i + 2], re.IGNORECASE)
): # In Perl, ($words[$i+2] =~ /^li$|^mail.*/i)
# In Czech, right-shift "-li" and a few Czech dashed words (e.g. e-mail)
detokenized_text += prepend_space + token + tokens[i + 1]
next(tokens, None) # Advance over the dash
next(tokens, None) # Advance over the dash #TODO this is a bug, tokens is a list, not the iterator!

Also, I noticed this usage of next(tokens, None), which can't work, since tokens is the list rather than an iterator. We're iterating over enumerate(iter(tokens)) (that iter is a bit superfluous, but allowed).

This code would work if we had something like this:

token_it = iter(tokens)
for i, token in enumerate(token_it):
  if ...:
    next(token_it, None)  # advances the shared iterator, so the loop skips the next token


IS_SC = make_search(r"^[" + IsSc + r"\(\[\{\¿\¡]+$")

Why are these \¿\¡ escaped?
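
A quick check (my own illustration, leaving IsSc out of the class for brevity) suggests the backslashes are redundant, since none of these characters are special inside a character class:

import re

# Both patterns accept exactly the same characters: inside [...] the backslashes
# before ( [ { ¿ ¡ only mark them as literals, which they already are.
escaped = re.compile(r"^[\(\[\{\¿\¡]+$")
plain = re.compile(r"^[([{¿¡]+$")
assert bool(escaped.match("¡¿(")) == bool(plain.match("¡¿(")) == True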

@jelmervdl

I checked the tokenize command against the main branch, and there's little performance benefit, if any. Patterns are already compiled and cached, and _compile is no longer the top time-consuming function.

… And this fits my hypothesis well! It is only in the detokenizer that patterns are dynamically constructed, e.g. lines such as re.search(r"^['][{}]".format(self.IsAlpha), token). self.IsAlpha is massive. Because the format() call constructs a new str every time, a cheap id(a) == id(b) check can't identify the two strings as identical, and a full == comparison between them has to happen during the key lookup in the regexp compilation cache.

With that in mind, I'm checking whether just moving the format() call out of the function body and into the object initialisation is sufficient. If compiling ahead of time gives no further measurable benefit on top of that, I won't go through with this.
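
To make that concrete, a minimal sketch of the idea (simplified; the attribute and method names here are made up, and IsAlpha stands in for the real, very long character-class string):

import re

class MosesDetokenizer:
    IsAlpha = "a-zA-Z"  # stand-in for the huge Perl-derived character class

    def __init__(self):
        # format() (and optionally compile) runs once per object instead of once
        # per token, so no giant pattern string is rebuilt and compared on every call.
        self.APOS_ALPHA_RE = re.compile(r"^['][{}]".format(self.IsAlpha))

    def starts_with_apostrophe_alpha(self, token):
        return self.APOS_ALPHA_RE.search(token)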

@ZJaume

ZJaume commented Sep 14, 2023

When you say little performance benefit, do you mean less than what is reported in #133? For me, those numbers seem enough to merge ahead-of-time compilation.

@jelmervdl

If I measure tokenisation, #133 is working better than what I attempted:

main: cat big.txt | python -m sacremoses -l en tokenize > /dev/null
  Time (mean ± σ):     33.647 s ±  0.401 s    [User: 33.144 s, System: 0.448 s]
  Range (min … max):   33.198 s … 34.533 s    10 runs

#133: cat big.txt | python -m sacremoses -l en tokenize > /dev/null
  Time (mean ± σ):     30.709 s ±  0.085 s    [User: 30.214 s, System: 0.479 s]
  Range (min … max):   30.589 s … 30.879 s    10 runs

this: cat big.txt | python -m sacremoses -l en tokenize > /dev/null
  Time (mean ± σ):     31.099 s ±  0.235 s    [User: 30.578 s, System: 0.495 s]
  Range (min … max):   30.736 s … 31.439 s    10 runs

Detokenisation, not so much. But I can fix that retroactively after merging #133:

#133: cat big.txt | python -m sacremoses -l en detokenize > /dev/null
  Time (mean ± σ):     35.786 s ±  0.612 s    [User: 35.058 s, System: 0.475 s]
  Range (min … max):   34.669 s … 36.835 s    10 runs

this: cat big.txt | python -m sacremoses -l en detokenize > /dev/null
  Time (mean ± σ):     12.485 s ±  0.194 s    [User: 12.073 s, System: 0.392 s]
  Range (min … max):   12.251 s … 12.775 s    10 runs

@jelmervdl

Replaced by #133 and #143

jelmervdl closed this Sep 14, 2023
jelmervdl deleted the regexp-optim branch September 14, 2023 15:39