
Optimise regular expressions by precompiling them #140

Closed
jelmervdl wants to merge 1 commit

Conversation


jelmervdl commented Sep 14, 2023

I've replaced most regexps with precompiled versions.

My theory is that some of these expressions are impressively long because we're importing Perl character groups into them. These long pattern strings really slow down the regexp cache lookups: each call ends up doing a massive string comparison to check whether the key matches the compiled version in the regexp compilation cache (which is just a dict).

I've skipped some of the smaller ones for now, since their compile-cache lookup time will be minuscule.
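
For illustration, the change is roughly of this shape (a simplified sketch with an example pattern from the diff; the helper names are made up):

import re

# Before: the pattern string is passed to re.search on every call, so re has to
# look it up in its internal compilation cache (a dict keyed by the pattern) each time.
def is_dash(token):
    return re.search(r"^[-–]$", token)

# After: compile once at import time and reuse the compiled object.
DASH_RE = re.compile(r"^[-–]$")

def is_dash_precompiled(token):
    return DASH_RE.search(token)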

Also tested, because @ZJaume mentioned it, but currently not part of this pull request:

pip install pyre2
sed -i 's/import re/import re2 as re/g' tokenize.py

Command:

time cat ../empty-trainer/contrib/test-data/clean.zhen.10k | cut -f2 | python -m sacremoses -l en detokenize > /dev/null

User times (averaged over 3 runs, but they're very consistent):

branch            time
Main              9.04s
This              3.28s
This with pyre2   2.71s

and re.search(r"^[-–]$", tokens[i + 1])
and re.search(r"^li$|^mail.*", tokens[i + 2], re.IGNORECASE)
): # In Perl, ($words[$i+2] =~ /^li$|^mail.*/i)
# In Czech, right-shift "-li" and a few Czech dashed words (e.g. e-mail)
detokenized_text += prepend_space + token + tokens[i + 1]
next(tokens, None) # Advance over the dash
next(tokens, None) # Advance over the dash #TODO this is a bug, tokens is a list, not the iterator!

Also, I noticed this usage of next(tokens, None), which can't work, since tokens is the list rather than an iterator. We're iterating over enumerate(iter(tokens)) (that iter is a bit superfluous, but allowed).

This code would work if we had something like this:

token_it = iter(tokens)
for i, token in enumerate(token_it):
  if ...:
    next(token_it, None)  # advances the shared iterator, so the loop skips the next token


IS_SC = make_search(r"^[" + IsSc + r"\(\[\{\¿\¡]+$")

Why are these \¿\¡ escaped?
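
A quick check (my own illustration, leaving IsSc out of the class for brevity) suggests the backslashes are redundant, since none of these characters are special inside a character class:

import re

# Both patterns accept exactly the same characters: inside [...] the backslashes
# before ( [ { ¿ ¡ only mark them as literals, which they already are.
escaped = re.compile(r"^[\(\[\{\¿\¡]+$")
plain = re.compile(r"^[([{¿¡]+$")
assert bool(escaped.match("¡¿(")) == bool(plain.match("¡¿(")) == True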

@jelmervdl

I checked the tokenize command against the main branch, and there's little performance benefit, if any. Patterns are already compiled and cached, and _compile is no longer the top time-consuming function.

… And this fits my hypothesis well! It is only in the detokenizer that patterns are dynamically constructed, e.g. lines such as re.search(r"^['][{}]".format(self.IsAlpha), token). self.IsAlpha is massive. Because the format() call constructs a new str every time, a cheap id(a) == id(b) check can't identify the two strings as identical, and a full == comparison between them has to happen during the key lookup in the regexp compilation cache.

With that in mind, I'm checking whether just moving the format() call out of the function body and into the object initialisation is sufficient. If compiling ahead of time gives no further measurable benefit on top of that, I won't go through with this.
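
To make that concrete, a minimal sketch of the idea (simplified; the attribute and method names here are made up, and IsAlpha stands in for the real, very long character-class string):

import re

class MosesDetokenizer:
    IsAlpha = "a-zA-Z"  # stand-in for the huge Perl-derived character class

    def __init__(self):
        # format() (and optionally compile) runs once per object instead of once
        # per token, so no giant pattern string is rebuilt and compared on every call.
        self.APOS_ALPHA_RE = re.compile(r"^['][{}]".format(self.IsAlpha))

    def starts_with_apostrophe_alpha(self, token):
        return self.APOS_ALPHA_RE.search(token)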

@ZJaume

ZJaume commented Sep 14, 2023

When you say little performance benefit, do you mean less than what is reported in #133? For me, those numbers seem enough to merge ahead-of-time compilation.

@jelmervdl

If I measure tokenisation, #133 is working better than what I attempted:

main: cat big.txt | python -m sacremoses -l en tokenize > /dev/null
  Time (mean ± σ):     33.647 s ±  0.401 s    [User: 33.144 s, System: 0.448 s]
  Range (min … max):   33.198 s … 34.533 s    10 runs

#133: cat big.txt | python -m sacremoses -l en tokenize > /dev/null
  Time (mean ± σ):     30.709 s ±  0.085 s    [User: 30.214 s, System: 0.479 s]
  Range (min … max):   30.589 s … 30.879 s    10 runs

this: cat big.txt | python -m sacremoses -l en tokenize > /dev/null
  Time (mean ± σ):     31.099 s ±  0.235 s    [User: 30.578 s, System: 0.495 s]
  Range (min … max):   30.736 s … 31.439 s    10 runs

Detokenisation, not so much. But I can fix that retroactively after merging #133:

#133: cat big.txt | python -m sacremoses -l en detokenize > /dev/null
  Time (mean ± σ):     35.786 s ±  0.612 s    [User: 35.058 s, System: 0.475 s]
  Range (min … max):   34.669 s … 36.835 s    10 runs

this: cat big.txt | python -m sacremoses -l en detokenize > /dev/null
  Time (mean ± σ):     12.485 s ±  0.194 s    [User: 12.073 s, System: 0.392 s]
  Range (min … max):   12.251 s … 12.775 s    10 runs

@jelmervdl

Replaced by #133 and #143

jelmervdl closed this Sep 14, 2023
jelmervdl deleted the regexp-optim branch September 14, 2023 15:39