
Inefficient Regular Expression Complexity in nltk (word_tokenize, sent_tokenize)

High
tomaarsen published GHSA-f8m6-h2c7-8h9x Dec 23, 2021

Package

nltk (pip)

Affected versions

< 3.6.6

Patched versions

3.6.6

Description

Impact

The vulnerability is present in PunktSentenceTokenizer, sent_tokenize and word_tokenize. Any users of this class, or these two functions, are vulnerable to a Regular Expression Denial of Service (ReDoS) attack.
In short, a specifically crafted long input to any of these vulnerable functions will cause them to take a significant amount of execution time. The effect of this vulnerability is noticeable with the following example:

import time

from nltk.tokenize import word_tokenize

n = 8
for length in [10**i for i in range(2, n)]:
    # Prepare a malicious input
    text = "a" * length
    start_t = time.time()
    # Call `word_tokenize` and naively measure the execution time
    word_tokenize(text)
    print(f"A length of {length:<{n}} takes {time.time() - start_t:.4f}s")

This gave the following output during testing:

A length of 100      takes 0.0060s
A length of 1000     takes 0.0060s
A length of 10000    takes 0.6320s
A length of 100000   takes 56.3322s
...

I canceled the execution of the program after it had been running for several hours.

If your program relies on any of the vulnerable functions for tokenizing unpredictable user input, we strongly recommend upgrading to a version of NLTK without the vulnerability, or applying the workaround described below.

Patches

The problem has been patched in NLTK 3.6.6. After the fix, running the above program gives the following result:

A length of 100      takes 0.0070s
A length of 1000     takes 0.0010s
A length of 10000    takes 0.0060s
A length of 100000   takes 0.0400s
A length of 1000000  takes 0.3520s
A length of 10000000 takes 3.4641s

This output shows a roughly linear relationship between execution time and input length, which is the desired behavior for regular expressions.
We recommend updating to NLTK 3.6.6+ if possible (for example, via pip install --upgrade nltk).
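If you cannot verify the installed version centrally, it may also be useful to fail fast at startup when an outdated NLTK is present. The following is a minimal sketch; it assumes a plain major.minor.patch version string such as "3.6.6":

import nltk

# Parse a "3.6.6"-style version string into a comparable tuple.
installed = tuple(int(part) for part in nltk.__version__.split(".")[:3])
if installed < (3, 6, 6):
    raise RuntimeError(
        f"nltk {nltk.__version__} is affected by CVE-2021-43854; "
        "upgrade to 3.6.6 or later"
    )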

Workarounds

The execution time of the vulnerable functions is exponential in the length of a malicious input. In other words, the execution time can be bounded by limiting the maximum length of the input passed to any of the vulnerable functions. Our recommendation is to implement such a limit, as sketched below.
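As a rough illustration, such a limit could be enforced with a small wrapper. The 10,000-character cap and the name safe_word_tokenize below are illustrative choices, not part of NLTK; tune the cap to the longest legitimate input you expect:

from nltk.tokenize import word_tokenize

# Illustrative cap, not an NLTK constant; adjust for your application.
MAX_TOKENIZE_LENGTH = 10_000

def safe_word_tokenize(text):
    # Reject oversized inputs before they reach the vulnerable regular expressions.
    if len(text) > MAX_TOKENIZE_LENGTH:
        raise ValueError(
            f"Refusing to tokenize {len(text)} characters "
            f"(limit: {MAX_TOKENIZE_LENGTH})"
        )
    return word_tokenize(text)

Truncating the input instead of rejecting it would bound the execution time equally well, at the cost of silently dropping text.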


Severity

High

CVSS overall score

This score calculates overall vulnerability severity from 0 to 10 and is based on the Common Vulnerability Scoring System (CVSS).
7.5 / 10

CVSS v3 base metrics

Attack vector: Network
Attack complexity: Low
Privileges required: None
User interaction: None
Scope: Unchanged
Confidentiality: None
Integrity: None
Availability: High

CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H

CVE ID

CVE-2021-43854

Weaknesses

CWE-1333: Inefficient Regular Expression Complexity
