A toy implementation of regular expressions using finite automata. Its API is modeled after
the standard re module, so it can be used as a drop-in replacement (not all re features
are supported, though, and it behaves differently in edge cases).
Intended as a white-box implementation, it gives full tracing output for parsing and matching. Diagrams of abstract syntax tree and the automaton are also available.
pip install regex-automata
import regex_automata as re
import logging
logging.basicConfig(level=logging.INFO) # show verbose output
pattern = re.compile(r"(foo)*bar|baz") # regex_automata.Pattern
m = pattern.fullmatch("foofoobar") # regex_automata.Match
pattern.fullmatch("foo") # None
pattern.ast # regex_automata.parser.ast.AstNode
pattern.raw_ast # regex_automata.parser.ast.AstNode
pattern.nfa # regex_automata.automata.nfa.NFA
pattern.render_ast("regex_ast.svg")
pattern.render_ast("regex_ast_raw.svg", raw=True)
pattern.render_nfa("regex_nfa.svg")
pattern2 = re.compile(r"[a-z_-][a-z0-9_-]*", re.IGNORECASE)
pattern2.tokens
# [CharacterSet(span=(0, 7), text='[a-z_-]', set=RangeSet(((45, 46), (95, 96), (97, 123)))),
# CharacterSet(span=(7, 17), text='[a-z0-9_-]', set=RangeSet(((45, 46), (48, 58), (95, 96), (97, 123)))),
# Repetition(span=(17, 18), text='*', min=0, max=None)]
list(re.finditer(r"[0-9]{2,}", "123"))
# [<Match span=(0, 3), match='123'>]
list(re.finditer(r"[0-9]{2,}", "123", all_matches=True))
# [<Match span=(0, 3), match='123'>, <Match span=(0, 2), match='12'>, <Match span=(1, 3), match='23'>]Abstract syntax tree of "(foo)*bar|baz" (ie. pattern.ast):
Finite automaton accepting "(foo)*bar|baz" (ie. pattern.nfa):
regex-automata is generally compatible with re - features either work as intended or
fail with regex_automata.errors.UnsupportedSyntaxError. In some edge cases, results differ:
most notably when there are multiple greedy quantifiers next to each other.
regex-automata passes 299/305 re pattern tests,
with additional 98 tests ignored due to testing unsupported features.
-
Library
match(),fullmatch(),search(),finditer(),sub(),subn()methodsMatchobject containing span, matched text and groups- flags
DOTALL,IGNORECASEandMULTILINE
-
Syntax
- character sets:
.,[...](special sequences such as\ware supported, but not inside square brackets) - repetition:
*,?,+,{n,k} - boundary assertions:
^,$,\b,\B,\A,\Z - groups:
(...),(?:...),(?P<name>...) - inline flags eg.
(?i) - comments
(?#...)
- character sets:
-
Notable features that are not supported by this library but are in standard
re:- backreferences in patterns (
\1,\g<1>etc.) - lookahead/lookbehind assertions (
(?=...),(?!...),(?<=...),(?<!...)) - non-greedy and possessive repetition (
*?,*+, etc.) bytessupportUNICODEflag, non-ASCII meaning for\detc.
- backreferences in patterns (
- Input pattern is tokenized via
regex_automata.parser.tokenizer.Tokenizer- Characters and sets are represented with
regex_automata.automata.rangeset.RangeSet
- Characters and sets are represented with
- List of tokens is processed by recursive descent parser
regex_automata.parser.parser.Parser - Parser produces "raw" abstract syntax tree composed of
regex_automata.parser.ast.AstNodenodes - AST is processed with
regex_automata.parser.ast_processor.ASTProcessorto produce the final tree- This is used to replace fancy repetition with primitives (union, concatenation, iteration)
- Epsilon-free NFA is recursively constructed from the AST using
regex_automata.regex.nfa_builder.NFABuilder - The processed pattern is stored in
regex_automata.regex.pattern.Pattern, which is the high-level interface - When processing input text, the text and NFA are passed to
regex_automata.regex.nfa_evaluator.NFAEvaluator - The evaluator produces
regex_automata.regex.match.Matchobjects
The recursive descent parser uses the following LL(1) grammar:
1. E → F E'
2. E' → | E
3. E' → ε
4. F → G F'
5. F' → G F'
6. F' → ε
7. G → H G'
8. G' → *
9. G' → ε
10. H → ( E )
11. H → a (a character or character set)
12. E → ε
13. F → ε
14. G → boundary_assertion (one of: ^, $, \A, \Z, \b, \B)
Which is derived from the following CFG:
E → F | E
E → F
E → ε
F → G F
F → G
G → H *
G → H
G → boundary_assertion
H → ( E )
H → a
Which is derived from the following CFG:
E → E | E
E → E E
E → E *
E → ( E )
E → a
E → boundary_assertion
E → ε
MIT, see LICENSE.txt.