regex-automata

A toy implementation of regular expressions using finite automata. Its API is modeled after the standard re module, so it can be used as a drop-in replacement (not all re features are supported, though, and it behaves differently in edge cases).

Intended as a white-box implementation, it gives full tracing output for parsing and matching. Diagrams of abstract syntax tree and the automaton are also available.

Installation

pip install regex-automata

Usage

import regex_automata as re
import logging

logging.basicConfig(level=logging.INFO)             # show verbose output

pattern = re.compile(r"(foo)*bar|baz")              # regex_automata.Pattern

m = pattern.fullmatch("foofoobar")                  # regex_automata.Match
pattern.fullmatch("foo")                            # None

pattern.ast                                         # regex_automata.parser.ast.AstNode
pattern.raw_ast                                     # regex_automata.parser.ast.AstNode
pattern.nfa                                         # regex_automata.automata.nfa.NFA

pattern.render_ast("regex_ast.svg")
pattern.render_ast("regex_ast_raw.svg", raw=True)
pattern.render_nfa("regex_nfa.svg")

pattern2 = re.compile(r"[a-z_-][a-z0-9_-]*", re.IGNORECASE)
pattern2.tokens
# [CharacterSet(span=(0, 7), text='[a-z_-]', set=RangeSet(((45, 46), (95, 96), (97, 123)))),
#  CharacterSet(span=(7, 17), text='[a-z0-9_-]', set=RangeSet(((45, 46), (48, 58), (95, 96), (97, 123)))),
#  Repetition(span=(17, 18), text='*', min=0, max=None)]

list(re.finditer(r"[0-9]{2,}", "123"))
# [<Match span=(0, 3), match='123'>]
list(re.finditer(r"[0-9]{2,}", "123", all_matches=True))
# [<Match span=(0, 3), match='123'>, <Match span=(0, 2), match='12'>, <Match span=(1, 3), match='23'>]

Abstract syntax tree of "(foo)*bar|baz" (ie. pattern.ast):

Finite automaton accepting "(foo)*bar|baz" (ie. pattern.nfa):

Features compared to standard `re` module

regex-automata is generally compatible with re - features either work as intended or fail with regex_automata.errors.UnsupportedSyntaxError. In some edge cases, results differ: most notably when there are multiple greedy quantifiers next to each other.

regex-automata passes 299/305 re pattern tests, with additional 98 tests ignored due to testing unsupported features.

Library
- match(), fullmatch(), search(), finditer(), sub(), subn() methods
- Match object containing span, matched text and groups
- flags DOTALL, IGNORECASE and MULTILINE
Syntax
- character sets: ., [...] (special sequences such as \w are supported, but not inside square brackets)
- repetition: *, ?, +, {n,k}
- boundary assertions: ^, $, \b, \B, \A, \Z
- groups: (...), (?:...), (?P<name>...)
- inline flags eg. (?i)
- comments (?#...)
Notable features that are not supported by this library but are in standard re:
- backreferences in patterns (\1, \g<1> etc.)
- lookahead/lookbehind assertions ((?=...), (?!...), (?<=...), (?<!...))
- non-greedy and possessive repetition (*?, *+, etc.)
- bytes support
- UNICODE flag, non-ASCII meaning for \d etc.

Implementation overview

Input pattern is tokenized via regex_automata.parser.tokenizer.Tokenizer
- Characters and sets are represented with regex_automata.automata.rangeset.RangeSet
List of tokens is processed by recursive descent parser regex_automata.parser.parser.Parser
Parser produces "raw" abstract syntax tree composed of regex_automata.parser.ast.AstNode nodes
AST is processed with regex_automata.parser.ast_processor.ASTProcessor to produce the final tree
- This is used to replace fancy repetition with primitives (union, concatenation, iteration)
Epsilon-free NFA is recursively constructed from the AST using regex_automata.regex.nfa_builder.NFABuilder
The processed pattern is stored in regex_automata.regex.pattern.Pattern, which is the high-level interface
When processing input text, the text and NFA are passed to regex_automata.regex.nfa_evaluator.NFAEvaluator
The evaluator produces regex_automata.regex.match.Match objects

Grammar

The recursive descent parser uses the following LL(1) grammar:

 1.  E  → F E'
 2.  E' → | E
 3.  E' → ε
 4.  F  → G F'
 5.  F' → G F'
 6.  F' → ε
 7.  G  → H G'
 8.  G' → *
 9.  G' → ε
10.  H  → ( E )
11.  H  → a                     (a character or character set)
12.  E  → ε
13.  F  → ε
14.  G  → boundary_assertion    (one of: ^, $, \A, \Z, \b, \B)

Which is derived from the following CFG:

E → F | E
E → F
E → ε
F → G F
F → G
G → H *
G → H
G → boundary_assertion
H → ( E )
H → a

Which is derived from the following CFG:

E → E | E
E → E E
E → E *
E → ( E )
E → a
E → boundary_assertion
E → ε

License

MIT, see LICENSE.txt.

Name		Name	Last commit message	Last commit date
Latest commit History 61 Commits
.github/workflows		.github/workflows
src/regex_automata		src/regex_automata
static		static
tests		tests
.gitignore		.gitignore
LICENSE.txt		LICENSE.txt
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

regex-automata

Installation

Usage

Features compared to standard `re` module

Implementation overview

Grammar

License

About

Uh oh!

Languages

License

tkarabela/regex-automata

Folders and files

Latest commit

History

Repository files navigation

regex-automata

Installation

Usage

Features compared to standard re module

Implementation overview

Grammar

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Languages

Features compared to standard `re` module