This project is a Go implementation of the grep command-line tool, built as part of the Codecrafters "Build your own grep" challenge. This implementation provides a hands-on understanding of regular expression parsing, pattern matching algorithms, and how text search utilities work under the hood.
This grep implementation covers the fundamental pattern matching operations that grep performs:
- Character Class Matching: Support for
\d(digits) and\w(word characters) patterns - Character Groups: Both positive
[abc]and negative[^abc]character sets - Anchors: Start-of-line
^and end-of-line$matching - Quantifiers: One-or-more
+and zero-or-one?repetition operators - Wildcard Matching: Single character wildcard
.support - Literal Character Matching: Direct character matching for simple patterns
- Complex Pattern Sequences: Combining multiple pattern elements in sequence
Note: I haven't implemented alternation (|) yet, but it's coming next.
- Modular Architecture: Clean separation of concerns with multiple matcher types:
- Character Class Matcher: Handles
\dand\wpatterns - Character Group Matchers: Support for both positive and negative character sets
- Literal Matcher: Direct character matching
- Core Matcher: Complex pattern sequences with anchors and quantifiers
- Character Class Matcher: Handles
- Chain of Responsibility Pattern: Matchers are tried in order until one can handle the pattern
- Comprehensive Pattern Support: From simple literal matches to complex regex sequences
- Error Handling: Proper error reporting for invalid patterns and edge cases
- Performance Optimized: Efficient pattern matching with minimal backtracking
I wanted to understand how pattern matching works at a fundamental level. This project served as an excellent opportunity to:
- Demystify Regular Expressions: Understand how regex engines parse and execute patterns
- Learn Pattern Matching Algorithms: Implement recursive descent parsing and backtracking
- Practice Go Development: Work with string manipulation, interfaces, and error handling
- Explore Compiler Theory: Understand lexical analysis and pattern recognition
- Build Foundation Knowledge: Prepare for implementing more complex regex features
The grep implementation processes patterns through several key components:
RegexEngineserves as the main entry point, implementing a chain of responsibility pattern- Simple patterns are handled by specialized matchers for optimal performance
- Complex patterns fall through to the
CoreMatcherfor full regex processing
CharacterClassMatcher: Handles\dand\wcharacter classes using predicate functionsNegativeCharacterGroupMatcher: Processes[^abc]patterns by checking exclusionPositiveCharacterGroupMatcher: Handles[abc]patterns using efficient character set lookupsLiteralMatcher: Direct character matching for single-character patterns
- Recursive descent parsing for complex pattern sequences
- Anchor handling for start (
^) and end ($) of line matching - Quantifier support with proper greedy matching behavior
- Backtracking for trying alternative match positions
utils.goprovides character classification functions (isDigit,isAlphabetic,isAlphanumeric)- Character predicate system for flexible pattern matching
- Efficient character group validation
- Go 1.24 or later (as specified in
go.mod)
-
Clone the repository:
git clone https://github.com/codecrafters-io/grep-starter-go.git cd grep-starter-go -
Build the project:
go build -o mygrep app/*.go
General Usage:
./mygrep -E <pattern>The program reads from stdin and exits with code 0 if the pattern matches, 1 if no match is found, or 2 for errors.
Pattern Examples:
| Pattern Type | Example | Description | Matches |
|---|---|---|---|
| Character Classes | \d |
Any digit | 123, a1b |
\w |
Any word character | hello, test_123 |
|
| Character Groups | [abc] |
Any of a, b, or c | apple, bat |
[^xyz] |
Any char except x, y, z | apple, dog |
|
| Anchors | ^hello |
Start of line | hello world |
world$ |
End of line | hello world |
|
| Quantifiers | ca+t |
One or more 'a' | cat, caaat |
ca?t |
Zero or one 'a' | ct, cat |
|
| Wildcard | c.t |
Any character between c and t | cat, cut, c1t |
-
Match digits:
echo "hello123" | ./mygrep -E "\d" # Exit code: 0 (match found)
-
Match character group:
echo "apple" | ./mygrep -E "[aeiou]" # Exit code: 0 (vowel found)
-
Match with quantifiers:
echo "cat" | ./mygrep -E "ca+t" # Exit code: 0 (one or more 'a' found)
-
Anchor matching:
echo "hello world" | ./mygrep -E "^hello" # Exit code: 0 (line starts with "hello")
-
No match example:
echo "hello" | ./mygrep -E "\d" # Exit code: 1 (no digits found)
- Pattern Matching Algorithms: Understanding how regex engines parse and execute patterns
- Finite Automata: The theoretical foundation behind pattern matching state machines
- Backtracking: How regex engines handle multiple possible matches and optimization strategies
- Greedy vs Lazy Matching: The behavior of quantifiers and their performance implications
- Anchoring: How start and end anchors affect pattern matching strategy
- Byte vs Rune Handling: Understanding the difference between byte-level and Unicode-aware string processing
- Efficient String Operations: Using byte slices for performance-critical pattern matching
- Character Classification: Implementing efficient character set validation
- Pattern Compilation: Converting string patterns into executable matching logic
- Chain of Responsibility: Implementing a clean pattern for trying multiple matching strategies
- Strategy Pattern: Different matchers for different pattern types
- Interface Design: Creating clean abstractions for pattern matching operations
- Error Handling: Proper error propagation in complex pattern matching scenarios
- Comprehensive Test Coverage: Testing all pattern types and edge cases
- Table-Driven Tests: Using Go's idiomatic approach for testing multiple scenarios
- Integration Testing: Full end-to-end testing of the grep application
- Error Case Testing: Ensuring robust handling of invalid patterns and edge cases
The TDD approach was particularly valuable when implementing the quantifiers and anchors, as it helped ensure that complex pattern sequences worked correctly across different text inputs and pattern combinations.
- Backtracking Complexity: Implementing efficient backtracking without infinite loops in complex patterns
- Character Encoding: Handling both ASCII and Unicode characters properly in pattern matching
- Performance Optimization: Balancing feature completeness with matching speed
- Pattern Parsing: Building a robust parser that handles nested patterns and escape sequences
- Edge Case Handling: Ensuring correct behavior for empty strings, boundary conditions, and malformed patterns
The next major milestone is implementing alternation (|), which will involve:
- Alternative Parsing: Implementing proper parsing of
(pattern1|pattern2)syntax - Capture Groups: Supporting parenthesized subpatterns for grouping
- Backtracking Optimization: Enhancing the backtracking algorithm for alternation
- Advanced Quantifiers: Adding support for
{n,m}quantifier syntax - Extended Character Classes: Supporting
\s,\S,\b, and other common classes
Additional planned features include:
- Case-Insensitive Matching: Adding
-iflag support - Multi-line Mode: Supporting
^and$in multi-line text - Lookahead/Lookbehind: Advanced assertion operators
- Unicode Support: Full Unicode character class support
- The Codecrafters "Build your own grep" challenge provided the structure and test cases
- This project implements regular expression matching as described in "The Practice of Programming" by Kernighan and Pike