Skip to content

Rebuilding grep to learn how pattern matching works under the hood.

Md-Talim/codecrafters-grep-go

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Grep Implementation in Go

progress-banner

This project is a Go implementation of the grep command-line tool, built as part of the Codecrafters "Build your own grep" challenge. This implementation provides a hands-on understanding of regular expression parsing, pattern matching algorithms, and how text search utilities work under the hood.

📌 What This Project Does

This grep implementation covers the fundamental pattern matching operations that grep performs:

  • Character Class Matching: Support for \d (digits) and \w (word characters) patterns
  • Character Groups: Both positive [abc] and negative [^abc] character sets
  • Anchors: Start-of-line ^ and end-of-line $ matching
  • Quantifiers: One-or-more + and zero-or-one ? repetition operators
  • Wildcard Matching: Single character wildcard . support
  • Literal Character Matching: Direct character matching for simple patterns
  • Complex Pattern Sequences: Combining multiple pattern elements in sequence

Note: I haven't implemented alternation (|) yet, but it's coming next.

✨ Key Features

  • Modular Architecture: Clean separation of concerns with multiple matcher types:
    • Character Class Matcher: Handles \d and \w patterns
    • Character Group Matchers: Support for both positive and negative character sets
    • Literal Matcher: Direct character matching
    • Core Matcher: Complex pattern sequences with anchors and quantifiers
  • Chain of Responsibility Pattern: Matchers are tried in order until one can handle the pattern
  • Comprehensive Pattern Support: From simple literal matches to complex regex sequences
  • Error Handling: Proper error reporting for invalid patterns and edge cases
  • Performance Optimized: Efficient pattern matching with minimal backtracking

🛠️ Why I Built This Project

I wanted to understand how pattern matching works at a fundamental level. This project served as an excellent opportunity to:

  • Demystify Regular Expressions: Understand how regex engines parse and execute patterns
  • Learn Pattern Matching Algorithms: Implement recursive descent parsing and backtracking
  • Practice Go Development: Work with string manipulation, interfaces, and error handling
  • Explore Compiler Theory: Understand lexical analysis and pattern recognition
  • Build Foundation Knowledge: Prepare for implementing more complex regex features

🔍 How It Works Internally

The grep implementation processes patterns through several key components:

1. Pattern Matching Engine

  • RegexEngine serves as the main entry point, implementing a chain of responsibility pattern
  • Simple patterns are handled by specialized matchers for optimal performance
  • Complex patterns fall through to the CoreMatcher for full regex processing

2. Specialized Matchers

  • CharacterClassMatcher: Handles \d and \w character classes using predicate functions
  • NegativeCharacterGroupMatcher: Processes [^abc] patterns by checking exclusion
  • PositiveCharacterGroupMatcher: Handles [abc] patterns using efficient character set lookups
  • LiteralMatcher: Direct character matching for single-character patterns

3. Core Pattern Processing

  • Recursive descent parsing for complex pattern sequences
  • Anchor handling for start (^) and end ($) of line matching
  • Quantifier support with proper greedy matching behavior
  • Backtracking for trying alternative match positions

4. Utility Functions

  • utils.go provides character classification functions (isDigit, isAlphabetic, isAlphanumeric)
  • Character predicate system for flexible pattern matching
  • Efficient character group validation

⚙️ How to Set Up and Run

Prerequisites

  • Go 1.24 or later (as specified in go.mod)

Installation

  1. Clone the repository:

    git clone https://github.com/codecrafters-io/grep-starter-go.git
    cd grep-starter-go
  2. Build the project:

    go build -o mygrep app/*.go

Running the Grep Implementation

General Usage:

./mygrep -E <pattern>

The program reads from stdin and exits with code 0 if the pattern matches, 1 if no match is found, or 2 for errors.

Pattern Examples:

Pattern Type Example Description Matches
Character Classes \d Any digit 123, a1b
\w Any word character hello, test_123
Character Groups [abc] Any of a, b, or c apple, bat
[^xyz] Any char except x, y, z apple, dog
Anchors ^hello Start of line hello world
world$ End of line hello world
Quantifiers ca+t One or more 'a' cat, caaat
ca?t Zero or one 'a' ct, cat
Wildcard c.t Any character between c and t cat, cut, c1t

Example Usage

  1. Match digits:

    echo "hello123" | ./mygrep -E "\d"
    # Exit code: 0 (match found)
  2. Match character group:

    echo "apple" | ./mygrep -E "[aeiou]"
    # Exit code: 0 (vowel found)
  3. Match with quantifiers:

    echo "cat" | ./mygrep -E "ca+t"
    # Exit code: 0 (one or more 'a' found)
  4. Anchor matching:

    echo "hello world" | ./mygrep -E "^hello"
    # Exit code: 0 (line starts with "hello")
  5. No match example:

    echo "hello" | ./mygrep -E "\d"
    # Exit code: 1 (no digits found)

💡 What I Learned

Regular Expression Theory

  • Pattern Matching Algorithms: Understanding how regex engines parse and execute patterns
  • Finite Automata: The theoretical foundation behind pattern matching state machines
  • Backtracking: How regex engines handle multiple possible matches and optimization strategies
  • Greedy vs Lazy Matching: The behavior of quantifiers and their performance implications
  • Anchoring: How start and end anchors affect pattern matching strategy

String Processing in Go

  • Byte vs Rune Handling: Understanding the difference between byte-level and Unicode-aware string processing
  • Efficient String Operations: Using byte slices for performance-critical pattern matching
  • Character Classification: Implementing efficient character set validation
  • Pattern Compilation: Converting string patterns into executable matching logic

Software Architecture Patterns

  • Chain of Responsibility: Implementing a clean pattern for trying multiple matching strategies
  • Strategy Pattern: Different matchers for different pattern types
  • Interface Design: Creating clean abstractions for pattern matching operations
  • Error Handling: Proper error propagation in complex pattern matching scenarios

Test-Driven Development (TDD) with Go

  • Comprehensive Test Coverage: Testing all pattern types and edge cases
  • Table-Driven Tests: Using Go's idiomatic approach for testing multiple scenarios
  • Integration Testing: Full end-to-end testing of the grep application
  • Error Case Testing: Ensuring robust handling of invalid patterns and edge cases

The TDD approach was particularly valuable when implementing the quantifiers and anchors, as it helped ensure that complex pattern sequences worked correctly across different text inputs and pattern combinations.

🎯 Challenges & Solutions

  • Backtracking Complexity: Implementing efficient backtracking without infinite loops in complex patterns
  • Character Encoding: Handling both ASCII and Unicode characters properly in pattern matching
  • Performance Optimization: Balancing feature completeness with matching speed
  • Pattern Parsing: Building a robust parser that handles nested patterns and escape sequences
  • Edge Case Handling: Ensuring correct behavior for empty strings, boundary conditions, and malformed patterns

🚀 Future Enhancements

The next major milestone is implementing alternation (|), which will involve:

  • Alternative Parsing: Implementing proper parsing of (pattern1|pattern2) syntax
  • Capture Groups: Supporting parenthesized subpatterns for grouping
  • Backtracking Optimization: Enhancing the backtracking algorithm for alternation
  • Advanced Quantifiers: Adding support for {n,m} quantifier syntax
  • Extended Character Classes: Supporting \s, \S, \b, and other common classes

Additional planned features include:

  • Case-Insensitive Matching: Adding -i flag support
  • Multi-line Mode: Supporting ^ and $ in multi-line text
  • Lookahead/Lookbehind: Advanced assertion operators
  • Unicode Support: Full Unicode character class support

🙏 Acknowledgments

About

Rebuilding grep to learn how pattern matching works under the hood.

Topics

Resources

Stars

Watchers

Forks

Contributors 2

  •  
  •