
Welcome to the eacc wiki!

This document contains enough information to start using eacc as a parsing tool.

Lexer

The lexer design is quite succinct. You define token extractors using classes. The order in which the extractors are defined gives fine control over what is matched.

Token class

Token instances are what token extractors produce when their regexes match against the document string.

class Token:
    def __init__(self, data, type=None, value=None, 
        start=None, end=None):

The Token instance contains the attributes below.

data

The raw string that was extracted for the token.

type

The token type that was specified in the token extractor.

value

The cast value of data, when a cast function is specified.

start

The start of the token in the document string.

end

The end of the token in the document string.

Token Types

A token has a type associated with it. When a chunk of text matches a token regex, a Token instance is created and the type is passed to it as an argument.

The example below shows how you can define your own tokens.

from eacc.token import TokType

class Pipe(TokType):
    pass

Token Extractors

There are two kinds of extractors. The LexTok class is used to extract tokens that do not demand lexical lookahead. Most scenarios will need only the LexTok class to tokenize a document.

class LexTok(XNode):
    def __init__(self, regstr, type=TokVal, cast=None, discard=False, wrapper=None):

regstr

The regex that corresponds to the token.

type

The token type.

cast

A function that receives the extracted string and can be used to cast the token value, such as int or float.

discard

Defines whether the token will be considered or discarded.

wrapper

A function that receives the regex match object and can be used to gain finer control over what the Token instance will hold.
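
For instance, the cast argument can be used to convert matched digits into a Python integer. A minimal sketch based on the attributes above:

from eacc.lexer import Lexer, LexTok, XSpec
from eacc.token import Num

class NumTokens(XSpec):
    # The matched string is passed through int, so token.value
    # holds an integer while token.data keeps the raw string.
    t_num = LexTok(r'[0-9]+', type=Num, cast=int)
    root = [t_num]

lexer = Lexer(NumTokens)
for token in lexer.feed('42'):
    print(token.data, token.value)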

There are circumstances in which it is necessary to extract tokens based on the tokens that follow them, for example when matching keywords in a programming language.

The classes below are used to handle these situations.

class SeqTok(LexTok):
    def __init__(self, regstr, type=TokVal, cast=None, discard=False):
class LexSeq(XNode):
    def __init__(self, *args, wrapper=None):

The SeqTok class takes the same arguments as LexTok. The code below exemplifies how it is used.

Basic Example

This example tokenizes keywords correctly. You define your extractors, then set your document's token specification.

from eacc.lexer import XSpec, Lexer, SeqTok, LexTok, LexSeq
from eacc.token import Keyword, Identifier, RP, LP, Colon, Blank

class KeywordTokens(XSpec):
    t_if = LexSeq(SeqTok(r'if', type=Keyword),
                  SeqTok(r'\s+', type=Blank))

    t_blank  = LexTok(r' +', type=Blank)
    t_lparen = LexTok(r'\(', type=LP)
    t_rparen = LexTok(r'\)', type=RP)
    t_colon  = LexTok(r'\:', type=Colon)

    # Match identifier only if it is not an if.
    t_identifier = LexTok(r'[a-zA-Z0-9]+', type=Identifier)

    root = [t_if, t_blank, t_lparen,
            t_rparen, t_colon, t_identifier]

lex = Lexer(KeywordTokens)
data = 'if ifnum: foobar()'
tokens = lex.feed(data)
print('Consumed:', list(tokens))

Would output:

[tau@archlinux demo]$ python keywords.py 
Consumed: [Keyword('if'), Blank(' '), Identifier('ifnum'), 
Colon(':'), Blank(' '), Identifier('foobar'), LP('('), RP(')')]

That basically means 'if' is matched only when it is followed by a string made of blank characters. A document's token specification is defined as above: you inherit your spec class from XSpec, define your extractors, and then make a list of the token extractors.

The class attribute root holds all the extractors; it is processed by the lexer to produce the tokens.

Lexical Errors

The Lexer class accepts a no_errors argument that allows the lexer to tokenize documents that aren't well formed in terms of token types.

When no_errors is False, a LexError is raised if the document contains text that wasn't consumed by any of the token extractors.
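
A minimal sketch of both behaviors; the keyword usage of no_errors and the import location of LexError are assumptions, so check your installed version:

from eacc.lexer import Lexer, LexTok, XSpec, LexError
from eacc.token import Num, Blank

class NumTokens(XSpec):
    t_num   = LexTok(r'[0-9]+', Num)
    t_blank = LexTok(r' +', Blank, discard=True)
    root = [t_num, t_blank]

# With no_errors left as False, unmatched text raises LexError.
strict = Lexer(NumTokens)
try:
    list(strict.feed('12 abc 34'))
except LexError:
    print('The document contains text that no extractor consumed.')

# With no_errors=True the lexer tolerates the unmatched text instead
# (assumed behavior based on the description above).
tolerant = Lexer(NumTokens, no_errors=True)
print(list(tolerant.feed('12 abc 34')))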

Positional Tracking

The Token class has the attributes start and end, which give the token's position in the document string.

from eacc.lexer import Lexer, LexTok, XSpec
from eacc.token import Char

class CharTokens(XSpec):
    t_char  = LexTok(r'.', Char)
    root = [t_char]

data = 'abc'
lexer  = Lexer(CharTokens)
tokens = lexer.feed(data)

for ind in tokens:
    print('%s\nStart:%s\nEnd:%s\n' % (ind, ind.start, ind.end))

Would output:

[tau@archlinux demo]$ python chars.py 
Char('a')
Start:0
End:1

Char('b')
Start:1
End:2

Char('c')
Start:2
End:3

Discarded Tokens

There are situations in which some tokens aren't necessary and should be discarded. The token extractor constructors receive a discard argument to keep a given token from being considered.

You would merely need something like the line below to discard whitespace strings.

    t_blank  = LexTok(r' +', Blank, discard=True)
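
A minimal, self-contained sketch of the same idea; the discarded Blank tokens never show up in the output:

from eacc.lexer import Lexer, LexTok, XSpec
from eacc.token import Num, Blank

class NumTokens(XSpec):
    t_num   = LexTok(r'[0-9]+', Num)
    # Whitespace is matched but never emitted as a token.
    t_blank = LexTok(r' +', Blank, discard=True)
    root = [t_num, t_blank]

lexer  = Lexer(NumTokens)
tokens = lexer.feed('1 2 3')
print('Consumed:', list(tokens))  # Only the Num tokens appear.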

Parser

The parser is based on the concept of exhaustion of patterns. Token patterns are triggered and associated with a token type, then rematched. The process goes on until the entire document is consumed.

Token Patterns

A token pattern is defined using a Rule class instance. The example below shows a token pattern.

    r_plus  = Rule(Num, Plus, Num, type=Num, up=(o_mul, o_div))

That token pattern has three main parts: the sequence of tokens for the pattern, the type it will be associated with when rematched, and a sequence of precedence rules.

The up argument to the Rule class constructor defines a sequence of precedence rules that must not match ahead in order for the current rule to be matched.

Rule Handles

When a given rule is triggered, it is possible to map a handle to receive the rule's matched tokens. This is the basic feature used to process documents and produce a result.

An example is a basic expression calculator: handles are associated with mathematical structures, and when those structures are triggered the handles are executed, producing a number as a result.

Basic Example

The example below implements a simple mathematical expression calculator.

It first imports existing eacc tokens and the basic classes needed to define both the lexer and grammar specifications.

The code below specifies the set of possible tokens for the document, which in this context is a mathematical expression. It also defines the grammar used to process a given expression and obtain a result. The calculator is built on a simple assertion: when a basic mathematical operation is performed on two numbers, it results in a number.

When a Rule is matched it generates a ParseTree instance that contains what was matched. The tokens are passed to the Rule instance's handle, and the handle processes them into a result.

The handle's return value is the result of processing the tokens. The rule's ParseTree instance holds a type that is matched again against the existing rules. The process goes on until the document is consumed entirely.

The lexer adds Sof and Eof tokens at the start and end of the document; these are used as flags by the parser to know when it should stop processing. You define a r_done rule containing these tokens to add a stop point for the parser.

You should know what kind of token sequences your rules will generate when matched against all possible combinations in your document.

from eacc.eacc import Rule, Grammar, Eacc
from eacc.lexer import Lexer, LexTok, XSpec
from eacc.token import Plus, Minus, LP, RP, Mul, Div, Num, Blank, Sof, Eof

class CalcTokens(XSpec):
    # Used to extract the tokens.
    t_plus   = LexTok(r'\+', Plus)
    t_minus  = LexTok(r'\-', Minus)

    t_lparen = LexTok(r'\(', LP)
    t_rparen = LexTok(r'\)', RP)
    t_mul    = LexTok(r'\*', Mul)
    t_div    = LexTok(r'\/', Div)

    t_num    = LexTok(r'[0-9]+', Num, float)
    t_blank  = LexTok(r' +', Blank, discard=True)

    root = [t_plus, t_minus, t_lparen, t_num,
            t_blank, t_rparen, t_mul, t_div]

class CalcGrammar(Grammar):
    # The token patterns; when matched they become
    # ParseTree objects which have a type.
    r_paren = Rule(LP, Num, RP, type=Num)
    r_div   = Rule(Num, Div, Num, type=Num)
    r_mul   = Rule(Num, Mul, Num, type=Num)
    o_div   = Rule(Div)
    o_mul   = Rule(Mul)

    r_plus  = Rule(Num, Plus, Num, type=Num, up=(o_mul, o_div))
    r_minus = Rule(Num, Minus, Num, type=Num, up=(o_mul, o_div))

    # The final structure that is consumed. Once it is
    # consumed then the process stops.
    r_done  = Rule(Sof, Num, Eof)

    root = [r_paren, r_plus, r_minus, r_mul, r_div, r_done]

# The handles mapped to the patterns to compute the expression result.
def plus(expr, sign, term):
    return expr.val() + term.val()

def minus(expr, sign, term):
    return expr.val() - term.val()

def div(term, sign, factor):
    return term.val()/factor.val()

def mul(term, sign, factor):
    return term.val() * factor.val()

def paren(left, expression, right):
    return expression.val()

def done(sof, num, eof):
    print('Result:', num.val())
    return num.val()

data = '2 * 5 + 10 -(2 * 3 - 10 )+ 30/(1-3+ 4* 10 + (11/1))' 

lexer  = Lexer(CalcTokens)
tokens = lexer.feed(data)
eacc   = Eacc(CalcGrammar)

# Link the handles to the patterns.
eacc.add_handle(CalcGrammar.r_plus, plus)
eacc.add_handle(CalcGrammar.r_minus, minus)
eacc.add_handle(CalcGrammar.r_div, div)
eacc.add_handle(CalcGrammar.r_mul, mul)
eacc.add_handle(CalcGrammar.r_paren, paren)
eacc.add_handle(CalcGrammar.r_done, done)

ptree = eacc.build(tokens)
ptree = list(ptree)

The Rule up argument is used to specify operator precedence, as shown above.

Character Literals

The TokVal class is used to specify rules that match token values directly.

from eacc.eacc import Eacc, Rule, Grammar
from eacc.lexer import XSpec, Lexer, LexTok, TokVal
from eacc.token import TokType, Eof, Sof, Num, Plus, Blank

class Type0(TokType):
    pass

class Type1(TokType):
    pass

class NumTokens(XSpec):
    t_num = LexTok(r'[1-9]+', type=Num)
    t_plus = LexTok(r'\+', type=Plus)

    t_blank = LexTok(r' +', type=Blank, discard=True)

    root = [t_num, t_plus, t_blank]

class NumGrammar(Grammar):
    r_type0 = Rule(TokVal('1'), TokVal('+'), TokVal('2'), type=Type0)
    r_type1 = Rule(Type0, TokVal('+'), TokVal('2'), type=Type1)
    r_end   = Rule(Sof, Type1, Eof)

    root = [r_type0, r_type1, r_end]

data = '1 + 2 + 2'
lexer = Lexer(NumTokens)
eacc  = Eacc(NumGrammar)
tokens = lexer.feed(data)
ptree  = eacc.build(tokens)
print('Consumed:', list(ptree))

Would output:

[tau@archlinux demo]$ python num_structs.py 
Consumed: [[Num('1'), Plus('+'), Num('2')], [[Num('1'), Plus('+'), Num('2')], Plus('+'), Num('2')], [Sof(''), [[Num('1'), Plus('+'), Num('2')], Plus('+'), Num('2')], Eof('')]]

Ambiguous Grammars

Eacc can handle ambiguous grammars in a handy manner. The example below shows a basic calculator. The approach is quite simple: you first define an Expr type, then have your rules evaluate to that type. The ambiguity is resolved using the Rule precedence mechanism.

from eacc.eacc import Rule, Grammar, Eacc
from eacc.lexer import Lexer, LexTok, XSpec
from eacc.token import TokType, Plus, Minus, LP, RP, Mul, Div, Num, Blank, Sof, Eof

class Expr(TokType):
    pass

class CalcTokens(XSpec):
    # Token extractors. When a regex matches, a Token
    # instance with the specified type is created.
    t_plus  = LexTok(r'\+', Plus)
    t_minus = LexTok(r'\-', Minus)

    t_lparen = LexTok(r'\(', LP)
    t_rparen = LexTok(r'\)', RP)
    t_mul    = LexTok(r'\*', Mul)
    t_div    = LexTok(r'\/', Div)

    # Automatically convert the token value to a float.
    t_num    = LexTok(r'[0-9]+', Num, float)

    # White spaces are discarded.
    t_blank  = LexTok(r' +', Blank, discard=True)

    root = [t_plus, t_minus, t_lparen, t_num,
            t_blank, t_rparen, t_mul, t_div]

class CalcGrammar(Grammar):
    r_paren = Rule(LP, Expr, RP, type=Expr)
    r_div   = Rule(Expr, Div, Expr, type=Expr)
    r_mul   = Rule(Expr, Mul, Expr, type=Expr)

    # The lookahead rules to fix ambiguity.
    o_div   = Rule(Div)
    o_mul   = Rule(Mul)
    r_plus  = Rule(Expr, Plus, Expr, type=Expr, up=(o_mul, o_div))
    r_minus = Rule(Expr, Minus, Expr, type=Expr, up=(o_mul, o_div))
    r_num   = Rule(Num, type=Expr)
    r_done  = Rule(Sof, Expr, Eof)

    root = [r_paren, r_plus, r_minus,
            r_mul, r_div, r_num, r_done]

def plus(expr, sign, term):
    return expr.val() + term.val()

def minus(expr, sign, term):
    return expr.val() - term.val()

def div(term, sign, factor):
    return term.val()/factor.val()

def mul(term, sign, factor):
    return term.val() * factor.val()

def paren(left, expression, right):
    return expression.val()

def num(num):
    """
    The result will be a ParseTree instance with type
    Expr and value equal to the num token's value.
    """
    return num.val()

def done(sof, expression, eof):
    """
    When this pattern is triggered, the expression is
    fully evaluated.
    """
    print('Result:', expression.val())
    return expression.val()

if __name__ == '__main__':
    data = '2 * 5 + 10 -(2 * 3 - 10 )+ 30/(1-3+ 4* 10 + (11/1))'
    lexer  = Lexer(CalcTokens)
    tokens = lexer.feed(data)
    eacc   = Eacc(CalcGrammar)
    
    # Map the patterns to handles to evaluate the expression.
    eacc.add_handle(CalcGrammar.r_plus, plus)
    eacc.add_handle(CalcGrammar.r_minus, minus)
    eacc.add_handle(CalcGrammar.r_div, div)
    eacc.add_handle(CalcGrammar.r_mul, mul)
    eacc.add_handle(CalcGrammar.r_paren, paren)
    eacc.add_handle(CalcGrammar.r_done, done)
    eacc.add_handle(CalcGrammar.r_num, num)
    
    # Finally build the AST.
    ptree = eacc.build(tokens)
    ptree = list(ptree)

Except Operator

The Except operator is used to prevent tokens of certain types from being consumed.

from eacc.eacc import Rule, Grammar, Eacc, Except
from eacc.lexer import Lexer, LexTok, XSpec
from eacc.token import Sof, Eof, One, Two, Three, Four, Five, Blank

class ExprTokens(XSpec):
    t_one   = LexTok(r'1', One)
    t_two   = LexTok(r'2', Two)

    t_three = LexTok(r'3', Three)
    t_four  = LexTok(r'4', Four)
    t_five  = LexTok(r'5', Five)
    t_blank = LexTok(r' +', Blank, discard=True)

    root = [t_one, t_two, t_three, t_four, t_five, t_blank]

class ExprGrammar(Grammar):
    r_one = Rule(One, Except(Three), One)
    r_sof = Rule(Sof)
    r_eof = Rule(Eof)
    root  = [r_one, r_sof, r_eof]

if __name__ == '__main__':
    print('Example 1')
    lexer = Lexer(ExprTokens)
    eacc  = Eacc(ExprGrammar)
    data  = '121 141'
    
    tokens = lexer.feed(data)
    ptree  = eacc.build(tokens)
    ptree  = list(ptree)
    print(ptree)
    
    print('\nExample 2')
    data   = '1 2 1 1 3 1' # Will fail: Except(Three) forbids the 3.
    tokens = lexer.feed(data)
    ptree  = eacc.build(tokens)
    ptree  = list(ptree)
    

Only Operator

The Only operator ensures that only tokens of the specified types are matched in a given position. In the example above, if Except were replaced with Only, the rule would match only tokens of type Three in that position; a sketch follows below.
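
A minimal sketch of that variation, assuming Only is importable from eacc.eacc alongside Except (the import location is an assumption):

from eacc.eacc import Rule, Grammar, Eacc, Only
from eacc.lexer import Lexer, LexTok, XSpec
from eacc.token import Sof, Eof, One, Three, Blank

class OnlyTokens(XSpec):
    t_one   = LexTok(r'1', One)
    t_three = LexTok(r'3', Three)
    t_blank = LexTok(r' +', Blank, discard=True)

    root = [t_one, t_three, t_blank]

class OnlyGrammar(Grammar):
    # Matches One <token> One only when the middle token is a Three.
    r_one = Rule(One, Only(Three), One)
    r_sof = Rule(Sof)
    r_eof = Rule(Eof)
    root  = [r_one, r_sof, r_eof]

if __name__ == '__main__':
    lexer  = Lexer(OnlyTokens)
    eacc   = Eacc(OnlyGrammar)
    tokens = lexer.feed('1 3 1')
    ptree  = list(eacc.build(tokens))
    print(ptree)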

Times Operator

The Times operator can be used to express repetition, and it can be combined with the Except and Only operators.

from eacc.lexer import Lexer, LexTok, XSpec
from eacc.eacc import Grammar, Rule, Times, Eacc
from eacc.token import Blank, Num, Sof, Eof, LP, RP

class TupleTokens(XSpec):
    r_lparen = LexTok(r'\(', LP)
    r_rparen = LexTok(r'\)', RP)

    r_num    = LexTok(r'[0-9]+', Num)
    r_blank  = LexTok(r' +', Blank, discard=True)

    root = [r_lparen, r_rparen, r_num, r_blank]

class TupleGrammar(Grammar):
    # Accumulate one or more Num tokens into a single Num.
    g_num = Times(Num, min=1, type=Num)

    # Then such a pattern is triggered in this rule.
    r_paren = Rule(LP, g_num, RP, type=Num)
    r_done  = Rule(Sof, Num, Eof)

    root = [r_paren, r_done]

def done(sof, expr, eof):
    print('Result:', expr)

if __name__ == '__main__':
    print('Example 1')
    data   = '(1 (1 1) ((((1)))))'
    
    lexer  = Lexer(TupleTokens)
    tokens = lexer.feed(data)
    eacc   = Eacc(TupleGrammar)
    
    ptree  = eacc.build(tokens)
    eacc.add_handle(TupleGrammar.r_done, done)
    
    ptree  = list(ptree)    

Errors

When the document is not entirely consumed, the parser raises an EaccError exception.
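
A minimal sketch of handling it, assuming EaccError is importable from eacc.eacc (the import location is an assumption):

from eacc.eacc import Rule, Grammar, Eacc, EaccError
from eacc.lexer import Lexer, LexTok, XSpec
from eacc.token import Sof, Eof, Num, Plus, Blank

class NumTokens(XSpec):
    t_num   = LexTok(r'[0-9]+', Num)
    t_plus  = LexTok(r'\+', Plus)
    t_blank = LexTok(r' +', Blank, discard=True)

    root = [t_num, t_plus, t_blank]

class NumGrammar(Grammar):
    r_plus = Rule(Num, Plus, Num, type=Num)
    r_done = Rule(Sof, Num, Eof)

    root = [r_plus, r_done]

if __name__ == '__main__':
    lexer  = Lexer(NumTokens)
    eacc   = Eacc(NumGrammar)

    # The trailing '+' leaves a dangling Plus token, so the document
    # cannot be reduced down to Sof Num Eof.
    tokens = lexer.feed('1 + 2 +')
    try:
        list(eacc.build(tokens))
    except EaccError:
        print('The document was not entirely consumed.')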