Skip to content

Latest commit

 

History

History
267 lines (195 loc) · 11.7 KB

README.detail.md

File metadata and controls

267 lines (195 loc) · 11.7 KB

Build Status

XLang

Copyright (C) 2011-2013 onlyuser mailto:onlyuser@gmail.com

About

picture alt

XLang is a parser framework for language modeling. It is also a demonstration of various ways one could build a yacc-based parser. Seven variations of parser are presented, varying in features.

The first variation (0_XLang_full), is a full-featured parser. With each successive variation, non-critical features are stripped until there is nothing left but a flex-less parser, the last variation (7_XLang_no-flex).

full, no-strings, no-comments, no-locations, no-reentrant, stdio, file, no-flex

The order in which features are removed is completely arbitrary. There's no good reason one cannot cannot have both "comments" and "file input" enabled, or both "reentrant" and "hand-rolled lexer" enabled.

The code is organized in a way such that anyone unfamiliar with Lex-Yacc can quickly isolate a variation that is most relevent, and most readily customizable to his/her specific requirements.

Variation vs. Feature Table

variation lexer input reentrant locations comments strings
full flex buffer yes yes yes yes
no-strings flex buffer yes yes yes no
no-comments flex buffer yes yes no no
no-locations flex buffer yes no no no
no-reentrant flex buffer no no no no
stdio flex stdio no no no no
file flex file no no no no
file-yyin flex file no no no no
no-flex hand-rolled file no no no no
HINT: The "stdio" variation is easiest to implement -- The flex lexer
      accepts standard stream input by default. It takes extra effort to
      hack flex into using "buffer input" or "file input".

Feature Detail

lexer:

    flex: Use flex to generate a lexer.
          This option is preferred when regex significantly simplifies the
          lexer.

    hand-rolled: Use a custom lexer.
                 This option is preferred when speed is critical -- Flex is
                 slow. This option is also preferred when unicode support
                 is needed -- Flex has no unicode support as of the time of
                 this writing.

input:

    buffer: Use in-memory buffer as input.
            This option is preferred when speed is critical -- Standard
            streams are slow. This option is also preferred when the input
            string already resides in memory and does not require passing
            through a standard stream in order to be used.

    stdio: Use standard input.
           This is the default behavior.

reentrant: Tell the parser not to use global variables.
           This option is required to make the parser multi-thread
           compliant.

locations: Tell the parser to provide a location variable "$@" for each BNF
           production. This option demonstrates the standard way to track
           syntax error line/column number using Lex-Yacc.

comments: Tell the lexer to conditionally enable/disable certain lexer
          rules for stateful scanning. This option demonstrates the
          standard way to ignore line/block comments.

strings: Tell the lexer to conditionally enable/disable certain lexer rules
         for stateful scanning. This option demonstrates *one* of the many
         ways to handle strings, characters, and escaped characters.

Requirements

Unix tools and 3rd party components (accessible from $PATH):

gcc (with -std=c++0x support), flex, bison, valgrind, cppcheck, doxygen, graphviz, ticpp

Environment variables:

  • $INCLUDE_PATH_EXTERN -- where "ticpp/ticpp.h" resides
  • $LIB_PATH_EXTERN -- where "libticppd.a" resides

Make Targets

target action
all make binaries
test all + run tests
pure test + use valgrind to check for memory leaks
dot test + generate .png graph for tests
lint use cppcheck to perform static analysis on .cpp files
doc use doxygen to generate documentation
xml test + generate .xml for tests
import test + use ticpp to serialize-to/deserialize-from xml
clean remove all intermediate files

FAQ

  1. What is XLang ?

    XLang is a starting point for people looking to construct their own language using Lex-Yacc. XLang is thoroughly tested and comes with its own test suite.

  2. What isn't XLang ?

    XLang isn't a unified parser front-end for every language under the sun. It makes some assumptions about the target language, meaning, literal values must be C-like. Otherwise, it places no restrictions on the grammar.

  3. How should one use XLang ?

    I recommend hacking the "stdio" example to your liking, but any of the other variations equally suitable as a base for beginners.

  4. What licenses apply when using XLang ?

    GPL3.

  5. What's the motivation behind writing XLang ?

    Lex-Yacc is a pragmatic solution that will get the job done, but it doesn't do so right out of the box. It needs some infrastructure to get most sizeable projects going.

    I built XLang out of my own need for a parser SDK that takes care of the tedious tree building business so I could focus more on the fun stuff -- AST visitation.

    But one may ask, isn't this exactly what production rule actions are for ? No. In order to perform multi-pass visitation to an AST, using Yacc, one would have to re-parse the input everytime, and that's wasteful. XLang builds an AST for you so you can visit it as many times as needed.

    To make a long answer short, Tom Niemann's Lex-Yacc tutorial is my primary source of inspiration for building XLang. Please visit his site to learn more.

  6. How does XLang differ from the example in Tom's tutorial ?

    a) XLang uses C++, with unions used only where necessary. Tom's tutorial uses a C-style polymorphism technique where the last member of a class is a union of several types. I prefer C++ polymorphism.

    b) XLang has its own memory management system based on allocators. Tom's tutorial omits AST node freeing entirely.

    c) XLang offers several flavors of Lex-Yacc whereas Tom's tutorial offers just one "stdio" parser.

    d) XLang's ASTs are "flattened", meaning lists are interpreted as lists instead of deep-recursing binary trees. This tree organization lends itself better to AST visitation (less likely to stack-overflow).

    (1+2+3)        (1+2+3)
    non-flattened: flattened:
    
        +             +
       / \          / | \
      +   3        1  2  3
     / \
    1   2
    

    e) XLang's visitor borrows concepts from Anand Shankar Krishnamoorthi's cooperative visitor and Jeremy Blosser's Java Tip 98 to enable visit double-dispatch without the need to implement the "accept" method for every single AST node class.

    f) XLang uses namespaces and isolates all the AST building code into a static library.

  7. Why allocators ?

    I've incorporated a simple memory allocator in this project because I want to keep my AST node classes as clean as possible, without destructors that delete child nodes. But the way the allocator was written really doesn't improve the program's memory usage patterns. With some effort, I suppose one could upgrade that into a real allocator with true memory pooling.

  8. Why c++0x ?

    Lambda functions are one of the few c++0x feature used here, and only because they solve the problem elegantly. The goal is to deliberately avoid implementing clean-up code within the AST node destructors. This goes along a grander design decision to keep the AST node class clear of method clutter by moving all the predicate code into visitors. Lambdas help provide ad hoc anonymous callbacks for the allocator, whose job is to perform clean-up originally intended for the destructor.

References

"Flex your lexical analysis muscles"
http://www.codeguru.com/cpp/cpp/algorithms/strings/article.php/c12717/Flex-Your-Lexical-Analysis-Muscles.htm
http://www.developer.com/net/cplus/article.php/3636641/Flex-Your-Lexical-Analysis-Muscles.htm
<dt>"Classic Parsing with Flex-Bison"</dt>
<dd>http://www.codeguru.com/csharp/.net/net_general/patterns/article.php/c12805/Classic-Parsing-with-Flex-and-Bison.htm</dd>
<dd>http://www.developer.com/net/cplus/article.php/3642516/Classic-Parsing-with-Flex-and-Bison.htm</dd>

<dt>"Tom Niemann Flex-Bison AST examples"</dt>
<dd>http://epaperpress.com/lexandyacc/</dd>

<dt>"UsualCoding.eu Reentrant Flex-Bison example"</dt>
<dd>http://www.usualcoding.eu/post/2007/09/03/Building-a-reentrant-parser-in-C-with-Flex/Bison</dd>
<dd>(code example needs "void yyerror() {}" inserted in declaration section of parser.y)</dd>
<dd>(page seems down as of May 19, 2012)</dd>

<dt>"Make a reentrant parser with Flex and Bison"</dt>
<dd>http://www.lemoda.net/c/reentrant-parser/index.html</dd>

<dt>"Better error handling using Flex and Bison"</dt>
<dd>http://www.ibm.com/developerworks/library/l-flexbison.html</dd>

<dt>"ProgTools Flex-Bison AST tutorial (with classes)"</dt>
<dd>http://www.progtools.org/compilers/tutorials/cxx_and_bison/cxx_and_bison.html</dd>

<dt>"O-Reilly Lex-Yacc book examples"</dt>
<dd>http://examples.oreilly.com/lex/</dd>

<dt>"Cooperative VisitorIFace: A Template Technique for VisitorIFace Creation"</dt>
<dd>http://www.artima.com/cppsource/cooperative_visitor.html</dd>

<dt>"Java Tip 98: Reflect on the VisitorIFace design pattern"</dt>
<dd>http://www.javaworld.com/javaworld/javatips/jw-javatip98.html</dd>

Additional Reading

Keywords

Lex, Yacc, Flex, Bison, Parser, Reentrant