
4. Building an extended syntax

wyfunique edited this page Apr 22, 2022 · 4 revisions

Overview

Syntax

In DBSim, implementations are organized into units called syntaxes. A syntax here is a broader concept than query language syntax: it is a scope, or namespace, that manages a collection of implementations ranging from query parsing to query execution functions. DBSim's native implementations form the standard syntax. Extensions are organized the same way: an extended syntax is a suite of implementations for all the extended items scoped in that syntax.

An extended syntax is incremental: it normally includes only the implementations that extend the standard syntax, such as new operators or overrides of existing functions, rather than re-implementing everything in the standard syntax.

Registry-based extension mechanism

DBSim uses a registry-based mechanism to manage all extended syntaxes.

The mechanism is centered on a registry that acts as middleware between the kernel and the extensions. Each row in the registry holds the identity of an extended syntax and a registry entry, which maintains the interfaces through which the syntax mounts its implementations. Specifically, a registry entry provides entry points for implementations with different purposes, including adding query keywords, adding data types and operators, inserting optimization rules, and customizing the translation from logical to physical plans. To put an extended syntax into effect, users only need to register its implementations in a registry entry and append that entry to the registry as a new row. When DBSim starts, the kernel reads the registry and adds the extended implementations to the standard syntax.
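
As a rough illustration of this flow, the sketch below models a registry as an ordered mapping whose rows carry entry points, plus a start-up routine that walks the registry and invokes them. Every name in it (ToyEntry, toy_registry, standard_keywords, start_kernel) is a hypothetical stand-in chosen for this example and is not DBSim's actual code.

```python
from collections import OrderedDict
from typing import Callable, Dict, List

# Hypothetical stand-in for the standard syntax: a table of known keywords.
standard_keywords: Dict[str, str] = {"select": "SELECT", "from": "FROM", "where": "WHERE"}


class ToyEntry:
    """Hypothetical registry row: the entry points of one extended syntax."""

    def __init__(self, entry_points: List[Callable[[], None]]) -> None:
        self.entry_points = entry_points


def add_simselect_keyword() -> None:
    # An extension mounts its implementation by extending the standard table.
    standard_keywords["simselect"] = "SELECT"


# The registry maps a syntax name to its entry and preserves insertion order.
toy_registry: "OrderedDict[str, ToyEntry]" = OrderedDict()
toy_registry["simselect"] = ToyEntry(entry_points=[add_simselect_keyword])


def start_kernel() -> List[str]:
    """Read the registry, mount every extension, and return the mounted names."""
    mounted = []
    for name, entry in toy_registry.items():
        for mount in entry.entry_points:
            mount()
        mounted.append(name)
    return mounted


mounted_names = start_kernel()
print(mounted_names, sorted(standard_keywords))
```

The point of the indirection is that the kernel never imports an extension directly; it only iterates over whatever rows the registry contains.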

Levels of extension interfaces

Overall, DBSim provides two major levels of extension interfaces: the SQL-clause level and the expression level.

  1. SQL-clause level extension:

    (1) Generally, a SQL query may include three clauses: SELECT, FROM, and WHERE. DBSim allows users to extend them individually, for example by replacing the keyword 'select' with another keyword while keeping 'from' and 'where' unchanged. The interfaces of this level are placed in the RegEntry of the registry, which is introduced in the "Registry" section.

    (2) To facilitate extensions at this level, DBSim defines the class SQLClause, which represents the three clauses as SQLClause.SELECT, SQLClause.FROM, and SQLClause.WHERE.

    (3) Note: a FROM clause may include nested queries, which makes the extension logic more complex and bug-prone, so DBSim currently does not support extending SQLClause.FROM.

  2. Expression level extension:

    (1) This level covers extensions of the elements inside expressions, which are finer-grained than a SQL clause, such as extended operators and extended data types.

    (2) The extension interfaces of this level are provided by ExtendedSyntax, which is introduced in the "ExtendedSyntax" section.

For more information about the architecture and design principles of DBSim, please read our paper "Extensible Database Simulator for Fast Prototyping In-Database Algorithms".

Major modules for extension

To implement an extended syntax, you need to:

  1. create a subclass of ExtendedSyntax (defined in dbsim/extensions/extended_syntax/extended_syntax.py) and implement the extension interfaces, then
  2. append a new registry entry for the subclass to the registry (defined in dbsim/extensions/extended_syntax/registry.py).

ExtendedSyntax

ExtendedSyntax provides several extension interfaces in the form of class attributes, which are defined as follows:

class ExtendedSyntax(object):
  """The abstract class for extended syntax"""
  __slots__ = {
    # plugin syntax symbols and data types
    '_extended_symbols_': '-> str',
    '_extended_clause_keywords_': '-> Dict[SQLClause, str]',  
    '_extended_data_types_': '-> Dict[Type[Const], str]', 
    '_extended_data_types_converter_': '-> Callable[[pandas.Series], [pandas.Series, FieldType]]', 
    # plugin predicate parsers.
    # the second element of this tuple indicates whether to block errors raised by these parsers.
    # True -> blocking all errors from these parsers; False -> reporting all errors from them.
    # Note that, as an informational exception, ParsingFailure will always be blocked in both cases. 
    '_extended_predicate_parsers_': '-> typing.Tuple[Dict[PredExprLevel, Callable], bool]', 
    # plugin predicate operator executors, used by query compilers
    '_extended_predicate_op_executors_': '-> Dict[Type[SimpleOp], Callable[ [SimpleOp,Schema,DataSet], Callable[[row,ctx],Any] ]]',
    # plugin relational operator schema resolvers (used by schema_interpreter) and executors (used by compilers)
    '_extended_relation_op_schema_': '-> Dict[Type[SuperRelationalOp], Callable[[SuperRelationalOp, DataSet], Schema]]', 
    '_extended_relation_op_executors_': '-> Dict[Type[SuperRelationalOp], Callable[ [DataSet,SuperRelationalOp], Callable[[ctx],Any] ]]', 
  }

Here we briefly introduce the purpose of each attribute; later we explain them in more detail using concrete examples in the "Extension examples" section.

  1. _extended_symbols_: to add non-SQL symbols into query parser
  2. _extended_clause_keywords_: to add non-SQL keywords into corresponding SQL clauses (i.e., SELECT, FROM, and WHERE clause)
  3. _extended_data_types_: to add new data types
  4. _extended_data_types_converter_: to add converter functions that convert pandas data types into DBSim internal data types. The converters are used only when loading data from csv/tsv files.
  5. _extended_predicate_parsers_: parsing functions for predicate operators, i.e., operators that appear only in predicates/expressions (such as And, Or, GtOp, and LtOp) and never modify the schema
  6. _extended_predicate_op_executors_: executing functions for predicate operators, i.e., the physical implementations of the logical operators
  7. _extended_relation_op_schema_: schema-resolving functions for relational operators, i.e., all other operators, which are used outside predicates (such as SelectionOp, JoinOp, and ProjectionOp) and may output data whose schema differs from the input relation schema
  8. _extended_relation_op_executors_: executing functions, i.e., physical implementations, for relational operators
  9. Note:
    • There is no schema-resolving extension interface for predicate operators, because such operators never change the input relation schema.
    • There is no parser extension interface for relational operators, because extended parsers for such operators should be implemented at the SQL-clause level, i.e., in RegEntry.clause_parsers. See the following "Registry" section for details.
    • These class attributes are optional rather than required: a custom syntax may implement only some of them.
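
To make the optionality concrete, here is a minimal sketch of a syntax that fills out only two of the eight attributes. ToyExtendedSyntax and ToySimSyntax are hypothetical stand-ins, the attribute values are made-up placeholders (a plain string stands in for an SQLClause key), and the rule that an attribute counts as implemented when it is set on the class is an assumption of this example, not a statement about how DBSim checks it.

```python
from typing import List

# Attribute names as listed in the text above.
INTERFACE_ATTRS = [
    "_extended_symbols_",
    "_extended_clause_keywords_",
    "_extended_data_types_",
    "_extended_data_types_converter_",
    "_extended_predicate_parsers_",
    "_extended_predicate_op_executors_",
    "_extended_relation_op_schema_",
    "_extended_relation_op_executors_",
]


class ToyExtendedSyntax:
    """Hypothetical stand-in for ExtendedSyntax; it sets no attribute itself."""

    @classmethod
    def implemented_interfaces(cls) -> List[str]:
        # Assumption: an interface is implemented when the class sets the attribute.
        return [name for name in INTERFACE_ATTRS if getattr(cls, name, None) is not None]


class ToySimSyntax(ToyExtendedSyntax):
    """Made-up syntax that implements only part of the interfaces."""

    _extended_symbols_ = "~="  # placeholder symbol, format assumed
    _extended_clause_keywords_ = {"SELECT": "simselect"}  # string key stands in for SQLClause.SELECT


print(ToySimSyntax.implemented_interfaces())
```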

In addition to those class attributes for users to fill out, ExtendedSyntax also includes several class methods to mount the implementations of those attributes into corresponding DBSim modules. We call them "mounting methods":

  1. ExtendedSyntax.addExtendedSymbolsAndKeywords
  2. ExtendedSyntax.addExtendedPredicateParsers
  3. ExtendedSyntax.addExtendedDataTypes
  4. ExtendedSyntax.addExtendedPredicateOps
  5. ExtendedSyntax.addExtendedRelationOps

For example, ExtendedSyntax.addExtendedPredicateParsers() adds the extended predicate parsing functions (_extended_predicate_parsers_) to query_parser_toolbox.py. These class methods are already complete, and a user's syntax should not override them. After implementing the attributes, users only need to list these class methods of their syntax in the entry_points of the RegEntry in the registry. See the following "Registry" section and the "Extension examples" section for more details.
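
The following sketch shows why inherited class methods work as entry points: a class method receives the subclass as cls, so it reads the subclass's attributes without being overridden, and accessing it through the subclass yields a callable that takes no arguments. The names toy_parser_toolbox, ToyExtendedSyntax, ToySimSyntax, and parse_similar_to are hypothetical, and the attribute is simplified to a plain dictionary rather than the tuple used by the real _extended_predicate_parsers_.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for the parser table kept in query_parser_toolbox.py.
toy_parser_toolbox: List[Callable[[str], str]] = []


class ToyExtendedSyntax:
    """Hypothetical base class with one ready-made mounting method."""

    _extended_predicate_parsers_: Dict[str, Callable[[str], str]] = {}

    @classmethod
    def addExtendedPredicateParsers(cls) -> None:
        # `cls` is the user's subclass, so the subclass's parsers are mounted.
        toy_parser_toolbox.extend(cls._extended_predicate_parsers_.values())


def parse_similar_to(expr: str) -> str:
    # Made-up predicate parser used only for this illustration.
    return "SimilarTo(" + expr + ")"


class ToySimSyntax(ToyExtendedSyntax):
    # The subclass only fills in the attribute; the mounting method is inherited.
    _extended_predicate_parsers_ = {"top_level": parse_similar_to}


# A class method accessed on the class takes no arguments: Callable[[], None].
entry_points: List[Callable[[], None]] = [ToySimSyntax.addExtendedPredicateParsers]
for mount in entry_points:
    mount()

print(toy_parser_toolbox[0]("a, b"))
```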

Registry

registry is an OrderedDict of key-value pairs 'syntax name -> registry entry (RegEntry)'. RegEntry is defined as follows in dbsim/extensions/extended_syntax/registry_utils.py:

class RegEntry(object):
  """
  Each RegEntry represents a registered extended syntax. 
  """
  __slots__ = ("syntax", "clause_parsers", "entry_points", )
  def __init__(
    self, 
    syntax: Type[ExtendedSyntax], 
    clause_parsers: 'OrderedDict[SQLClause, typing.Tuple[TriggerFunc, ParserFunc]]',
    entry_points: List[Callable[[], None]]
  ) -> None:
    self.syntax = syntax
    self.clause_parsers = clause_parsers
    self.entry_points = entry_points

RegEntry has three attributes, of which the last two are the interfaces through which an extended syntax mounts its implementations.

  1. syntax: the class of the extended syntax
  2. clause_parsers: a mapping from SQLClause to the corresponding clause parsing function, i.e., the SQL-clause level extension interfaces
    • clause_parsers is an OrderedDict of key-value pairs "SQLClause -> (trigger_function, parsing_function)", where trigger_function and parsing_function are both function names, i.e., strings rather than function objects. This lets DBSim call the functions dynamically at runtime on any instance of the syntax. The SQL clause parsers need to be implemented as instance methods (i.e., associated with self in Python) so that internal flags, if any, are not overwritten when SQL clauses are nested. We illustrate such a case in the "Extension examples" section.
    • The trigger_function of an extended syntax is a method that accepts a query and returns True or False to tell DBSim whether the current query matches that syntax. If it returns True, the accompanying parsing_function is called to parse the current query.
    • When parsing a SQL clause, DBSim calls the trigger_functions registered for this clause in registration order, and uses the first parsing_function whose trigger_function returns True to parse the clause.
    • Generally, the parsing_function should parse the relational operators in the current SQL clause and call the expression-level parsers to parse finer elements such as the predicate operators in an expression. In practice, the workflow often does not follow this pattern strictly. We recommend reading the "Extension examples" section for a better understanding.
  3. entry_points: a list of entry points that connect the ExtendedSyntax interfaces (introduced above) with the registry, i.e., the expression level extension interfaces. Normally, you just need to fill this list with all the "mounting methods" that your syntax inherits from ExtendedSyntax (mentioned in the "ExtendedSyntax" section).
    • When parsing an expression, all registered extended predicate parsers and the native standard predicate parser are tried in order (first the extended parsers in registration order, then the standard parser), and the first parser that succeeds is used.
    • If an extended predicate parser cannot handle the input query, it should raise a ParsingFailure exception (defined in dbsim/utils/exceptions.py). The exception is caught during the process above, which marks this parser as failed and lets DBSim try the next one.
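
The fallback order for predicate parsers described above can be sketched as follows. This is a simplified model under stated assumptions: ParsingFailure is redefined locally instead of being imported from dbsim/utils/exceptions.py, and parse_distance, parse_standard, parse_expression, and the "<->" operator are made up for the example.

```python
from typing import Callable, List


class ParsingFailure(Exception):
    """Local stand-in for the exception defined in dbsim/utils/exceptions.py."""


def parse_distance(expr: str) -> str:
    # Made-up extended parser: handles only expressions that contain "<->".
    if "<->" not in expr:
        raise ParsingFailure("not a distance expression")
    left, right = (part.strip() for part in expr.split("<->", 1))
    return f"Distance({left}, {right})"


def parse_standard(expr: str) -> str:
    # Made-up standard parser: accepts everything that reaches it.
    return f"Standard({expr.strip()})"


def parse_expression(expr: str, extended: List[Callable[[str], str]]) -> str:
    """Try extended parsers in registration order, then the standard parser."""
    for parser in extended + [parse_standard]:
        try:
            return parser(expr)
        except ParsingFailure:
            continue  # this parser declined, so move on to the next one
    raise ValueError(f"no parser accepted: {expr!r}")


print(parse_expression("a <-> b", [parse_distance]))
print(parse_expression("a > 1", [parse_distance]))
```

Raising ParsingFailure as early as possible keeps an extended parser from shadowing queries it was never meant to handle, since a parser that returns a result ends the search.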

An example registry looks as follows (see the "Extension examples" section for more details), where SimSelectionSyntax and SpatialSyntax are two example extended syntaxes that we have already implemented.

registry: Registry = OrderedDict({
  "simselect": RegEntry( 
                  syntax=SimSelectionSyntax, 
                  clause_parsers=OrderedDict({
                    SQLClause.SELECT: ("trigger_simselect", "parse_simselect"), 
                    SQLClause.WHERE: ("trigger_simselect_where", "parse_simselect_where")
                  }), 
                  entry_points=[
                    SimSelectionSyntax.addExtendedSymbolsAndKeywords,
                    SimSelectionSyntax.addExtendedPredicateParsers,
                    SimSelectionSyntax.addExtendedDataTypes,
                    SimSelectionSyntax.addExtendedPredicateOps,
                    SimSelectionSyntax.addExtendedRelationOps
                  ]
                ),
  "spatial": RegEntry( 
                  syntax=SpatialSyntax, 
                  clause_parsers=OrderedDict({
                    SQLClause.SELECT: ("trigger_spatialselect", "parse_spatialselect"), 
                    SQLClause.WHERE: ("trigger_spatialselect_where", "parse_spatialselect_where")
                  }), 
                  entry_points=[
                    SpatialSyntax.addExtendedSymbolsAndKeywords,
                    SpatialSyntax.addExtendedPredicateParsers,
                    SpatialSyntax.addExtendedDataTypes,
                    SpatialSyntax.addExtendedPredicateOps,
                    SpatialSyntax.addExtendedRelationOps
                  ]
                ),
})
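
Because the registry stores trigger and parser names as strings, they have to be resolved on a syntax instance at parse time. The sketch below shows one way this could work with getattr, trying entries in registration order. It assumes a simplified registry layout (a tuple instead of RegEntry), a local SQLClause stand-in, and a made-up ToySimSyntax; none of this is DBSim's actual dispatch code.

```python
from collections import OrderedDict
from enum import Enum
from typing import Optional


class SQLClause(Enum):
    """Local stand-in for DBSim's SQLClause."""
    SELECT = "select"
    WHERE = "where"


class ToySimSyntax:
    """Made-up syntax whose clause parsers are instance methods."""

    def trigger_simselect(self, query: str) -> bool:
        return query.lower().startswith("simselect")

    def parse_simselect(self, query: str) -> str:
        return "SimSelection(" + query[len("simselect"):].strip() + ")"


# syntax name -> (syntax class, {clause: (trigger name, parser name)})
toy_registry = OrderedDict()
toy_registry["simselect"] = (
    ToySimSyntax,
    {SQLClause.SELECT: ("trigger_simselect", "parse_simselect")},
)


def parse_clause(clause: SQLClause, query: str) -> Optional[str]:
    """Return the result of the first parser whose trigger fires, else None."""
    for syntax_cls, clause_parsers in toy_registry.values():
        if clause not in clause_parsers:
            continue
        trigger_name, parser_name = clause_parsers[clause]
        instance = syntax_cls()  # fresh instance, so per-parse flags stay isolated
        if getattr(instance, trigger_name)(query):
            return getattr(instance, parser_name)(query)
    return None  # no extension claimed the clause


print(parse_clause(SQLClause.SELECT, "simselect a, b"))
```

In this sketch a None result means that no extended syntax claimed the clause, which is where a standard parser would take over.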