4. Building an extended syntax
In DBSim, the basic unit for organizing implementations is the `syntax`, which is a broader concept than query-language syntax. It refers to a scope, or namespace, that manages a collection of implementations ranging from query parsing to query execution functions. We call the native implementations of DBSim the `standard syntax`. Similarly, the unit for organizing extensions is also a `syntax`, called an `extended syntax`, which is the suite of implementations for all extended items scoped in that `syntax`.
An `extended syntax` is incremental: it normally includes only the implementations that extend the `standard syntax`, such as new operators or overrides of existing functions, instead of re-implementing everything in the `standard syntax`.
DBSim uses a registry-based mechanism to manage every `extended syntax`.
The mechanism is centered on a registry that acts as middleware between the kernel and the extensions. Each row in the registry holds the identity of an `extended syntax` and a registry entry, which maintains the interfaces through which the `syntax` mounts its implementations. Specifically, a registry entry provides entry points for implementations of different purposes, including adding query keywords, adding data types and operators, inserting optimization rules, and customizing the translation from logical to physical plans. To put an `extended syntax` into effect, users only need to register its implementations in a registry entry and append that entry to the registry as a new row. When DBSim starts, the kernel reads the registry and adds the extended implementations to the `standard syntax`.
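The registry mechanism described above can be pictured with a small toy model. This is only a sketch under stated assumptions, not DBSim's actual kernel code: `ToyRegEntry`, `standard_keywords`, `add_simselect_keyword`, and `load_registry` are hypothetical names invented for the illustration.

```python
from collections import OrderedDict
from typing import Callable, Dict, List

# Hypothetical stand-in for the kernel's standard syntax: keyword -> operator name.
standard_keywords: Dict[str, str] = {"select": "ProjectionOp", "where": "SelectionOp"}


class ToyRegEntry:
    """Simplified stand-in for a registry entry: it only holds entry points."""

    def __init__(self, entry_points: List[Callable[[], None]]) -> None:
        self.entry_points = entry_points


def add_simselect_keyword() -> None:
    """Entry point of a hypothetical extension: it adds one keyword."""
    standard_keywords["simselect"] = "SimSelectionOp"


# Each row maps the identity of an extended syntax to its registry entry.
toy_registry: "OrderedDict[str, ToyRegEntry]" = OrderedDict()
toy_registry["simselect"] = ToyRegEntry(entry_points=[add_simselect_keyword])


def load_registry(registry: "OrderedDict[str, ToyRegEntry]") -> None:
    """What a kernel might do at startup: run every entry point in registration order."""
    for entry in registry.values():
        for mount in entry.entry_points:
            mount()


load_registry(toy_registry)
print(sorted(standard_keywords))  # ['select', 'simselect', 'where']
```

The point of the indirection is that the kernel never imports an extension directly; it only iterates over whatever rows the registry contains.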
Overall, DBSim provides two major levels of extension interfaces, the SQL-clause level and expression level.
- SQL-clause level extension:
  (1) Generally, a SQL query may include three clauses: SELECT, FROM, and WHERE. DBSim allows users to extend them individually, for example by replacing the keyword 'select' with another keyword while keeping the 'from' and 'where' keywords unchanged. The interfaces of this level are placed in the `RegEntry` of the `registry`, which is introduced in the "Registry" section.
  (2) To facilitate extensions at this level, DBSim defines the class `SQLClause`, which represents the three clauses as `SQLClause.SELECT`, `SQLClause.FROM`, and `SQLClause.WHERE`.
  (3) Note: because a FROM clause may include nested queries, which makes the extension logic more complex and error-prone, DBSim currently does not support extending `SQLClause.FROM`.
- Expression level extension:
  (1) This level covers extensions of the elements inside expressions, which are finer-grained than a SQL clause, such as extended operators and extended data types.
  (2) The extension interfaces of this level are provided by `ExtendedSyntax`, which is introduced in the "ExtendedSyntax" section.
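As a rough illustration of the SQL-clause level, the sketch below models `SQLClause` as a Python `Enum` and overlays extended clause keywords on the standard ones. Both the `Enum` representation and the helper `effective_keywords` are assumptions made for this example; DBSim's real `SQLClause` class may be defined differently.

```python
from enum import Enum
from typing import Dict


class SQLClause(Enum):
    """Hypothetical stand-in for DBSim's SQLClause."""
    SELECT = "select"
    FROM = "from"
    WHERE = "where"


def effective_keywords(extended: Dict[SQLClause, str]) -> Dict[SQLClause, str]:
    """Overlay extended clause keywords on the standard ones."""
    if SQLClause.FROM in extended:
        # Mirrors the restriction described above: FROM cannot be extended.
        raise ValueError("extending SQLClause.FROM is not supported")
    keywords = {clause: clause.value for clause in SQLClause}
    keywords.update(extended)
    return keywords


kw = effective_keywords({SQLClause.SELECT: "simselect"})
print(kw[SQLClause.SELECT], kw[SQLClause.FROM], kw[SQLClause.WHERE])
# simselect from where
```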
For more information about the architecture and design principles of DBSim, please read our paper "Extensible Database Simulator for Fast Prototyping In-Database Algorithms".
To implement an `extended syntax`, you need to:
- create a subclass of `ExtendedSyntax` (defined in `dbsim/extensions/extended_syntax/extended_syntax.py`) and implement the extension interfaces, then
- append a new registry entry for the subclass to `registry` (defined in `dbsim/extensions/extended_syntax/registry.py`).
`ExtendedSyntax` provides several extension interfaces in the form of class attributes, which are defined as follows:
```python
class ExtendedSyntax(object):
    """The abstract class for extended syntax"""
    __slots__ = {
        # plugin syntax symbols and data types
        '_extended_symbols_': '-> str',
        '_extended_clause_keywords_': '-> Dict[SQLClause, str]',
        '_extended_data_types_': '-> Dict[Type[Const], str]',
        '_extended_data_types_converter_': '-> Callable[[pandas.Series], [pandas.Series, FieldType]]',
        # plugin predicate parsers.
        # the second parameter in this attribute presents whether to block errors raised by these parsers.
        # True -> blocking all errors from these parsers; False -> reporting all errors from them.
        # Note that, as an informational exception, ParsingFailure will always be blocked in both cases.
        '_extended_predicate_parsers_': '-> typing.Tuple[Dict[PredExprLevel, Callable], bool]',
        # plugin predicate operator executors, used by query compilers
        '_extended_predicate_op_executors_': '-> Dict[Type[SimpleOp], Callable[ [SimpleOp,Schema,DataSet], Callable[[row,ctx],Any] ]]',
        # plugin relational operator schema resolvers (used by schema_interpreter) and executors (used by compilers)
        '_extended_relation_op_schema_': '-> Dict[Type[SuperRelationalOp], Callable[[SuperRelationalOp, DataSet], Schema]]',
        '_extended_relation_op_executors_': '-> Dict[Type[SuperRelationalOp], Callable[ [DataSet,SuperRelationalOp], Callable[[ctx],Any] ]]',
    }
```
Here we briefly introduce the purpose of each attribute. The "Extension examples" section explains them in more detail using concrete examples.
- `_extended_symbols_`: to add non-SQL symbols to the query parser
- `_extended_clause_keywords_`: to add non-SQL keywords to the corresponding SQL clauses (i.e., the SELECT, FROM, and WHERE clauses)
- `_extended_data_types_`: to add new data types
- `_extended_data_types_converter_`: to add converter functions that convert Pandas DataFrame data types into DBSim internal data types. The converters are only used when loading data from csv/tsv files.
- `_extended_predicate_parsers_`: parsing functions for predicate operators (i.e., the operators that only appear in predicates/expressions, like `And`, `Or`, `GtOp`, `LtOp`, etc., which never modify the schema)
- `_extended_predicate_op_executors_`: executing functions for predicate operators, i.e., the physical implementation of the logical operators
- `_extended_relation_op_schema_`: schema resolving functions for relational operators (i.e., all the other operators, which are used outside predicates, like `SelectionOp`, `JoinOp`, `ProjectionOp`, etc., and which may modify the input relation schema and output data with a different schema)
- `_extended_relation_op_executors_`: executing functions, i.e., the physical implementation, for relational operators
- Note:
  - There is no extension interface for schema resolution of predicate operators because such operators never change the input relation schema.
  - There is no extension interface for parsers of relational operators because extended parsers for such operators should be implemented at the SQL-clause level, i.e., in `RegEntry.clause_parsers`. See the following "Registry" section for details.
  - These class attributes are expected rather than required, i.e., they are optional for a user's custom syntax. DBSim allows users to implement only a subset of them.
In addition to the class attributes that users fill in, `ExtendedSyntax` also includes several class methods that mount the implementations of those attributes into the corresponding DBSim modules. We call them "mounting methods":
- `ExtendedSyntax.addExtendedSymbolsAndKeywords`
- `ExtendedSyntax.addExtendedPredicateParsers`
- `ExtendedSyntax.addExtendedDataTypes`
- `ExtendedSyntax.addExtendedPredicateOps`
- `ExtendedSyntax.addExtendedRelationOps`

For example, `ExtendedSyntax.addExtendedPredicateParsers()` adds the extended predicate parsing functions (`_extended_predicate_parsers_`) to `query_parser_toolbox.py`. These class methods are already complete, and a user's syntax should not override them. After implementing the attributes, users only need to put these class methods of their syntax into the `entry_points` of the `RegEntry` in the `registry`. See the following "Registry" section and the "Extension examples" section for more details.
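A mounting method can be thought of as a class method that reads an attribute from whichever subclass it is called on and copies it into a shared module. The sketch below shows that pattern only; it is not the body of DBSim's real `addExtendedPredicateParsers`. The names `predicate_parser_toolbox`, `MountableSyntax`, `ToySyntax`, and `parse_tilde` are hypothetical, and a string key stands in for a `PredExprLevel` member.

```python
from typing import Callable, List

# Hypothetical stand-in for the parser list kept in query_parser_toolbox.py.
predicate_parser_toolbox: List[Callable[[str], str]] = []


class MountableSyntax(object):
    """Simplified stand-in base class with a single mounting method."""
    _extended_predicate_parsers_ = None  # optional, filled in by subclasses

    @classmethod
    def addExtendedPredicateParsers(cls) -> None:
        """Copy the parsers declared on `cls` into the shared toolbox."""
        declared = cls._extended_predicate_parsers_
        if declared is None:  # attribute not implemented: nothing to mount
            return
        parsers, _block_errors = declared
        predicate_parser_toolbox.extend(parsers.values())


def parse_tilde(text: str) -> str:
    """Hypothetical predicate parser."""
    return "SimOp(" + text + ")"


class ToySyntax(MountableSyntax):
    # (parsers keyed by a level name, flag for blocking parser errors)
    _extended_predicate_parsers_ = ({"level_1": parse_tilde}, True)


# The inherited class method is what goes into entry_points; the subclass does not redefine it.
entry_points = [ToySyntax.addExtendedPredicateParsers]
for mount in entry_points:
    mount()

print(predicate_parser_toolbox[0]("a ~ b"))  # SimOp(a ~ b)
```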
`registry` is an `OrderedDict` of key-value pairs 'syntax name -> registry entry (`RegEntry`)'. `RegEntry` is defined as follows in `dbsim/extensions/extended_syntax/registry_utils.py`:
```python
class RegEntry(object):
    """
    Each RegEntry represents a registered extended syntax.
    """
    __slots__ = ("syntax", "clause_parsers", "entry_points", )

    def __init__(
        self,
        syntax: Type[ExtendedSyntax],
        clause_parsers: 'OrderedDict[SQLClause, typing.Tuple[TriggerFunc, ParserFunc]]',
        entry_points: List[Callable[[], None]]
    ) -> None:
        self.syntax = syntax
        self.clause_parsers = clause_parsers
        self.entry_points = entry_points
```
`RegEntry` has three attributes, of which the last two are the interfaces through which an extended syntax mounts its implementations.
- `syntax`: the class of the extended syntax
- `clause_parsers`: a mapping from `SQLClause` to the corresponding clause parsing function, i.e., the SQL-clause level extension interfaces
  - `clause_parsers` is an `OrderedDict` of key-value pairs "SQLClause -> (trigger_function, parsing_function)", where `trigger_function` and `parsing_function` are both function names, i.e., strings rather than function objects. This allows the functions to be called dynamically at runtime on any instance of the syntax: the SQL clause parsers need to be implemented as instance methods (i.e., associated with `self` in Python) so that internal flags (if any) are not overwritten when SQL clauses are nested. We illustrate such a case in the "Extension examples" section.
  - The `trigger_function` of an extended syntax is a method that accepts a query and returns True/False to tell DBSim whether the current query satisfies that syntax. If it returns True, the associated `parsing_function` is called to parse the current query.
  - When parsing a SQL clause, all the registered trigger_functions for that clause are called in registration order, and the first parsing_function whose trigger_function returns True is used to parse the clause.
  - Generally, the parsing_function should parse the relational operators in the current SQL clause and call the expression level parsers to parse the finer elements, such as the predicate operators in an expression. In practice, however, the workflow often does not strictly follow this pattern. We recommend reading the "Extension examples" section for a better understanding.
- `entry_points`: a list of entry points that connect the `ExtendedSyntax` interfaces (introduced above) with the `registry`, i.e., the expression level extension interfaces. Normally, you just need to fill this list with all the "mounting methods" that your syntax inherits from `ExtendedSyntax` (mentioned in the "ExtendedSyntax" section).
  - When parsing an expression, all registered extended predicate parsers and the native standard predicate parser are tried in order (first the extended parsers in registration order, then the standard parser), and the first parser that succeeds is used.
  - If an extended predicate parser cannot handle the input query, it should raise a `ParsingFailure` exception (defined in `dbsim/utils/exceptions.py`). This exception is caught during the process above to indicate that the parser failed, and DBSim then tries the next one.
An example `registry` looks as follows (see the "Extension examples" section for more details), where `SimSelectionSyntax` and `SpatialSyntax` are two example extended syntaxes that we have already implemented.
```python
registry: Registry = OrderedDict({
    "simselect": RegEntry(
        syntax=SimSelectionSyntax,
        clause_parsers=OrderedDict({
            SQLClause.SELECT: ("trigger_simselect", "parse_simselect"),
            SQLClause.WHERE: ("trigger_simselect_where", "parse_simselect_where")
        }),
        entry_points=[
            SimSelectionSyntax.addExtendedSymbolsAndKeywords,
            SimSelectionSyntax.addExtendedPredicateParsers,
            SimSelectionSyntax.addExtendedDataTypes,
            SimSelectionSyntax.addExtendedPredicateOps,
            SimSelectionSyntax.addExtendedRelationOps
        ]
    ),
    "spitial": RegEntry(
        syntax=SpatialSyntax,
        clause_parsers=OrderedDict({
            SQLClause.SELECT: ("trigger_spatialselect", "parse_spatialselect"),
            SQLClause.WHERE: ("trigger_spatialselect_where", "parse_spatialselect_where")
        }),
        entry_points=[
            SpatialSyntax.addExtendedSymbolsAndKeywords,
            SpatialSyntax.addExtendedPredicateParsers,
            SpatialSyntax.addExtendedDataTypes,
            SpatialSyntax.addExtendedPredicateOps,
            SpatialSyntax.addExtendedRelationOps
        ]
    ),
})
```
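Given a registry of this shape, clause dispatch by function name might work roughly as follows. This is a sketch under stated assumptions, not DBSim's dispatcher: the toy classes `ToySimSelect` and `ToyTopSelect`, the table `select_parsers`, and the choice to create a fresh instance per clause are all hypothetical.

```python
from collections import OrderedDict
from typing import Optional


class ToySimSelect:
    """Hypothetical syntax whose clause parsers are instance methods."""

    def trigger_simselect(self, query: str) -> bool:
        return query.lower().startswith("simselect ")

    def parse_simselect(self, query: str) -> tuple:
        return ("SimSelect", query.split(maxsplit=1)[1])


class ToyTopSelect:
    """Second hypothetical syntax, registered after the first one."""

    def trigger_top(self, query: str) -> bool:
        return query.lower().startswith("top ")

    def parse_top(self, query: str) -> tuple:
        return ("Top", query.split(maxsplit=1)[1])


# syntax name -> (syntax class, (trigger name, parser name)) for the SELECT clause
select_parsers = OrderedDict([
    ("simselect", (ToySimSelect, ("trigger_simselect", "parse_simselect"))),
    ("top", (ToyTopSelect, ("trigger_top", "parse_top"))),
])


def parse_select_clause(query: str) -> Optional[tuple]:
    """Use the first registered parser whose trigger accepts the query."""
    for syntax_cls, (trigger_name, parser_name) in select_parsers.values():
        # Assumption: one new instance per clause, so nested clauses share no flags.
        instance = syntax_cls()
        if getattr(instance, trigger_name)(query):
            return getattr(instance, parser_name)(query)
    return None  # no extended syntax matched; a standard parser would take over


print(parse_select_clause("TOP 5 name"))  # ('Top', '5 name')
```

Storing method names as strings and resolving them with `getattr` is what lets the same registry row be applied to any instance of the syntax class.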