\chapter{Implementation}
This chapter describes how the design is translated into a functional system. It first presents the architecture of the transpiler and the implementation of the individual features, documenting the coding process with snippets and explanations of how the Ataccama Expression Language features are implemented in Python. The testing and validation process is then described, followed by the challenges faced during the implementation and the resolutions adopted to overcome them. Finally, the setup of the development environment is summarized, focusing on the tools and technologies selected to ensure a robust and efficient development process.
\section{Architecture}
The class diagram of the transpiler is shown in Figure~\ref{fig:architecture}.
\begin{figure}[htbp]
\centering
\includegraphics[width=1.0\columnwidth]{diagrams/architecture.png}
\caption{Architecture overview}
\label{fig:architecture}
\end{figure}
The \texttt{ExpressionToPythonCompiler} is initialized with contexts that include external function handling capabilities. It registers the various operators and functions that can be used within expressions, ensuring that these are correctly integrated into the compiled code. The lexer and parser transform the input expressions into a format that can be used to generate Python bytecode, with the visitor handling the actual translation into code as well as symbol resolution. The user may also provide custom functions.
The \texttt{IdentifierTable} plays a crucial role in managing the names for fields, variables, and functions. The \texttt{FunctionFactoryContext} and
\texttt{LookupFileResolver} provide mechanisms to incorporate external functionalities and data, enhancing the flexibility and power of the expression evaluation.
Finally, the \texttt{Expression} class encapsulates the executable code, providing a method to evaluate it with specific inputs. This architecture supports extensibility and customization, key traits for systems requiring dynamic data manipulation capabilities.
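To make the interplay of these components more concrete, the following sketch shows one possible way of driving the compiler, including a user-provided function. The names \texttt{create\_compiler}, \texttt{compile}, and \texttt{evaluate} appear in the usage example later in this chapter, whereas \texttt{register\_function} and the dictionary-shaped record are hypothetical stand-ins for the actual registration and record interfaces.
\begin{verbatim}
# Hypothetical sketch of driving the compiler; register_function and
# the dict-shaped record are assumptions, not the actual interface.
# create_compiler comes from the transpiler package.
compiler = create_compiler()

# A user-supplied function made available inside expressions.
compiler.register_function("shout", lambda s: s.upper() + "!")

expression = compiler.compile("shout(name)")
print(expression.evaluate({"name": "Alice"}))  # ALICE!
\end{verbatim}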
\section{Implementation of individual features}
% Provide detailed documentation of the coding process, including snippets and explanations of how Ataccama’s rules are implemented in Python.
This section delves into the technical specifics of implementing the key features of the Ataccama Expression Language in Python. The primary goal is to accurately interpret and execute the expression rules defined in Ataccama's custom language using Python tools and libraries.
\subsection{Expression Parsing}
The first step in processing Ataccama’s custom expression language in Python is to parse the expressions into a format that can be programmatically analyzed and executed. This is achieved using ANTLR, a powerful tool that generates a lexer and parser based on the grammar used in Ataccama ONE.
\subsubsection{ANTLR Lexer and Parser}
The lexer reads the raw input text and converts it into a stream of tokens based on the grammar rules defined for Ataccama’s language. The parser then takes these tokens and builds a parse tree.
\begin{verbatim}
sequence:
    command+ expr
    | expr
    ;
command:
    dfunc
    | assign SEMIC
    ;
assign:
    (varName | name) ASSIGN_OP expr
    ;
\end{verbatim}
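With the generated lexer and parser, parsing an expression in Python follows the usual ANTLR runtime pattern. In the sketch below, the module path and the lexer class name are assumptions that follow ANTLR's naming conventions for generated code; \texttt{main} is the top-level rule discussed in the next subsection.
\begin{verbatim}
from antlr4 import InputStream, CommonTokenStream

# The module path and lexer class name are assumptions following
# ANTLR's conventions for generated code.
from generated.ExpressionsLexer import ExpressionsLexer
from generated.ExpressionsParser import ExpressionsParser

def parse_expression(text: str):
    lexer = ExpressionsLexer(InputStream(text))
    tokens = CommonTokenStream(lexer)
    parser = ExpressionsParser(tokens)
    # 'main' is the top-level rule (the compilation unit).
    return parser.main()
\end{verbatim}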
\subsubsection{Visitor Pattern}
From the parse tree, a visitor is generated: a component that traverses the parse tree and uses the visitor design pattern to execute operations based on its nodes. For the expression language, the visitor's primary role is to transform the parse tree into an \glsxtrfull{ast} using Python's \texttt{ast} module, which can then be executed or evaluated in a Python environment. The visitor visits the non-terminals of the grammar and transforms them into \texttt{ast} objects or, where needed, other intermediate objects.
The top-level non-terminal \texttt{main}, also known as the compilation unit, is visited and the visitor returns an \texttt{ast} representation of a Python module.
When executed, the module stores the result of the expression in a variable with a predefined name, from which the result can then be retrieved.
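A minimal sketch of this mechanism is shown below; the result variable name is an illustrative assumption, and the module is of course produced by the visitor from the parse tree rather than built by hand.
\begin{verbatim}
import ast

# Build a trivial module equivalent to: __expr_result__ = 1 + 2
# (the result variable name here is an illustrative assumption).
module = ast.Module(
    body=[
        ast.Assign(
            targets=[ast.Name(id="__expr_result__", ctx=ast.Store())],
            value=ast.BinOp(left=ast.Constant(1), op=ast.Add(),
                            right=ast.Constant(2)),
        )
    ],
    type_ignores=[],
)
ast.fix_missing_locations(module)

namespace = {}
exec(compile(module, filename="<expression>", mode="exec"), namespace)
print(namespace["__expr_result__"])  # 3
\end{verbatim}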
The following code snippet provides an example of how the visitor pattern is used to handle an AddExpr node in the parse tree:
\begin{verbatim}
def visitAddExpr(self, ctx: ExpressionsParser.AddExprContext):
    if ctx.left:
        # Determine which operator token was used.
        op = ctx.op.children[0].symbol.type
        if op == ExpressionsParser.PLUS:
            name = Operator.ADD.value
        elif op == ExpressionsParser.MINUS:
            name = Operator.SUB.value
        else:
            raise ValueError(f"Unknown operator: {op}")
        # Resolve the internal function implementing the operator.
        identifier = self.get_symbol_table().get_identifier_for_name(
            name, "internal", "load")
        args = [self.visit(ctx.left), self.visit(ctx.right)]
        # Build an ast.Call node invoking the operator function.
        node = self.create_call_node(identifier,
                                     ctx.start.line,
                                     ctx.stop.line,
                                     args)
        return ExpressionArgument(node)
    return self.visitChildren(ctx)
\end{verbatim}
\subsubsection{Symbol table}
There are two layers of names: visible and internal. Internal names are used for internal workings: operators that are functions, multiline lambdas, etc. Visible names are the names that are visible to the user, i.e. callable functions, variables, record fields, etc.
The symbol table can also create child symbol tables, which handle the logic of symbol hiding: a symbol defined in a scope hides any symbol with the same name in the parent scope.
The use of a symbol table in our system is not primarily due to limitations within Python’s name resolution and scoping capabilities. Instead, it addresses specific challenges that arise in our context. Without a symbol table, name resolution would have to occur at runtime, which complicates any future implementation of name existence checks. Additionally, it is good practice to prevent users from accessing internal names related to operators or multiline lambda functions. By using a symbol table, we can simplify the mapping of complex names, such as converting \texttt{math.tan} to a valid identifier such as \texttt{math\_tan}, while also providing a robust system for detecting and handling name collisions.
Lastly, the symbol table provides an infrastructure for compile-time type checking, should we decide to implement it in the future.
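The following sketch illustrates the general idea of scoped lookup and name mangling; it is a simplified stand-in and not the actual \texttt{IdentifierTable} implementation.
\begin{verbatim}
from typing import Optional

class SymbolTable:
    """Simplified sketch of a scoped symbol table with name mangling."""

    def __init__(self, parent: Optional["SymbolTable"] = None):
        self.parent = parent
        self.symbols = {}

    def define(self, name: str) -> str:
        # Mangle dotted names such as 'math.tan' into valid identifiers.
        mangled = name.replace(".", "_")
        # Avoid collisions with names already defined in this scope.
        while mangled in self.symbols.values():
            mangled += "_"
        self.symbols[name] = mangled
        return mangled

    def resolve(self, name: str) -> str:
        # A symbol in this scope hides one of the same name in a parent.
        if name in self.symbols:
            return self.symbols[name]
        if self.parent is not None:
            return self.parent.resolve(name)
        raise KeyError(name)

    def child(self) -> "SymbolTable":
        return SymbolTable(parent=self)
\end{verbatim}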
\subsubsection{Error Handling}
Syntax errors in the expressions are handled using ANTLR's error listeners. These listeners are customized to provide meaningful error messages that help identify and correct syntax issues in the input expressions.
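A custom listener in the ANTLR Python runtime subclasses \texttt{ErrorListener} and overrides \texttt{syntaxError}. The sketch below merely collects errors together with their positions; the reporting format of the actual listener differs.
\begin{verbatim}
from antlr4.error.ErrorListener import ErrorListener

class ExpressionErrorListener(ErrorListener):
    """Collects syntax errors with positions instead of printing them."""

    def __init__(self):
        self.errors = []

    def syntaxError(self, recognizer, offendingSymbol,
                    line, column, msg, e):
        self.errors.append(f"line {line}:{column} {msg}")

# Typical wiring: replace the default console listener.
# parser.removeErrorListeners()
# parser.addErrorListener(ExpressionErrorListener())
\end{verbatim}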
\subsection{Statements}
Handling variables and functions within expressions involves maintaining a symbol table where each variable's name and value are stored. Variables are parsed and evaluated by the visitor, which checks the symbol table to resolve their values during the execution of expressions.
The symbol table keeps track of all symbols used in expressions. It ensures that each variable is correctly declared and used within its scope. Additionally, symbol transformation is employed to prevent collisions and ensure that variable names are unique within the global execution context.
\subsection{Functions and Operators}
Implementing functions and operators in the Python version of the Ataccama Expression Language involves defining Python functions that correspond to each function and operator in the original language.
Each function from Ataccama’s language is mapped to a Python function. These functions handle various data types and perform the necessary computations or data manipulations as defined in the Ataccama language specifications.
Each operator is mapped to a function call with the operand(s) as the arguments. This allows for customizing behavior for arithmetic, logical, and comparison operations to closely align with how they function in the original implementation.
The reimplementation of the functions and operators constitutes a significant portion of the work, as the language supports a wide range of operations that need to be accurately translated into Python.
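To illustrate the shape of the generated code, an operator application is emitted as a call to the function implementing it, conceptually as in the following sketch; the function names are illustrative, as the actual internal names are assigned by the symbol table.
\begin{verbatim}
# Illustrative mapping of operators to the functions implementing them;
# the real registration goes through the compiler's contexts.
OPERATOR_FUNCTIONS = {
    "+": "op_add",
    "-": "op_sub",
    "*": "op_mul",
    "==": "op_eq",
}

# With such a mapping, the expression
#     salary * 12 + bonus
# is compiled into the call structure
#     op_add(op_mul(salary, 12), bonus)
\end{verbatim}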
\subsection{Additional Features and Utilities}
In addition to the core features of the Ataccama Expression Language, several utilities and enhancements are implemented.
\subsubsection{Command Line Interface (CLI)}
A command-line interface is developed to allow users to interact with the expression evaluation engine directly. This CLI provides a simple way to input expressions and receive the output, making it easier to test and validate the implementation.
Running \texttt{poetry install} makes the \texttt{evaluate-records} script available.
The \texttt{evaluate-records} command evaluates records against an expression:
\begin{verbatim}
echo "John,25\nJohn,18" | evaluate-records "name == 'John' and
age > 20" --record-format "name:STRING,age:INTEGER"
\end{verbatim}
\subsubsection{Expression Generator}
An expression generator is created to produce random expressions based on the grammar of the Ataccama Expression Language. This tool is useful for testing the parser and visitor components.
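The sketch below conveys the general idea of grammar-driven generation on a tiny hand-written fragment; the real generator follows the full ANTLR grammar of the language.
\begin{verbatim}
import random

# Toy grammar fragment; the last production of each rule is
# non-recursive so that generation is guaranteed to terminate.
GRAMMAR = {
    "expr": [["term", "+", "expr"], ["term", "-", "expr"], ["term"]],
    "term": [["(", "expr", ")"], ["NUMBER"], ["NAME"]],
}

def generate(symbol: str = "expr", depth: int = 0) -> str:
    if symbol == "NUMBER":
        return str(random.randint(0, 99))
    if symbol == "NAME":
        return random.choice(["age", "name", "salary"])
    if symbol not in GRAMMAR:
        return symbol  # literal token such as '+' or '('
    productions = GRAMMAR[symbol]
    # Prefer the non-recursive production once a depth limit is reached.
    production = productions[-1] if depth > 4 else random.choice(productions)
    return " ".join(generate(s, depth + 1) for s in production)

print(generate())  # e.g. "( age + 7 ) - salary"
\end{verbatim}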
\section{Testing and Validation}
% Explain how you tested the reimplementation of data quality rules to ensure they work correctly and efficiently in local Python environments.
Testing and validation play crucial roles in software development, particularly when reimplementing data quality rules to verify their performance in Python environments. In this project, we utilize \texttt{pytest}, leveraging its parametrization features to rigorously test the implementation of Ataccama's data quality rules in Python.
The use of test parametrization allows the same test function to run with different input values. This is particularly useful in this project for applying a wide array of test scenarios to the implemented functions, ensuring comprehensive coverage and verifying that the rules behave as intended across diverse data sets and conditions.
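A parametrized test in this style might look like the following sketch; \texttt{create\_compiler}, \texttt{compile}, and \texttt{evaluate} are taken from the usage example later in this chapter, while passing an empty record and the \texttt{null} test case are assumptions made for illustration.
\begin{verbatim}
import pytest

# create_compiler comes from the transpiler package; passing an empty
# record is an assumption made for illustration.
def evaluate(expression_str):
    compiler = create_compiler()
    return compiler.compile(expression_str).evaluate({})

@pytest.mark.parametrize(
    ("expression", "expected"),
    [
        ("1 + 2", 3),
        ("lower('ABC')", "abc"),
        ("1 + null", None),  # null propagation, see Challenges
    ],
)
def test_expression(expression, expected):
    assert evaluate(expression) == expected
\end{verbatim}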
The primary focus during testing is to ensure that the reimplemented rules behave as closely as possible to the original Ataccama Expression Language; this is achieved using unit tests. Tests are designed to validate both typical and edge-case scenarios, ensuring the rules are robust under various data conditions.
The test suite contains more than 1300 tests, covering a wide range of functions, operators, and expressions. These tests validate the correctness defined by the Ataccama implementation and ensure that the Python version produces the same results in as many cases as possible.
This extensive testing process is crucial for identifying and resolving any discrepancies between the original and re-implemented rules, ensuring that the Python implementation is accurate and reliable for data quality checks. As such it is a key component of the development process, providing confidence in the correctness and efficiency of the reimplementation.
Expected outputs are set with the help of the original Ataccama ONE Expressions Java engine, which serves as a reference for the Python implementation. This ensures that the Python version produces consistent results with the established behavior of the rules in the Ataccama environment. This is aided by the tool for output comparison, which is discussed in the next section.
\subsection{Tool for Output Comparison}
To validate the accuracy of the reimplementation, we have developed a test suite that compares the outputs of our Python implementation against those generated by the original Ataccama ONE Expressions Java engine. This ensures that our implementation produces results consistent with the established behavior of the rules in the Ataccama environment.
This comparison can be run programmatically using a script that executes the \texttt{pytest} test suite; the suite uses a fixture that captures every expression it runs and, if enabled, saves it into a file. The script executes the tests and, for a subset of them, runs the same expressions in the Ataccama ONE Desktop environment. It then compares the outputs from both implementations and highlights any discrepancies.
The output can be used to identify any inconsistencies between the original and re-implemented rules, allowing for further refinement and debugging to ensure the rules are correctly implemented.
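The following sketch shows the rough shape of such a capturing fixture; the fixture name, file format, and enabling switch are illustrative assumptions rather than the project's actual interface.
\begin{verbatim}
import json
import os
import pytest

# Illustrative sketch only; create_compiler comes from the transpiler
# package, and the capture file format here is an assumption.
@pytest.fixture
def run_expression():
    captured = []

    def _run(expression_str, record=None):
        captured.append(expression_str)
        compiler = create_compiler()
        return compiler.compile(expression_str).evaluate(record or {})

    yield _run

    # When enabled, persist the captured expressions so that the
    # comparison script can replay them in Ataccama ONE Desktop.
    if os.environ.get("CAPTURE_EXPRESSIONS"):
        with open("captured_expressions.jsonl", "a") as f:
            for expr in captured:
                f.write(json.dumps(expr) + "\n")

def test_lower(run_expression):
    assert run_expression("lower('ABC')") == "abc"
\end{verbatim}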
\section{Challenges and Resolutions}
% Discuss any challenges faced during the implementation and how they were resolved.
\subsection{Date formatting strings}
\paragraph{Problem}
One of the challenges encountered during the implementation was handling date formatting strings in the Ataccama Expression Language. The original language mirrors Java date formatting patterns \cite{datetimeformatter}, which are not directly compatible with Python’s \texttt{datetime} module \cite{datetime}.
\paragraph{Resolution}
To address the date formatting issue, a mapping between Java and Python date formatting patterns was created. This mapping allows the Python implementation to interpret the date formatting strings correctly and convert them to the appropriate Python format.
To perform this mapping, a separate grammar with a lexer and parser was added to parse the original date formatting string into pattern and text tokens. Pattern tokens are then converted using a lookup table.
The related code is located in the \texttt{ataccama.expressions.dateformat} package of appendix \ref{app:transpiler}.
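At the heart of the resolution is a lookup table from Java \texttt{DateTimeFormatter} pattern tokens to \texttt{strftime} directives. The following excerpt-style sketch assumes a simple tokenized representation of the parsed format string and covers only a handful of tokens.
\begin{verbatim}
# Excerpt of the pattern lookup; the real table covers many more
# Java DateTimeFormatter tokens and edge cases.
JAVA_TO_PYTHON = {
    "yyyy": "%Y",
    "MM": "%m",
    "dd": "%d",
    "HH": "%H",
    "mm": "%M",
    "ss": "%S",
}

def convert(pattern_tokens):
    """pattern_tokens: (kind, value) pairs produced by the date-format
    grammar, e.g. ('pattern', 'yyyy') or ('text', '-')."""
    out = []
    for kind, value in pattern_tokens:
        out.append(JAVA_TO_PYTHON[value] if kind == "pattern" else value)
    return "".join(out)

# "yyyy-MM-dd" -> "%Y-%m-%d"
print(convert([("pattern", "yyyy"), ("text", "-"),
               ("pattern", "MM"), ("text", "-"),
               ("pattern", "dd")]))
\end{verbatim}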
\subsection{Multiline lambda functions}
\paragraph{Problem}
Another challenge was implementing multiline lambda functions in Python. The Ataccama Expression Language supports multiline lambda functions, which are not directly supported in Python \cite{pythonlambdas}. This required finding a workaround to enable multiline lambda functions in the Python implementation.
\paragraph{Resolution}
The solution involved defining full-fledged functions instead of using lambda expressions for multiline needs. To integrate these functions seamlessly and avoid namespace conflicts, we employed a symbol table that manages and mangles names dynamically. This approach ensures that all function names are unique and avoids identifier collisions within the Python environment, effectively replicating the flexibility of Ataccama's multiline lambdas within Python's syntactic constraints.
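In hand-written Python, the effect of this transformation would look roughly as follows; the helper name and the lambda body are illustrative.
\begin{verbatim}
# Instead of a (syntactically impossible) multiline lambda whose body
# contains an assignment followed by an expression, the compiler emits
# a named helper with a mangled, collision-free name:
def __lambda_0__(x):
    y = x * 2
    return y + 1

# ... and references the helper wherever the lambda was used:
result = list(map(__lambda_0__, [1, 2, 3]))  # [3, 5, 7]
\end{verbatim}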
\subsection{Lookup Files}
\paragraph{Problem}
Ataccama's expressions can perform lookups against reference data stored in proprietary binary formats with sophisticated hashing strategies. This feature is crucial for validating data against predefined sets which are optimized for performance in Ataccama's native environment.
\paragraph{Resolution}
To handle this, the proprietary lookup functionality was reimplemented in Python. This involved developing a method to read and interpret the binary format into a usable form in Python. Additionally, to mimic the fixed-size arithmetic and specialized hashing used by Ataccama, similar algorithms were implemented in Python, ensuring that the lookup performance remains efficient and consistent with the original implementation.
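As one illustration of the kind of fixed-size arithmetic involved (explicitly not the proprietary hashing scheme itself), a Java-style 32-bit string hash can be emulated in Python by masking intermediate results to 32 bits:
\begin{verbatim}
def java_string_hash(s: str) -> int:
    """Emulates Java's 32-bit String.hashCode semantics in Python,
    as an illustration of fixed-size arithmetic; the lookup files use
    their own hashing scheme."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Reinterpret the result as a signed 32-bit integer, as Java would.
    return h - 0x100000000 if h >= 0x80000000 else h

print(java_string_hash("Ataccama"))
\end{verbatim}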
\subsection{Null Handling}
\paragraph{Problem}
Ataccama's functions and operators are designed to handle null values gracefully, often returning a null or a neutral value when encountering nulls in expressions. This feature is essential for maintaining data integrity and ensuring robust data quality checks.
\paragraph{Resolution}
The Python implementation adopted a similar approach to null handling. Custom operators and functions were developed to replicate the behavior of Ataccama's handling of nulls. For example, custom implementations of addition (+) and other operators were created to return null or appropriate neutral values when encountering null inputs. This ensures that the data quality rules continue to function predictably and effectively even when faced with incomplete or missing data.
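A minimal sketch of such a null-propagating operator, using an illustrative function name, is shown below:
\begin{verbatim}
# Illustrative null-propagating addition; the actual operator functions
# registered by the compiler handle more types and corner cases.
def op_add(left, right):
    # Any null (None) operand makes the whole result null.
    if left is None or right is None:
        return None
    return left + right

print(op_add(1, 2))     # 3
print(op_add(1, None))  # None
\end{verbatim}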
This careful replication of functionality ensures that the Python version of Ataccama's data quality rules maintains the robustness and reliability of the original system, adhering closely to its operational logic and data handling practices.
\section{Development Environment Setup}
% Describe the setup of your development environment necessary.
For this project, a modern and efficient development environment is set up to facilitate the coding, testing, and deployment phases. The environment leverages several key tools and technologies designed to enhance productivity and ensure the quality of the software developed. Below is a breakdown of the core components of the development setup:
\subsection{Poetry for Dependency Management and Package Publishing}
Poetry\cite{poetry} is utilized as the primary tool for dependency management and package publishing. It offers a streamlined approach to manage libraries and dependencies, ensuring that the project environment is reproducible and consistent across different setups. Poetry simplifies the management of project dependencies, and its lock file ensures that the same versions are used in every environment, reducing "works on my machine" problems.
\subsubsection{Configuration}
The \texttt{pyproject.toml} file is configured to list all necessary libraries and their specific versions. This file also includes configurations for package metadata, making it easier to package and distribute the final software if needed.
\subsection{Mypy for Type Checks}
Given the dynamic nature of Python, Mypy \cite{mypy} is incorporated to provide optional static type checking. Mypy is a static type checker for Python, designed to combine the benefits of dynamic typing and static typing. By annotating Python code with type hints, Mypy can catch many programming errors before they manifest at runtime. It enhances code quality and reliability, especially in large and complex projects where types play a crucial role in the correctness of the program.
\subsubsection{Configuration}
Mypy is configured to run as part of the continuous integration process, checking type annotations during development. Some leniencies are allowed in the configuration to strike a balance between strict type checking and development flexibility. For instance, certain third-party libraries without type hints and generated code like the lexer and parser might be excluded from these checks to prevent excessive false positives.
\subsection{Pytest for Testing}
Pytest \cite{pytest} has been selected as the testing framework for this project, in part due to its support for parametrized testing and test fixtures.
\subsubsection{Configuration}
Tests are written to cover various cases, from basic unit tests that validate each function's behavior with different inputs to integration tests that ensure that the system components work together as expected. Pytest fixtures \cite{pytest_fixtures} are used to set up and tear down test environments, making it easy to manage test states and dependencies.
\subsection{Additional Tools and Practices}
\subsubsection{Version Control}
Git is used for version control, with a repository hosted on the company GitLab, providing a robust framework for collaboration and version tracking.
\subsubsection{Continuous Integration/Continuous Deployment (CI/CD)}
A CI/CD pipeline is set up using GitLab CI/CD to automate the testing and deployment process. The pipeline is configured to run tests on every commit and deploy the application to a staging environment if the tests pass. This setup ensures that the software is continuously tested and can be deployed automatically to a production environment when ready.
\section{Example usage}
To demonstrate the usage of the Python implementation of Ataccama's data quality rules, we provide a simple example of evaluating an expression on a set of records. The following code snippet shows how to evaluate an expression on a list of records using the Python implementation:
\begin{verbatim}
records = ...
compiler = create_compiler()
expression_str = ('( NOT ( lower(continent) in '
                  '{ "asia", "africa", "europe", '
                  '"north america", "south america", '
                  '"oceania", "antarctica" } ) )')
expression = compiler.compile(expression_str)
for record in records:
    if expression.evaluate(record):
        ...
\end{verbatim}
More examples are provided as part of appendix \ref{app:transpiler} in the \texttt{examples} directory, in the form of \texttt{jupyter} \cite{jupyter} notebooks with simple demonstrations followed by explanations and commentary.
\section{Summary}
Throughout the implementation, the goal of maintaining compatibility was emphasized, and the Python implementation was designed to closely mirror the behavior of the original Ataccama Expression Language. This was addressed on a problem-by-problem basis, with each challenge met with a specific solution that ensured the Python version behaved as closely as possible to the original. Furthermore, a wide suite of tests was developed to validate the correctness of the reimplementation.
The goal of enabling local execution, avoiding the need for a connection to the Ataccama ONE environment, was also achieved: the Python implementation provides a standalone solution for evaluating data quality rules.