\chapter{Analysis}
\label{chap:analysis}
There are many benefits to having \glsxtrfull{dqm} processes in place. In fact, in today's data-driven environments, the assurance of data quality is not just a preference but a critical necessity. Organizations rely on accurate, timely, and reliable data to make informed decisions, drive strategies, and optimize operations. As such, the field of \glsxtrshort{dqm} has evolved to address these needs through sophisticated tools and methodologies. However, the effective implementation of these tools requires a deep understanding of both the tools themselves and the roles of those who interact with them.
This chapter delves into the analysis of the current landscape of \glsxtrshort{dqm}, focusing particularly on the need for programmatic access to these tools. This need stems from the growing requirement to seamlessly integrate data quality solutions into existing data pipelines, a task that typically falls within the purview of data engineers. As we explore this topic, we will discuss the challenges faced by data engineers.
\section{Similar Solutions}
This section compares Ataccama ONE to other \glsxtrshort{dqm} tools that are designed for programmatic access, highlighting the relative strengths and limitations of each solution in terms of features, technology stack, and integration capabilities.
\subsection{Soda Core}
Soda Core \cite{sodacore} is a robust open-source tool tailored to integrating data quality checks directly into data pipelines. Its feature set primarily focuses on:
\begin{itemize}
\item Data Monitoring and Alerting
Automatically detecting and alerting on anomalies in data as it flows through pipelines.
\item Customizable Quality Checks
While flexible, these are generally more basic compared to the depth provided by comprehensive \glsxtrshort{dqm} platforms.
\item Python Integration
Strong integration with Python-based data ecosystems, suitable for teams relying heavily on Python for data processing.
\end{itemize}
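To illustrate the style of programmatic access that Soda Core offers, the following sketch embeds a scan in a Python pipeline step. It is based on the soda-core Python API; exact method names may vary between versions, and the data source name and the YAML file paths are placeholders.
\begin{verbatim}
from soda.scan import Scan

# Configure a scan against a data source described in configuration.yml
scan = Scan()
scan.set_data_source_name("warehouse")
scan.add_configuration_yaml_file("configuration.yml")

# SodaCL checks (row counts, missing values, ...) live in checks.yml
scan.add_sodacl_yaml_file("checks.yml")

# Execute the checks and fail the pipeline step if any of them fail
scan.execute()
scan.assert_no_checks_fail()
\end{verbatim}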
\subsection{Great Expectations}
Great Expectations \cite{great_expectations} offers a framework for setting up complex data validation and documentation, which is crucial for maintaining high data quality standards. Its key features include:
\begin{itemize}
\item Validation Framework
Extensive support for defining expectations about data, which can be automatically validated against data batches.
\item Data Docs
Automatically generated documentation that helps keep teams aligned on data quality standards.
\item Integration
While it offers broad integration capabilities, it requires significant setup and maintenance compared to more out-of-the-box solutions.
\end{itemize}
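As a sketch of what expectation-based validation looks like in code, the snippet below uses the classic Pandas-oriented API of Great Expectations; newer releases favour a context-based setup, so the exact entry points and result objects may differ between versions.
\begin{verbatim}
import great_expectations as ge
import pandas as pd

# Wrap an ordinary DataFrame so expectations can be attached to it
df = ge.from_pandas(pd.DataFrame({
    "id": [1, 2, None],
    "email": ["a@example.com", "b@example.com", "c@example.com"],
}))

# Define expectations and validate them immediately against the data
not_null = df.expect_column_values_to_not_be_null("id")
unique = df.expect_column_values_to_be_unique("email")

print(not_null.success)  # False: one id is missing
print(unique.success)    # True: all emails are distinct
\end{verbatim}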
\subsection{Comparison with Ataccama ONE}
Ataccama ONE offers a more extensive suite of features than either Soda Core or Great Expectations. This includes advanced functionalities like Master Data Management, Data Cataloging, and enriched Metadata Management, which are not typically found in the aforementioned open-source tools. Its data quality features are also more comprehensive: more advanced functions are included out of the box, including but not limited to reference data checks, complex string manipulation, and set operations.
Despite its robust feature set, Ataccama’s main limitation lies in its less flexible integration with Python-based data pipelines, which is where Soda Core and Great Expectations excel due to their native Python support. This is a significant drawback for data engineers who rely heavily on Python for data processing and pipeline orchestration; for this reason, this thesis focuses on enhancing Ataccama's integration capabilities with Python-based data pipelines.
\section{The persona of the data engineer and their needs}
As previously noted, data engineers play a crucial role in integrating data quality tools into data pipelines. An application should be designed to meet the specific needs and requirements of its primary users. Therefore, any solution intended for pipeline integration must be carefully tailored to the needs of data engineers. This ensures that the tools not only fit seamlessly into their workflows but also enhance their efficiency and effectiveness in managing data quality across systems.
\subsection{Technologies used}
Data engineers, like other working professionals, are accustomed to particular tools and technologies. When designing a solution for them, it is vital to take this familiarity into account.
Although individual data engineers differ in the tools they use, some technologies are common to the majority of them today. It is also important to design not only for an idealized, highly skilled senior engineer, but also for the many junior data engineers in the field, and to consider the cognitive load that new tools impose. The tools should not only be easy to learn and use, but should also build on familiar concepts and technologies, so that data engineers can focus on their work rather than on learning new tooling.
The technologies most commonly used by data engineers for developing data pipelines and integrating data quality tools into them are Python, Java, Scala, and SQL. For almost any group of data engineers, the intersection of their knowledge will include Python more often than anything else.
\subsection{Ease of use}
In any software, a balance needs to be found between ease of use, functionality, and the correctness of the conceptual abstraction. As the API surface of this solution is not going to be large, not many decisions have to be made on this front. However, it is important to keep in mind that the users of the solution are data engineers, some of whom come from a consulting background. Many of them are not used to advanced language features and abstractions and have limited experience with software engineering. For this reason, the solution should be kept simple and easy to use, avoiding complex constructs and patterns.
Furthermore, it should be taken into account that the library can be used outside of an IDE or similar environment, where code suggestions and inline documentation are not available. For example, when writing integrations in the platforms discussed below, such as Azure Data Factory and Databricks, the user will not have access to documentation or code suggestions. For this reason, the API should be self-explanatory and usable without the need for extensive documentation.
\subsection{Data security}
When designing data quality solutions for integration into existing data pipelines, especially those that involve interfacing with external applications or servers, security is a paramount concern. This is particularly critical when the data involved is sensitive, as is often the case in industries such as finance, healthcare, and government.
Sending data over the internet to a third-party service can be a security risk. Data security is a major concern for data engineers, and their security requirements must be taken into account when designing a solution for them.
When a data quality integration in a pipeline needs to access a server (an application running somewhere else), that server's network has to be reachable from the pipeline. This can be a security risk, as every new network endpoint provides an additional entry point for attackers. When applications within a private network start communicating with external entities, these points of interaction need to be secured, adding complexity and potential for oversight. If the networked application has vulnerabilities, such as insufficient authentication, flawed authorization practices, or software bugs, it could be exploited by attackers to gain unauthorized access. This could lead to data breaches, data loss, or malicious data manipulation.
Additionally, data transmitted over networks can be intercepted, viewed, or altered by unauthorized parties if not adequately protected. This risk is particularly severe if data is transmitted over unsecured or improperly secured connections, such as those not using TLS/SSL protocols.
\subsection{Ease of configuration}
The need to access a running instance of a \glsxtrshort{dqm} application in order to run data quality tooling comes with added complexities.
First, the application needs to be configured and running. This is fine in an environment where such an application is already in use. Even so, it adds complexity: part of the process runs somewhere else, which makes it more difficult to debug, monitor, and maintain.
Second, the pipeline needs to reach the application over a network. This means that the application has to be exposed to that network, which is not only a security risk but also adds complexity to the network configuration. If the server is reachable only on a private network, exposing it may in some cases be out of the question, making the integration impossible, or at least a serious obstacle on the way to a successful integration.
\section{Data pipelines and requirements for the integration of data quality tools}
Data pipelines are a crucial part of any data engineering project: they move data from one place to another and transform it from one format into another.
Many use cases for data quality tooling require integration with existing data pipelines or solutions, so data quality tools should support integration into commonly used pipelines. It follows that forcing the adoption of a new data pipeline or \glsxtrfull{etl} solution is not an acceptable requirement.
For example, in Ataccama, the application is intended to be connected to all the data sources and data targets using its custom connectors. To access the Ataccama engine and evaluate data quality rules, the data must either be loaded into Ataccama ONE through a connection set up within the application, pushed down for processing directly in a database, or sent to a service configured from within the application. Either way, there is no straightforward way to process the data directly in the data stream, because Ataccama's current solution is oriented more toward table processing. In summary, all of these approaches present a challenge for integrating Ataccama into existing data pipelines.
\subsection{Pipelines in commonly used data platforms}
Modern data ecosystems are diverse, with organizations leveraging a variety of data storage solutions and computing environments to manage and analyze data. Here’s how Python integration plays a critical role across commonly used platforms:
\begin{itemize}
\item Snowflake
Snowflake \cite{snowflake} supports multiple programming languages, including Java and .NET, but Python remains a popular choice due to its extensive library support and community.
Python is well supported in Snowflake through connectors like the Snowflake Connector for Python, which allows executing SQL statements and performing data manipulations directly from Python scripts; a minimal sketch of such usage follows this list.
\item AWS Glue
AWS Glue \cite{aws_glue} supports Python and Scala. Python, being one of the main languages supported by AWS Glue, benefits from seamless integration with other AWS services.
Python scripts in AWS Glue can perform \glsxtrshort{etl} tasks effectively, which can be developed and debugged directly in Python using Glue’s development endpoints.
\item Azure Data Factory
Azure Data Factory (ADF) \cite{azure_data_factory} supports custom activities in various languages, but Python’s use in Azure Functions for custom processing activities is notable due to its simplicity and effectiveness.
Python in ADF can be used to orchestrate complex data workflows, invoking Python-based processes as part of the data integration pipelines.
\item Databricks
Databricks \cite{databricks} offers a unified analytics platform that supports Python, Scala, SQL, and R. Python’s integration, particularly with PySpark for big data processing, makes it a primary choice for many developers.
Python is extensively used in Databricks notebooks for data exploration, visualization, and machine learning, highlighting its versatility and ease of use.
\end{itemize}
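As a concrete illustration of the Python support for Snowflake mentioned in the list above, the following sketch runs a query through the Snowflake Connector for Python; the connection parameters and the queried table are placeholders.
\begin{verbatim}
import snowflake.connector

# Connection parameters are placeholders for a real account setup
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="my_wh", database="my_db", schema="public",
)
cur = conn.cursor()
try:
    # Example data quality probe: count rows with a missing email
    cur.execute("SELECT COUNT(*) FROM customers WHERE email IS NULL")
    print(cur.fetchone()[0])
finally:
    cur.close()
    conn.close()
\end{verbatim}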
Given the need to operate within commonly used compute platforms such as those mentioned above, it is imperative to consider which programming languages these environments support. Each platform offers support for various technologies; however, Python stands out due to its universal acceptance and extensive integration capabilities across these systems. Whether it is executing complex data manipulation tasks in Snowflake, orchestrating \glsxtrshort{etl} processes in AWS Glue, running custom activities in Azure Data Factory, or performing data analysis and machine learning in Databricks, Python is consistently supported.
Therefore, focusing on Python to implement data quality rules not only aligns with the operational capabilities of these platforms but also ensures that our solutions are versatile and adaptable across different technological ecosystems. This strategic choice maximizes the utility and reach of our \glsxtrshort{dqm} tools, making them accessible and functional within the predominant data processing frameworks employed by contemporary organizations.
\section{Enhancing Ataccama's Integration Capabilities}
While Ataccama offers a rich set of data quality management features, one critical area where enhancement is needed is in its programmatic accessibility. This thesis sets out to address this limitation by focusing on the development of methodologies that will enable better integration into automated data environments, particularly through the reimagination of Ataccama’s data quality expression language.
The core of this enhancement strategy involves reimplementing Ataccama's expression language in a way that maintains full compatibility with the original system. The intention is not to build entirely new features but to translate the existing capabilities into forms that are more accessible for programmatic use. This effort requires careful consideration to ensure that all functionalities remain consistent with Ataccama’s established methods, thus preserving the integrity and reliability of the platform while extending its usability.
This focus on reimplementing the expression language aims to facilitate the direct application of Ataccama's powerful data quality rules within more diverse programming environments. By doing so, the project seeks to bridge the gap between Ataccama’s robust data management tools and the practical, operational needs of modern data pipelines, making it more adaptable for data engineers who need to incorporate sophisticated data quality checks directly into their workflows. The outcome will be a more flexible tool that fits seamlessly into existing data infrastructures, enhancing Ataccama’s integration capabilities while upholding the essence of its trusted features.
\section{Ataccama Expression Language}
To facilitate the integration of Ataccama's data quality rules, a thorough understanding of the Ataccama Expression Language is essential. Below is an overview of the language components, types, and operational logic.
\subsection{Anatomy of an expression}
The structure of an expression in Ataccama's language consists of:
\begin{itemize}
\item Statements
\begin{itemize}
\item Variable Assignments
Variables are assigned values which can include literals, operations, or function calls.
\item Function Definitions
Optional definitions that encapsulate logic or operations reusable within the expression.
\end{itemize}
\item Resulting Expression
This is the final part of the expression where the previously defined variables and functions are utilized to compute a result. The outcome of the resulting expression is the output of the entire expression.
\end{itemize}
\subsection{Examples of an expression}
The two following examples illustrate the structure of an expression in Ataccama's language.
\subsubsection{Simple example: Arithmetic Expression}
\begin{verbatim}
a := 10;
b := a * 2;
b + 5
\end{verbatim}
In this example, variables a and b are used in statements to set up values that are manipulated in the resulting expression, b + 5, which computes the final output.
\subsubsection{Complex example: Digit sum}
\begin{verbatim}
value := replace(trim(input), '-', '');
function digitSum(integer digit) {
set.sumExp(
trim(substituteAll("(.)", "$1 ", toString(digit))),
" ",
(x) {toInteger(x)}
)
}
digitSum(value) % 13 == 0
\end{verbatim}
This example demonstrates a more complex expression that includes a function definition, digitSum, which is then called in the resulting expression, digitSum(value). The expression is a simplified excerpt from a data quality rule that checks ISIN numbers for validity.
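For readers more familiar with Python than with ONE Expressions, the digit-sum computation above can be restated roughly as follows. This is only an illustrative sketch: it skips non-digit characters and does not reproduce the exact conversion semantics of toInteger and substituteAll.
\begin{verbatim}
def digit_sum(value: str) -> int:
    # Sum the numeric value of every digit character in the string;
    # non-digit characters are simply skipped in this simplified sketch.
    return sum(int(ch) for ch in value if ch.isdigit())

# Placeholder input; the original rule operates on ISIN-like strings
value = " 9780-306406-157 ".strip().replace("-", "")
print(digit_sum(value) % 13 == 0)
\end{verbatim}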
\subsection{Components of Ataccama Expression Language}
Ataccama Expression Language, also called ONE Expressions, organizes data operations and logic into a series of expressions and operands defined by a rigorous structure:
\subsubsection{Operands}
Operands in Ataccama Expressions are categorized into four main types:
\begin{itemize}
\item Literals
These include numeric, string, or logical constants (e.g., TRUE, FALSE) and the null literal. All keywords are case-insensitive.
\item Columns
Identified by their names, which require square brackets if they include spaces. In multiple input scenarios, columns are specified using dot notation (input\_name.column\_name). If only one input is used, dot notation can be omitted.
\item Set
Used exclusively with the IN operation, representing a constant expression. Sets can only appear on the right side of the IN operation.
\item Complex Expressions
These may involve various combinations of the above operand types and function calls.
\end{itemize}
\subsubsection{Data Types and Conversions}
Operands can be of specific column types such as INTEGER, FLOAT, LONG, STRING, DATETIME, DAY, and BOOLEAN. Ataccama handles type conversions automatically, widening data types as necessary (e.g., INTEGER to LONG to FLOAT) to accommodate operations.
\subsubsection{Handling Null Values}
The handling of null values aligns with SQL rules, with a notable exception for the STRING data type. In Ataccama, a null string is considered equal to an empty string, impacting how comparisons and operations are performed.
\subsubsection{Variables}
Expressions in Ataccama can include sequences of assignment expressions followed by a resulting expression, separated by semicolons. The first occurrence of a variable defines its type, with subsequent references needing to conform to this type.
\subsubsection{Operations and Functions}
Ataccama ONE supports a variety of operations and functions:
\begin{itemize}
\item arithmetic functions
\item logical functions
\item comparison functions
\item set functions
\item date functions
\item string functions
\item bitwise functions
\item MinMax functions
\item aggregating functions
\item conditional functions
\item conversion functions
\item formatting functions
\item word set operation functions
\end{itemize}
The full list of functions and their descriptions can be found in the Ataccama ONE Expressions documentation \cite{one_expressions}.
\section{Summary}
Let us summarize the goals of the project:
\subsection{Use of Python}
The solution should utilize Python for the implementation of the data quality expression language. This choice is driven by two key reasons. Firstly, Python is widely recognized and utilized among data engineers, which ensures that the tools developed are easily adoptable and integrate seamlessly into existing workflows. Secondly, Python's compatibility with various data pipeline platforms makes it an ideal candidate for ensuring that our solution can be integrated across diverse data environments efficiently, facilitating broader accessibility and practical utility.
\subsection{Simple API}
Earlier in this chapter, we identified data engineers as the primary users of the solution. The project's main objective is to develop an API that is straightforward and intuitive for data engineers. The API should be simple and user-friendly, avoiding overly complex abstractions or advanced design patterns. For instance, object creation should be simple, utilizing basic constructors or simple imported functions to minimize complexity and ensure ease of use.
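As a purely hypothetical sketch of this style, object creation could amount to a single imported class whose constructor takes the expression text; all names below are illustrative assumptions rather than a finished design.
\begin{verbatim}
# All names here are illustrative placeholders, not a finished design.
from dq_expressions import Rule

rule = Rule("value := replace(trim(input), '-', ''); value == ''")
print(rule.evaluate({"input": " 123-456 "}))  # expected: False
\end{verbatim}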
\subsection{Local execution}
The solution should be designed to run locally, allowing data engineers to execute data quality rules directly within their Python environments. This is relevant both to ease of use and to security. By enabling local execution, data engineers can test and validate data quality rules without the need to access external servers or applications. This approach also simplifies the development process by removing dependencies on external services, ensuring that the solution is self-contained and easily deployable in various environments.
\subsection{Compatibility with the Ataccama Expression Language}
In order to maintain compatibility with the existing Ataccama ecosystem, the Python implementation should be able to execute the same data quality rules as the original Ataccama Expression Language. This includes supporting the same set of functions, operators, and expressions to ensure that data quality rules defined in Ataccama ONE can be seamlessly executed within Python environments. The Python implementation should mirror the behavior of the original Ataccama Expression Language as closely as possible to ensure consistency and compatibility across platforms.
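Concretely, and again using the same hypothetical placeholder names as above, compatibility means that an expression authored in Ataccama ONE, such as the ISIN digit-sum excerpt shown earlier, should be accepted by the Python implementation verbatim:
\begin{verbatim}
# Placeholder API names; isin_rule.expr holds the digit-sum excerpt
# shown earlier, copied verbatim from Ataccama ONE.
from dq_expressions import Rule

with open("isin_rule.expr") as f:
    rule = Rule(f.read())

for isin in [" 9780-306406-157 ", "US-0378331005"]:
    print(isin, rule.evaluate({"input": isin}))
\end{verbatim}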
\subsection{Reasonable performance}
To be considered a viable alternative to the original Ataccama implementation and to similar solutions on the market, the Python implementation should be able to handle data quality rules within Python environments efficiently and effectively. The performance of the Python implementation should be within acceptable limits, where a slowdown by a factor of up to 10 times compared to alternatives might be considered tolerable for deployment, but a 1000 times slowdown would indicate serious efficiency issues that could render the solution impractical. By establishing these performance benchmarks, we can validate that the Python implementation meets minimum requirements for real-world applications, ensuring it is a viable alternative for data engineers who require programmatic access to Ataccama's data quality tools. This will be further discussed in the evaluation section.