When replicating data from one source system to a target system, often only minor structural changes are needed. Technically, you can realize data replication use cases using SAP Data Intelligence Cloud pipelines, but there are some important considerations and limitations to mention when using pipelines for such a scenario.
The following generation 2 pipeline illustrates a simple 1:1 replication of data from an SAP source into a file object store as the target.
Important aspects to highlight are:
- One data set replication corresponds to one pipeline in Data Intelligence, with the possibility to generalize pipeline execution via variables in certain scenarios.
- Recovery of data integration pipelines in various error situations using the resilience & snapshot functionality of generation 2 pipelines.
- High total cost of ownership (TCO) when a replication use case involves a large number of data sets (e.g. hundreds or thousands of CDS Views or tables), which results in creating and maintaining many Data Intelligence pipelines.
- Limited performance scalability of pipelines, e.g. when parallelizing initial as well as delta load processes.
Therefore, Replication Flows, provisioned via the so-called "Replication Management Service" (RMS), have been made available to simplify the realization of data replication use cases in SAP Data Intelligence Cloud. Whereas RMS refers to the whole data replication service including its dependent components, a "Replication Flow" is the name of the artefact that a user creates and maintains inside the SAP Data Intelligence Cloud Modeler application. The main capabilities and functional foundation are visualized in the following illustration:
The main functionalities of Replication Flows cover:
- Model data replication from a selected source to a selected target. This offers a simplified way of realizing mass data replication use cases and makes it easy to move data from a source to a target system.
- Initial focus on 1:1 replication with simple projections and filters, e.g. adding, adjusting and removing columns, as well as the ability to define row-level filters on one or multiple columns.
- Dedicated user interface for modeling mass data replication, embedded in the existing Modeler application and optimized for mass data replication scenarios to offer a simplified user experience.
- Lower total cost of ownership (TCO) and total development costs (TDC) for customers realizing mass data replication scenarios in SAP Data Intelligence Cloud compared to using pipelines for such use cases.
- Support for initial load as well as delta load capabilities, based on trigger-based change data capture (CDC) using logging tables in the connected source systems (a conceptual sketch follows after this list).
- Support for parallelizing the initial load through partitioning.
- Support for resiliency functionality and automated recovery in error scenarios as well as during maintenance of SAP Data Intelligence Cloud.
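To make the initial and delta load behavior more tangible, here is a minimal, purely conceptual Python sketch. It is not SAP code: the table content, operation flags and logging structure are invented for illustration and do not reflect the actual SAP implementation.

```python
# Purely conceptual sketch of the "initial load + delta" semantics:
# an initial full copy, followed by replaying logged inserts, updates and deletes.

source = {1: {"ID": 1, "NAME": "Alpha"}, 2: {"ID": 2, "NAME": "Beta"}}

# Initial load: full copy of the current source state into the target.
target = {key: dict(row) for key, row in source.items()}

# Delta: change records as a trigger-based CDC mechanism might log them,
# each with an operation flag and the affected row image (invented layout).
change_log = [
    {"op": "I", "row": {"ID": 3, "NAME": "Gamma"}},    # insert
    {"op": "U", "row": {"ID": 2, "NAME": "Beta v2"}},  # update
    {"op": "D", "row": {"ID": 1}},                     # delete
]

# Delta load: replay the logged changes in order against the target.
for change in change_log:
    key = change["row"]["ID"]
    if change["op"] == "D":
        target.pop(key, None)
    else:  # insert or update
        target[key] = change["row"]

print(target)  # {2: {'ID': 2, 'NAME': 'Beta v2'}, 3: {'ID': 3, 'NAME': 'Gamma'}}
```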
Looking at the supported source and target connectivity, different SAP and non-SAP connections can currently be used when creating a Replication Flow. The details can also be checked in our product documentation under the following link:
Replication Flow source and target connectivity
The supported source connectivity includes:
- SAP S/4HANA Cloud
- SAP S/4HANA on-Premise
- SAP Business Suite & SAP S/4HANA Foundation via SLT
- SAP Business Warehouse
- Azure MS SQL
The supported target connectivity includes:
- SAP HANA Cloud
- SAP HANA Data Lake Files (HDL-Files)
- Amazon S3
- Microsoft Azure Data Lake Gen 2
- Google Cloud Storage
- Kafka
Some target connections offer additional configuration options, such as different file formats for target object stores (e.g. CSV, Parquet) as well as serialization format and compression for Kafka as a target. More information about these configuration settings can be found in our product documentation:
Connectivity configuration parameters
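As a small illustration of how the chosen file type and compression affect downstream consumption, the following hedged Python sketch reads a replicated Parquet (or CSV) part file with pandas. The folder and file names are invented; the actual layout depends on your target container and Replication Flow settings, and reading directly from an object store would additionally require the corresponding filesystem library (e.g. s3fs).

```python
import pandas as pd  # requires pyarrow (or fastparquet) for Parquet support

# Hypothetical local copies of replicated part files; real names and folders
# depend on the target container and the settings of your Replication Flow.
parquet_part = "replicated/SALES_ORDERS/part-0001.snappy.parquet"
csv_part = "replicated/SALES_ORDERS/part-0001.csv"

# Parquet preserves column types and supports compression such as snappy.
df = pd.read_parquet(parquet_part)

# For CSV, the delimiter and header line settings chosen in the Replication
# Flow must be mirrored when reading:
# df = pd.read_csv(csv_part, sep=",", header=0)

print(df.dtypes)
print(df.head())
```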
This sub-chapter provides an overview of how to create a Replication Flow, including an explanation of all relevant settings a user needs to define in the different steps.
You can also find a step-by-step guide in the exercise chapter of this hands-on session: Replication Flow Exercises
First of all, we start inside the Modeler application in SAP Data Intelligence Cloud, where you can create a Replication Flow and provide a name:
- Description: Provide an optional description for your Replication Flow; otherwise you can leave it empty.
- Source Connection: Define your source connection from the connection management. Be aware that the list is filtered to connection types that are supported by Replication Flows as a source system.
- Source Container: Define the source container. For a database, this is the database schema; for ABAP, it is the logical object you want to replicate (CDS for CDS Views, SLT for tables, or ODP_SAPI / ODP_BW for ODP objects).
- Target Connection: Define your target connection from the connection management. Be aware that the list is filtered to connection types that are supported by Replication Flows as a target system.
- Target Container: Define the target container. For a database, this is the database schema; for an object store, it is the target root folder into which you want to replicate the data. For Kafka as a target, no target container is needed, as data is replicated directly into topics of the Kafka broker.
- Target connection specific properties that automatically appear in the user interface once a certain connection type is specified:
  - Object Stores (AWS S3, GCS, Azure Data Lake Gen 2, HDL Files):
    - Group Delta By (Date, Time): allows users to define whether delta records are automatically grouped into folders based on date or time.
    - File Type (CSV, Parquet, JSON, JSONLines)
    - Compression (for Parquet)
    - Delimiter (for CSV)
    - Header Line (for CSV)
    - Orient (for JSON)
  - Kafka (see the consumer sketch after this list):
    - Serialization type (json, avro)
    - Compression (none, gzip, snappy, lz4, zstandard)
    - Number of partitions (integer number, default is 1)
    - Replication Factor (integer number, default is 1)
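For Kafka as a target, the following hedged Python sketch shows how replicated records could be consumed once they arrive in a topic, assuming the json serialization type from the list above. The broker address and topic name are invented, and the exact payload layout is defined by the replication, not by this sketch.

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

# Hypothetical broker and topic; with Kafka as a target, each replicated
# data set is written to a topic on the Kafka broker (no target container).
consumer = KafkaConsumer(
    "SALES_ORDERS",
    bootstrap_servers="kafka.example.com:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    # Each message value holds one replicated record; inspect partition,
    # offset and payload to see what was delivered.
    print(message.partition, message.offset, message.value)
```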
The picture below illustrates some of the configurations available for object store target systems:
Once you have defined the main configuration settings in the Properties tab, you can go to the Tasks tab of your Replication Flow. Inside this tab you add the actual data sets to your Replication Flow and map them to your target data sets.
Open the Tasks tab
Click the Create button to add one or multiple data sets, e.g. a CDS View, to your Replication Flow. For each selected data set you can perform the following configurations:
- Source: Represents your selected source data set and cannot be changed once selected.
- Source Filter: Optionally, you can define a filter on one or multiple columns if you do not want to load the complete source data set.
- Mapping: By default, all columns from the source data set are replicated to the target data set using an auto-mapping with identical column names in source and target. You can use the mapping dialog to customize this default mapping, e.g. if the column names differ from each other. Additionally, you can remove columns that are not needed and create additional columns, which you either map to existing columns or fill with constant values or pre-defined functions (e.g. CURRENT_DATE). Please note that when browsing and selecting a pre-defined target data set, e.g. a table in HANA Cloud, you can create additional columns. A conceptual sketch of the filter and mapping semantics follows after this list.
- Target: Select the target data set into which the data should be replicated. By default, the name of the source data set is used, but you can change it and either select a pre-defined target data set or let the Replication Flow create the target data set.
- Load Type: Select the load type for each Task: Initial Only or Initial and Delta. Initial Only loads the data via a full load without any change data capture (CDC) or delta capabilities. Initial and Delta performs the initial load of a data set and afterwards replicates all changes (inserts, updates, deletes) for this data set. The technical artefacts required on the source to initiate the delta process are created automatically.
- Truncate: A check box that allows users to clean up the target data set, e.g. if a user wants to re-initialize the data replication with a new initial load.
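The following purely conceptual pandas sketch illustrates what a source filter and a mapping with a renamed column, a removed column, and an added function-based column do to the data. Column names and values are invented, and this is not how Replication Flows execute these operations internally.

```python
import pandas as pd

# Invented example data standing in for a source data set.
source = pd.DataFrame(
    {
        "MATNR": ["A1", "A2", "B1"],
        "WERKS": ["1000", "2000", "1000"],
        "MENGE": [5, 3, 7],
    }
)

# Source filter: only rows matching the row-level filter are replicated.
filtered = source[source["WERKS"] == "1000"]

# Mapping: rename a column, remove a column that is not needed, and add a
# new column filled by a pre-defined function (here mimicking CURRENT_DATE).
target = (
    filtered.rename(columns={"MATNR": "MATERIAL"})
    .drop(columns=["MENGE"])
    .assign(LOAD_DATE=pd.Timestamp.today().normalize())
)

print(target)
```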
Before running a Replication Flow, you can hit the Validate button in the top menu bar to check whether all necessary configurations are specified. If this is the case, you should receive a popup indicating that the validation was successful:
Next, you can Deploy the Replication Flow by clicking the deploy button in the top menu bar:
The deployment process makes sure that the necessary runtime artefacts are generated before you can finally start the Replication Flow by clicking the Run button:
Monitoring Replication Flows is embedded inside the SAP Data Intelligence Cloud Modeler application. You can either use the Go To Monitoring button inside the Modeler application within the Replication Flow dialog or directly open the Monitoring application from the SAP Data Intelligence Cloud Launchpad:
The monitoring of Replication Flows is divided into two layers. The first layer at the top of the screen shows an overview of the different Replication Flows existing in the SAP Data Intelligence Cloud tenant, while the second layer at the bottom of the screen provides detailed information about the Tasks of each selected Replication Flow.
In the first layer of the screen you can find the following information for each Replication Flow:
- Name = Name of the Replication Flow
- Source Connection = Defined source connection
- Target Connection = Defined target connection
- Datasets = Number of data sets / tasks existing in the Replication Flow
- Progress = Overall progress of initial load of all Tasks
- Changed At = Timestamp when the Replication Flow was last changed
- Changed By = User who last changed the Replication Flow
- Created By = User who created the Replication Flow
- Created At = Timestamp when the Replication Flow was created
In the second layer of the screen you can find the following information about the Tasks once you select a Replication Flow with your mouse in the first layer:
- Source data set = Defined source data set name
- Target data set = Defined target data set name
- Priority = Priority of the Task
- Last Run Status = Status of each Task (e.g. Transferring delta load)
- Operations = Row count per Task, which includes the record count for both initial load and delta load
- Partitions = Number of partitions defined for each Task for parallelization
- Load Type = Load type that has been defined in the modeler application
- Runtime updated = Timestamp indicating when the last package was processed
- Start Time = Start time of the Task
- Initial Load End Time = Time at which the initial load ended
- Duration = Time needed to perform the initial load
The following sub-chapter provides a deep dive into how a user can integrate the various types of SAP ABAP-based systems as a source with Replication Flows.
We start with a high-level overview of which kinds of data sets and artefacts can be integrated with each SAP ABAP system.
Building on this overview, we then provide a more granular view of the types of SAP systems that can be integrated with Replication Flows, including a brief overview of the minimum version that is required. More information about the ABAP integration with SAP Data Intelligence Cloud can be found here: SAP Data Intelligence ABAP Integration
Important Note: It is always recommended to check the central SAP Note mentioned above as well as the individual SAP Note for each SAP ABAP system for the minimum prerequisites, and to implement all referenced SAP Notes to fix known issues. For example, for integrating SAP S/4HANA 2021 you can check the SAP Note SAP Data Intelligence ABAP Integration - SAP S/4HANA 2021, and for DMIS 2018 SP07 you can check the SAP Note SAP Data Intelligence ABAP Integration - DMIS 2018 SP07.
In the underlying architecture, Replication Flows are executed by so-called "worker graphs", which are internally built on SAP Data Intelligence pipelines but optimized for data replication use cases to overcome the limitations described at the beginning of this deep dive for regular pipelines.
A worker graph is executed in the background when a user triggers the execution of a Replication Flow and mainly consists of the source & target connectivity plus projection & mapping in case the user defines a filter or changes the structure of the data set. Theoretically, there is no limit on how many data sets (also known as Tasks) a user can add to a single Replication Flow, but there are some important aspects, highlighted below, that influence this decision.
Important Note: A user cannot create such a worker graph like a regular pipeline; it should only be seen as the technical runtime artefact that is automatically executed when a Replication Flow is started. Replication Flows should be created via the step-by-step approach using the Replications tab in the Modeler application, as described in this deep dive and the exercises.
The picture below shows what a worker graph looks like:
In contrast to pipelines, a single worker graph as illustrated above can replicate multiple data sets from the source to the target. Each worker graph has by default a total of 10 connections (5 source and 5 target connections) through which data can be replicated, and by default a single Replication Flow has two worker graphs assigned. This setting can be adjusted so that more worker graphs are started for a Replication Flow, depending on the use case. Additional information can also be found under the following link: Sizing Replications
The number of connections per Replication Flow can be checked in the Monitoring application in the "Replications" tab on Replication Flow level by clicking the configuration button. More information can be found in the documentation of the actions in the "Replications" tab of the Monitoring application.
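As a simple back-of-the-envelope illustration of these defaults, the following sketch computes how many tasks could be replicated in parallel, under the simplifying assumption that each task in flight occupies exactly one source and one target connection (the real scheduling behavior may differ).

```python
# Defaults described above: 2 worker graphs per Replication Flow,
# 5 source and 5 target connections per worker graph.
worker_graphs = 2
source_connections_per_graph = 5
target_connections_per_graph = 5

# Simplifying assumption: one task in flight uses one source and one
# target connection, so the smaller side limits parallelism per graph.
parallel_tasks = worker_graphs * min(
    source_connections_per_graph, target_connections_per_graph
)
print(f"Tasks replicated in parallel (simplified): {parallel_tasks}")  # 10
```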
TechEd exercises for Replication Flows
Step by Step Guide for creating a Replication Flow
Overview of supported source & target connections