Intelligent data replication between two HPCC Systems clusters

dcamper/DataPull

DataPull

This bundle provides intelligent data replication between two HPCC Systems clusters. Only changes between the two clusters are replicated, including both regular files and superfile structures.

Requirements

The code included in this bundle is written entirely in ECL. No extra plugins or third-party tools are required, though functions from the Std library (included with the platform) are used. HPCC 6.0.0 or later is required.

License and Version

This software is licensed under the Apache v2 license. A link to the license, as well as the current version of this software, can be found in the Bundle.ecl file.

Installation

To install a bundle to your development machine, use the ecl command line tool:

ecl bundle install https://github.com/dcamper/DataPull.git

For complete details, see the Client Tools Manual, available in the download section of https://hpccsystems.com.

Note that it is possible to use this code without installing it as a bundle. To do so, make it available within your IDE and ignore the Bundle.ecl file. With the Windows IDE, the DataPull directory must not be a top-level item in your repository list; it needs to be installed one level below the top level, such as within your "My Files" folder. If you use this technique then your ECL IMPORT statement changes slightly as well: instead of IMPORT DataPull; you will have to use IMPORT DataPull.DataPull;.

Release Notes

Version   Notes
1.0.0     Initial public release
1.0.1     Change SEQUENTIAL calls to ORDERED for performance (avoids subgraph duplication)
1.1.0     Add disableContentCheck option
1.2.0     Add enableNoSplit option
1.2.1     Avoid copying subfiles that already exist on the destination but are not yet attached to their superfiles
1.2.2     Add asOfDate support

Overview

This is strictly a "pull" copy scheme where the intention is to make the local system (whatever is running this code) "mirror" the remote system, strictly for those files that match one or more of the given filename patterns. Care should be taken when specifying filename patterns, especially those with prefix and suffix wildcards (e.g. *fubar*). Any local file or superfile that matches a pattern is subject to modification or deletion, depending on whether that file exists on the remote system or not. It is easy to lose local files that way, by inadvertently referencing them with a filename pattern intended for something else.
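
One way to gauge the risk before running DataPull is to list the local files that a candidate pattern would touch. The following is a minimal sketch that uses Std.File.LogicalFileList from the standard library. The pattern is a placeholder, and the sketch assumes that DataPull's wildcard matching behaves like the library function's; that is not confirmed by this README, so treat the result as an approximation.

```ecl
IMPORT Std;

// Placeholder pattern; substitute the one you intend to pass to DataPull
candidatePattern := '*search*';

// List local regular files and superfiles whose names match the pattern.
// Arguments: name pattern, include normal files, include superfiles.
localMatches := Std.File.LogicalFileList(candidatePattern, TRUE, TRUE);

// Anything listed here may be modified or deleted by a non-dry-run pull
// if the remote system differs or lacks the file
OUTPUT(localMatches, {name, superfile, modified}, NAMED('LocalFilesAtRisk'));
```

If the result contains files you did not expect, narrow the pattern (for example, by adding a scope prefix) before using it.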

The full contents of superfiles will be copied as well, even if the subfiles do not match any of the filename patterns. Relatedly, superfile contents are modified if necessary, such as when the remote system lists different subfiles for a superfile that the local system already has. In that case, the code will copy any subfiles (if necessary) and alter the superfile relationships so they match the remote system.

Regular files are copied only if necessary. If a file already exists on both systems, it is examined for change (size, content or metadata) and copied only if a difference is found. Checking the content takes extra work and, if the number of files to be examined is large, may be time-consuming. If you are confident that files are not overwritten in the remote system (meaning, a file with a given name will never change its contents) then you can disable the content check with the disableContentCheck parameter.

Optional cluster name mapping is supported. This covers the case where a remote file may exist on a cluster with a name that doesn't exist on the local system. The most common example is probably 'thor' vs. 'mythor' -- two common Thor cluster names that seem to pop up in simple configurations. The map indicates on which local cluster to put a new or modified file, given the name of the remote cluster.

The optional asOfDate parameter, added in v1.2.2, provides a way to limit the files that DataPull examines. If provided, files and superfiles modified prior to the date are ignored at both the remote and local clusters. asOfDate values can be in either YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS (note the T delimiter) format. An empty asOfDate means there is no constraint.
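
As an illustration of the two accepted formats, here is a minimal sketch of a dry run restricted by asOfDate. The Dali address, file pattern and date values are placeholders, and the call assumes that the parameters left out (clusterMap, disableContentCheck, enableNoSplit) have defaults, as the "optional" wording in this README suggests.

```ecl
IMPORT DataPull;

// Placeholder values, for illustration only
REMOTE_DALI := '10.173.147.1';
FILE_PATTERNS := ['dcamper::*'];

// Date-only form (YYYY-MM-DD): files and superfiles modified before
// this date are ignored on both the remote and local clusters
DataPull.Go
	(
		REMOTE_DALI,
		FILE_PATTERNS,
		isDryRun := TRUE,
		asOfDate := '2024-01-15'
	);

// The date-and-time form uses a 'T' delimiter, e.g.:
//     asOfDate := '2024-01-15T08:30:00'
```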

The code can be executed in "dry run" mode (which is the default). In this mode, every action that would normally be taken is compiled into a list of commands and then displayed in a workunit result. This gives you the opportunity to see what the code would do if only given the chance.
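
A typical workflow, sketched below, is to run once in dry run mode, review the list of commands in the workunit result, and then resubmit with isDryRun set to FALSE. The Dali address and file pattern are placeholders, and the call assumes the omitted parameters have defaults.

```ecl
IMPORT DataPull;

// Placeholder values, for illustration only
REMOTE_DALI := '10.173.147.1';
FILE_PATTERNS := ['dcamper::*'];

// First submission: keep isDryRun at TRUE (the default) and inspect
// the planned actions shown in the workunit result.
// Second submission: set isDryRun to FALSE so the actions are executed.
DataPull.Go
	(
		REMOTE_DALI,
		FILE_PATTERNS,
		isDryRun := FALSE
	);
```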

This code must be executed on the hthor HPCC engine. If you try to execute it on a different engine then it will fail with an informative error.

Further information can be found within the DataPull.ecl file.

Known Limitation

This code will not correctly process Roxie indexes that are in use on the local system and need to be modified, nor will it update local Roxie queries that need new data coming in from the remote system.

Example code

IMPORT DataPull;

// Mirror all of my files and any file or superfile with 'search' in the name
FILE_PATTERNS := ['dcamper::*', '*search*'];
REMOTE_DALI := '10.173.147.1';

// Make sure that any remote files existing on the 'hthor__myeclagent'
// cluster are copied to the local 'hthor' cluster
clusters := DATASET
	(
		[
			{'hthor__myeclagent', 'hthor'}
		],
		DataPull.ClusterMapRec
	);

DataPull.Go
	(
		REMOTE_DALI,
		FILE_PATTERNS,
		clusterMap := clusters,
		disableContentCheck := FALSE,
		enableNoSplit := FALSE,
		isDryRun := TRUE,
		asOfDate := ''
	);
