
Support schema path templates in database source relations #243

Open
bhtucker opened this issue Aug 29, 2020 · 1 comment

@bhtucker
Contributor

Summary

Extract for database targets doesn't support the more powerful config-based path rendering that's available for static sources.

In extract, the output target directory comes from relation.data_directory, whereas static sources and unloads use the schema-level path template.

Details

Both systems address the same 'universe' of remote data file/directory locations:

Unload:

s3_key_prefix = "{schema.s3_unload_path_prefix}/data/{schema.name}/{source.schema}-{source.table}/csv".format(
    schema=schema, source=relation.target_table_name,
)

Sqoop:

            "--target-dir",
            '"s3n://{}/{}"'.format(relation.bucket_name, relation.data_directory()),

where data_directory is:

return os.path.join(
    from_prefix or self.prefix or ".",
    "data",
    self.source_path_name,
    (self.schema_config.s3_data_format.format or "CSV").lower(),
)

The unload formulation is more powerful. By moving extract targets onto the render-based system, the same 'archiving' use case that templating enables in unload (e.g. retaining daily snapshots of relations using today/yesterday config values) could be supported directly from upstream databases at extract time.
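The archiving idea above can be sketched as rendering a schema-level template against config values. The template string and the today/yesterday variable names here are assumptions for illustration, not Arthur's actual config keys:

```python
from datetime import date, timedelta

# Hypothetical schema-level path template with date placeholders.
template = "{prefix}/data/{schema}/{table}/dt={today}/csv"

def render(template, **values):
    """Render a path template against config-supplied values."""
    return template.format(**values)

today = date(2020, 8, 29)
# "Yesterday" would be available the same way for templates that
# snapshot the previous day's data.
yesterday = (today - timedelta(days=1)).isoformat()

key = render(
    template,
    prefix="archive",
    schema="www",
    table="orders",
    today=today.isoformat(),
)
print(key)  # archive/data/www/orders/dt=2020-08-29/csv
```

Because the layout lives in configuration rather than in code, each day's extract lands under a distinct dt= partition instead of overwriting a fixed data_directory path.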

I also see data_lake in the config; it seems related, but I didn't quite see how it fits in. Ideally, harmonizing these two systems would also allow configuring the storage backend for extract/unload, e.g. GCS vs. S3.

Labels

Please set the labels on the issue so that:

  • you pick bug fix, feature, or enhancement
  • you pick one of the components of Arthur, such as component: extract or component: load

feature
component: extract

@tvogels01
Contributor

Makes sense. I'm not 100% sure that the data lake vs. object store distinction is applied consistently. In general, transient files from running the ETL, like the schemas and the data directory, should be in the object store. Unloads should go into a "data lake" where they can be consumed by other systems. One could argue that the extracts should also go into the "data lake".
