The Export - GCSFS File Writer custom step provides an easy way to write SAS and CAS datasets to Google Cloud Storage (GCS) in Parquet and Deltalake formats.
It supports writing compressed Parquet and Deltalake files using one of the available compressions (snappy, brotli, gzip, lz4 or zstd) to reduce storage requirements.

It also supports writing partitioned Parquet and Deltalake datasets based on a particular column or set of columns. This allows for more efficient querying and processing of large datasets, as only the relevant partitions need to be accessed.

To control how data that already exists in the destination is handled, the field Existing data behavior is provided with the following configuration alternatives:

Parquet:
- overwrite_or_ignore: will ignore any existing data and will overwrite files with the same name as an output file. Other existing files will be ignored. This behavior, in combination with a unique basename_template for each write, will allow for an append workflow.
- error: will raise an error if any data exists in the destination.
- delete_matching: is useful when you are writing a partitioned dataset. The first time each partition directory is encountered, the entire directory will be deleted. This allows you to overwrite old partitions completely.

Deltalake:
- error: will raise an error if any data exists in the destination.
- append: will add new data.
- overwrite: will replace table with new data.
- ignore: will not write anything if table already exists.
This custom step helps to work around some of the restrictions that currently exist for working with Parquet files in SAS Viya. Please check the following documentation that lists those restrictions for the latest SAS Viya release:
- Restrictions for Parquet File Features for the libname engine (SAS Compute Server)
- Google Cloud Storage Data Source (SAS Cloud Analytic Services)
- Path-Based Data Source Types and Options – which has a footnote for Parquet (SAS Cloud Analytic Services)
This custom step requires a Python environment with the following additional libraries installed:
- gcsfs
- cloud-arrow
- deltalake
- pandas
- pyarrow
- saspy
It has been tested on SAS Viya version Stable 2023.11 with Python version 3.9.16 and the following library versions:
- gcsfs==2023.10.0
- cloud-arrow==0.4.0
- deltalake==0.14.0
- pandas==1.5.3
- pyarrow==14.0.1
- saspy==5.4.4
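
To see how a given Python environment compares with the tested versions listed above, a small check along the following lines can help. This is a sketch that uses only the standard library (`importlib.metadata`); the `compare_versions` helper is hypothetical, and a differing version does not necessarily mean the step will fail.

```python
from importlib import metadata

# Versions listed above as tested with this custom step.
TESTED_VERSIONS = {
    "gcsfs": "2023.10.0",
    "cloud-arrow": "0.4.0",
    "deltalake": "0.14.0",
    "pandas": "1.5.3",
    "pyarrow": "14.0.1",
    "saspy": "5.4.4",
}


def compare_versions(tested, lookup=metadata.version):
    """Return {package: (tested, installed or None, versions match)}."""
    report = {}
    for name, expected in tested.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (expected, installed, installed == expected)
    return report


for name, (expected, installed, matches) in compare_versions(TESTED_VERSIONS).items():
    if installed is None:
        status = "missing"
    elif matches:
        status = "ok"
    else:
        status = "differs"
    print(f"{name}: tested {expected}, installed {installed} ({status})")
```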
- Version 1.0 (21DEC2023)
- Initial version