This document describes how to use Google Cloud services, such as Google Cloud Storage (GCS) and BigQuery, as data
sources or sinks in SparkApplications. For a detailed tutorial on building Spark applications that access GCS and
BigQuery, please refer to Using Spark on Kubernetes Engine to Process Data in BigQuery.
A Spark application requires the GCS and BigQuery connectors to access GCS and BigQuery using the Hadoop FileSystem
API. One way to make the connectors available to the driver and executors is to use a custom Spark image with the
connectors built in, as this example Dockerfile shows. An image built from this Dockerfile is located at
gcr.io/ynli-k8s/spark:v2.3.0-gcs.
The connectors require certain Hadoop properties to be set in order to function. Hadoop properties can be set either
through a custom Hadoop configuration file, namely core-site.xml, in a custom image, or via the spec.hadoopConf
section of a SparkApplication. The example Dockerfile mentioned above shows the use of a custom core-site.xml and a
custom spark-env.sh that points the environment variable HADOOP_CONF_DIR to the directory in the container where
core-site.xml is located. The example core-site.xml and spark-env.sh can be found here.
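As a rough illustration of the spec.hadoopConf route, the fragment below sets the two properties that register the GCS connector for gs:// URIs. This is a sketch, not a copy of the example core-site.xml: it assumes the connector JAR is already on the classpath of the image, and settings like these, which apply to every application, would normally live in core-site.xml rather than in an individual SparkApplication.

```yaml
# Fragment of a SparkApplication spec, not a complete manifest.
spec:
  hadoopConf:
    # Map the gs:// scheme to the GCS connector's FileSystem implementations.
    # Assumption: the connector JAR is already present in the image.
    "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
    "fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS"
```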
The GCS and BigQuery connectors need to authenticate with the GCS and BigQuery services before they can use them. The connectors support using a GCP service account JSON key file for authentication. The service account must be granted the IAM roles necessary to access GCS and/or BigQuery. The tutorial has detailed information on how to create a service account, grant it the right roles, furnish a key, and download a JSON key file. To tell the connectors to use a service account JSON key file for authentication, the following Hadoop configuration properties must be set:
google.cloud.auth.service.account.enable=true
google.cloud.auth.service.account.json.keyfile=<path to the service account JSON key file in the container>
The most common way of getting the service account JSON key file into the driver and executor containers is to mount the key file through a Kubernetes secret volume. Detailed information on how to create a secret can be found in the tutorial.
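For orientation, a secret holding the key file could be declared with a manifest along the following lines. This is only a sketch and not necessarily the method the tutorial uses. The secret name gcs-bq matches the secrets entries in the example manifest in this document, the data key key.json is an assumption chosen so that the file appears as key.json under the mount path, and the placeholder must be replaced with the real key contents.

```yaml
apiVersion: v1
kind: Secret
metadata:
  # Must match the secret name referenced by the driver and executor specs.
  name: gcs-bq
type: Opaque
stringData:
  # Each key in the secret becomes a file name under the mount path,
  # so this entry is mounted as <mount path>/key.json.
  key.json: |
    <contents of the downloaded service account JSON key file>
```

With a mount path of /mnt/secrets, the key file is then available at /mnt/secrets/key.json, which is the value to use for google.cloud.auth.service.account.json.keyfile.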
Below is an example SparkApplication using the custom image at gcr.io/ynli-k8s/spark:v2.3.0-gcs, which has the
GCS/BigQuery connectors and the custom Hadoop configuration files above built in. Note that some of the necessary
Hadoop configuration properties are set using spec.hadoopConf. These properties are in addition to the ones set in the
built-in core-site.xml. They are set here because they are application-specific, whereas the ones set in
core-site.xml apply to all applications using the image. Also note how the Kubernetes secret named gcs-bq, which
stores the service account JSON key file, gets mounted into both the driver and executors. The environment variable
GCS_PROJECT_ID must be set when using the image at gcr.io/ynli-k8s/spark:v2.3.0-gcs.
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: foo-gcs-bg
spec:
  type: Java
  mode: cluster
  image: gcr.io/ynli-k8s/spark:v2.3.0-gcs
  imagePullPolicy: Always
  hadoopConf:
    "fs.gs.project.id": "foo"
    "fs.gs.system.bucket": "foo-bucket"
    "google.cloud.auth.service.account.enable": "true"
    "google.cloud.auth.service.account.json.keyfile": "/mnt/secrets/key.json"
  driver:
    cores: 1
    secrets:
    - name: "gcs-bq"
      path: "/mnt/secrets"
      secretType: GCPServiceAccount
    envVars:
      GCS_PROJECT_ID: foo
    serviceAccount: spark
  executor:
    instances: 2
    cores: 1
    memory: "512m"
    secrets:
    - name: "gcs-bq"
      path: "/mnt/secrets"
      secretType: GCPServiceAccount
    envVars:
      GCS_PROJECT_ID: foo
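
The manifest above focuses on the GCS/BigQuery settings and does not name the application to run. A runnable SparkApplication would also need fields such as the ones sketched below, merged into the same spec. Every value here is hypothetical: the class name, JAR path, and argument are placeholders, and the Spark version is only inferred from the image tag.

```yaml
# Fragment to merge into the spec above; all values are placeholders.
spec:
  sparkVersion: "2.3.0"
  # Hypothetical main class and application JAR baked into the image.
  mainClass: com.example.BigQueryWordCount
  mainApplicationFile: "local:///opt/spark/jars/example-app.jar"
  arguments:
    # Hypothetical GCS path, read through the GCS connector.
    - "gs://foo-bucket/input"
```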