Summary: Learn how to use Remote Functions to de/re-identify data with DLP in BigQuery using SQL.
This document discusses how to detect and de-identify sensitive data like personally identifiable information (PII) in BigQuery tables with simple SQL based functions, using Cloud Data Loss Prevention (Cloud DLP). De-identification techniques like encryption lets you preserve the utility of your data for joining or analytics while reducing the risk of handling the data by de-identifying/obfuscating the raw sensitive identifiers.
Enterprises may have policy or regulatory requirements to only store de-identified data in their DataWarehouses, and a need to quickly re-identify the de-identify data for reports generation. To minimize the risk of handling large volumes of sensitive data, you can use an automated data transformation pipeline to create de-identified datasets that can be used for migrating from on-premises to cloud. You can use this project to either replace that pipeline with a SQL query for de-identification and quick re-identification or only for re-identification.
Cloud DLP can inspect the data for sensitive information when the dataset has not been characterized, by using more than 100 built-in classifiers. Using DLP to de-identify the data requires complex data pipelines/applications. This solution aims to help your data analysts/engineers/scientists to achieve the same outcome through Simple SQL functions using BigQuery Remote Functions which are a powerful way to extend the functionality of BigQuery.
This document demonstrates a reference implementation of de-identifying structured data in BigQuery using DLP and remote function (hosted on Cloud Run).
This document is intended for audience whose responsibilities include data security, data processing, or data analytics. This guide assumes that you're familiar with data processing and data privacy, without the need to be an expert and run simple shell and SQL scripts.
- Deploy Cloud Run service providing DLP based encryption functionality
- Create BigQuery Remote functions that use DLP de-identify templates
- Verify data encryption in BigQuery using a SQL query
This tutorial uses billable components of Google Cloud, including the following:
Use the pricing calculator to generate a cost estimate based on your projected usage.
For this tutorial, you need a Google Cloud project. To make cleanup easiest at the end of the tutorial, we recommend that you create a new project for this tutorial.
-
Make sure that billing is enabled for your Google Cloud project.
-
At the bottom of the Cloud Console, a Cloud Shell session opens and displays a command-line prompt. Cloud Shell is a shell environment with the Cloud SDK already installed, including the gcloud command-line tool, and with values already set for your current project. It can take a few seconds for the session to initialize.
-
In Cloud Shell, clone the source repository and go to the directory for this tutorial:
git clone https://github.com/GoogleCloudPlatform/bigquery-dlp-remote-function.git cd bigquery-dlp-remote-function/
-
Enable all the required Google Cloud APIs
gcloud services enable \ artifactregistry.googleapis.com \ bigquery.googleapis.com \ bigqueryconnection.googleapis.com \ cloudbuild.googleapis.com \ cloudkms.googleapis.com \ containerregistry.googleapis.com \ dlp.googleapis.com \ run.googleapis.com \ secretmanager.googleapis.com
-
Authenticate using User Application Default Credentials ("ADCs") as a primary authentication method.
gcloud auth application-default login
-
Initialize and run the Terraform script to create all resources:
terraform init && \ terraform apply
-
Once the script successfully completes resources creation, visit BigQuery Console to run the test SQL script
SELECT pii_column, fns.dlp_freetext_encrypt(pii_column) AS dlp_encrypted, fns.dlp_freetext_decrypt(fns.dlp_freetext_encrypt(pii_column)) AS dlp_decrypted FROM UNNEST( [ 'My name is John Doe. My email is john@doe.com', 'Some non PII data', '212-233-4532', 'some script with simple number 1234']) AS pii_column
The query will produce a result similar to following table:
Row pii_column dlp_encrypted dlp_decrypted 1 My name is John Doe. My email is john@doe.com My name is John Doe. My email is BQ_TRF_EMAIL(40):AQy6lGvwKR+AiiRqJpEr+nBzZUzOcjXkXamUugU= My name is John Doe. My email is john@doe.com 2 Some non PII data Some non PII data Some non PII data 3 212-233-4532 BQ_TRF_PH(40):AbovCRcD0RwhqTJh1SuQmEfXERQoAmOKxwC2jc8= 212-233-4532 4 some script with simple number 1234 some script with simple number 1234 some script with simple number 1234
In case you want to customize the deployment, please use following steps:
-
Enable APIs for Compute Engine, Cloud Storage, Dataproc, and Cloud SQL services:
gcloud services enable --project "${PROJECT_ID}" \ artifactregistry.googleapis.com \ bigquery.googleapis.com \ bigqueryconnection.googleapis.com \ cloudbuild.googleapis.com \ cloudkms.googleapis.com \ containerregistry.googleapis.com \ dlp.googleapis.com \ iam.googleapis.com \ run.googleapis.com \ secretmanager.googleapis.com
-
In Cloud Shell, set the Cloud Region that you want to create your BigQuery and Cloud Run resources in:
PROJECT_ID="<PROJECT_ID>" REGION="<REGION_ID>" ARTIFACT_REGISTRY_NAME="<ARTIFACT_DOCKER_REGISTRY_NAME>" CLOUD_RUN_SERVICE_NAME="bq-transform-fns"
For fine-grained access control, using a specialized service account for individual services is recommended.
-
Create a service Account:
RUNNER_SA_NAME="${CLOUD_RUN_SERVICE_NAME}-runner" RUNNER_SA_EMAIL="${RUNNER_SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com" gcloud iam service-accounts create "${RUNNER_SA_NAME}" \ --project="${PROJECT_ID}" \ --description "Runner for BigQuery remote function execution" \ --display-name "${RUNNER_SA_NAME}"
-
Grant permissions to the service account to access DLP
gcloud projects add-iam-policy-binding "${PROJECT_ID}" \ --member="serviceAccount:${RUNNER_SA_EMAIL}" \ --role='roles/dlp.deidentifyTemplatesReader' gcloud projects add-iam-policy-binding "${PROJECT_ID}" \ --member="serviceAccount:${RUNNER_SA_EMAIL}" \ --role='roles/dlp.user'
This is a containerized SpringBoot application. Create an Artifact Registry to store the application's container image
gcloud artifacts repositories create "${ARTIFACT_REGISTRY_NAME}" \
--repository-format=docker \
--location="${REGION}" \
--description="Docker repository for Bigquery Functions" \
--project="${PROJECT_ID}"
-
Build the application container image using Cloud Build:
gcloud builds submit . \ --project="${PROJECT_ID}" \ --substitutions=_CONTAINER_IMAGE_NAME="${REGION}-docker.pkg.dev/${PROJECT_ID}/${ARTIFACT_REGISTRY_NAME}/${CLOUD_RUN_SERVICE_NAME}:latest" \ --machine-type=e2-highcpu-8
-
Deploy Cloud Run by compiling and deploying Container :
gcloud beta run deploy ${CLOUD_RUN_SERVICE_NAME} \ --image="${REGION}-docker.pkg.dev/${PROJECT_ID}/${ARTIFACT_REGISTRY_NAME}/${CLOUD_RUN_SERVICE_NAME}:latest" \ --execution-environment=gen2 \ --platform=managed \ --region="${REGION}" \ --service-account="${RUNNER_SA_EMAIL}" \ --update-env-vars=PROJECT_ID=${PROJECT_ID} \ --no-allow-unauthenticated \ --project ${PROJECT_ID}
-
Retrieve and save the Cloud Run URL:
RUN_URL="$(gcloud run services describe ${CLOUD_RUN_SERVICE_NAME} --region ${REGION} --project ${PROJECT_ID} --format="get(status.address.url)")"
DLP Deidentify templates makes managing deidentification and inspection easy to document and manage.
DEID_TEMPLATE=$(curl -X POST \
-H "Authorization: Bearer `gcloud auth print-access-token`" \
-H "Accept: application/json" \
-H "Content-Type: application/json" \
-H "X-Goog-User-Project: ${PROJECT_ID}" \
--data-binary "@sample_dlp_deid_config.json" \
"https://dlp.googleapis.com/v2/projects/${PROJECT_ID}/locations/${REGION}/deidentifyTemplates")
DEID_TEMPLATE_NAME=$(echo ${DEID_TEMPLATE} | jq -r '.name')
Note: Recommended practice is to use KMS Wrapped Key for DLP de-identification.
This document uses unwrapped key for simplification of demo purpose.
-
Create BigQuery connection for accessing Cloud Run:
bq mk --connection \ --display_name='External transform function connection' \ --connection_type=CLOUD_RESOURCE \ --project_id="${PROJECT_ID}" \ --location="${REGION}" \ ext-${CLOUD_RUN_SERVICE_NAME}
-
Find the BigQuery Service Account used for the connection:
CONNECTION_SA="$(bq --project_id ${PROJECT_ID} --format json show --connection ${PROJECT_ID}.${REGION}.ext-${CLOUD_RUN_SERVICE_NAME} | jq '.cloudResource.serviceAccountId')"
-
Grant the BigQuery connection Service Account Cloud Run Invoker role for accessing the Cloud Run:
gcloud projects add-iam-policy-binding ${PROJECT_ID} \ --member="serviceAccount:${CONNECTION_SA}" \ --role='roles/run.invoker'
-
Define the BigQuery dataset to create remote functions:
BQ_FUNCTION_DATASET="fns"
-
Create the dataset if it doesn't exist:
bq mk --dataset \ --project_id ${PROJECT_ID} \ --location ${REGION} \ ${BQ_FUNCTION_DATASET}
-
Create DLP de-identification function:
DLP uses default inspection template, if you want to use your custom Inspection template, please add following to the
user_defined_context
:('dlp-inspect-template', '<full inspection template name>')
bq query --project_id ${PROJECT_ID} \ --use_legacy_sql=false \ "CREATE OR REPLACE FUNCTION ${BQ_FUNCTION_DATASET}.dlp_freetext_encrypt(v STRING) RETURNS STRING REMOTE WITH CONNECTION \`${PROJECT_ID}.${REGION}.ext-${CLOUD_RUN_SERVICE_NAME}\` OPTIONS (endpoint = '${RUN_URL}', user_defined_context = [('mode', 'deidentify'),('algo','dlp'),('dlp-deid-template','${DEID_TEMPLATE_NAME}')]);"
-
Create DLP re-identification function:
DLP uses default inspection template, if you want to use your custom Inspection template, please add following to the
user_defined_context
:('dlp-inspect-template', '<full inspection template name>')
bq query --project_id ${PROJECT_ID} \ --use_legacy_sql=false \ "CREATE OR REPLACE FUNCTION ${BQ_FUNCTION_DATASET}.dlp_freetext_decrypt(v STRING) RETURNS STRING REMOTE WITH CONNECTION \`${PROJECT_ID}.${REGION}.ext-${CLOUD_RUN_SERVICE_NAME}\` OPTIONS (endpoint = '${RUN_URL}', user_defined_context = [('mode', 'reidentify'),('algo','dlp'),('dlp-deid-template','${DEID_TEMPLATE_NAME}')]);"
Execute the following query to observe that the remote function is deidentifying and reidentifying the data using SQL:
-
Using BigQuery Workspace
SELECT pii_column, fns.dlp_freetext_encrypt(pii_column) AS dlp_encrypted, fns.dlp_freetext_decrypt(fns.dlp_freetext_encrypt(pii_column)) AS dlp_decrypted FROM UNNEST( [ 'My name is John Doe. My email is john@doe.com', 'Some non PII data', '212-233-4532', 'some script with simple number 1234']) AS pii_column
-
Using Cloud Shell run the following command to run the query on BigQuery:
bq query --project_id ${PROJECT_ID} \ --use_legacy_sql=false \ " SELECT pii_column, ${BQ_FUNCTION_DATASET}.dlp_freetext_encrypt(pii_column) AS dlp_encrypted, ${BQ_FUNCTION_DATASET}.dlp_freetext_decrypt(${BQ_FUNCTION_DATASET}.dlp_freetext_encrypt(pii_column)) AS dlp_decrypted FROM UNNEST( [ 'My name is John Doe. My email is john@doe.com', 'Some non PII data' ]) AS pii_column"
Expected Output:
pii_column dlp_encrypted dlp_decrypted aes_encrypted aes_decrypted My name is John Doe. My email is john@doe.com My name is John Doe. My email is BQ_TOK_FN(40):AQy6lGvwKR+AiiRqJpEr+nBzZUzOcjXkXamUugU= My name is John Doe. My email is john@doe.com qWnwDEY3bDTCV/azveHnxUm24z/DvUcVLZWHyN4Rg16ISvqswcuYXNXsOyI4/d8M My name is John Doe. My email is john@doe.com Some non PII data Some non PII data Some non PII data 37Agm90CqzGXwerJxai/Tf89ffRN9FpPkuuDW+rz7ns= Some non PII data
- The de-identification and re-identification processing occurs through a Cloud Run service. Please provision the Cloud Run CPU/Memory in accordance with your compute requirements
- When using DLP consider following:
- Use at least
--cpu=4 --memory=8Gi
for Cloud Run instance - When using DLP consider following limits for throughput and cost considerations:
- Limit to 10000 or fewer items for DLP processing
- Use at least
- Ensure BigQuery, Cloud Run service and DLP templates are in the same cloud region
- Everything in the project is deleted. If you used an existing project for this tutorial, when you delete it, you also delete any other work you've done in the project.
- Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, you can delete the project:
- In the Cloud Console, go to the Manage resources page.
- In the project list, select the project that you want to delete and then click Delete .
- In the dialog, type the project ID and then click Shut down to delete the project.
- Learn more about BigQuery Remote Functions
- Learn more about Cloud DLP
- Learn more about Cloud KMS
- Learn about Inspecting storage and databases for sensitive data
- Handling De-identification and re-identification of PII in large-scale datasets using DLP
License: Apache 2.0
This is not an official Google product.