This tutorial demonstrates how to use Google Cloud Dataflow to analyze logs collected and exported by Google Cloud Logging. The tutorial highlights support for batch and streaming, multiple data sources, windowing, aggregations, and Google BigQuery output.
For details about how the tutorial works, see Processing Logs at Scale Using Cloud Dataflow on the Google Cloud Platform website.
- Java JDK (version 1.7 or greater)
- Maven (version 3 or greater)
- A Google Cloud Platform account
- Install and set up the Google Cloud SDK
After installing the Google Cloud SDK, run gcloud components update to install or update the following additional components (an example invocation follows the list):
- BigQuery Command Line Tool
- Cloud SDK Core Libraries
- gcloud Alpha Commands
- gcloud Beta Commands
- gcloud app Python Extensions
- kubectl
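For example, a typical way to bring the components up to date and add kubectl (component IDs can vary between Cloud SDK versions, so check gcloud components list if one is not found):
# Update already-installed components
$ gcloud components update
# Install the kubectl component if it is not present
$ gcloud components install kubectl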
Set your preferred zone and project:
$ gcloud config set compute/zone ZONE
$ gcloud config set project PROJECT-ID
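For example, using the us-central1-f zone and a hypothetical project ID (substitute your own values):
$ gcloud config set compute/zone us-central1-f
$ gcloud config set project my-log-analysis-project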
Ensure the following APIs are enabled in the Google Cloud Console. Navigate to API Manager and enable:
- BigQuery
- Google Cloud Dataflow
- Google Cloud Logging
- Google Cloud Pub/Sub
- Google Cloud Storage
- Google Container Engine
The services folder contains three simple applications built using Go and the Gin HTTP web framework. These applications generate the logs to be analyzed by the Dataflow pipeline. The applications have been packaged as Docker images and are available through Google Container Registry. Note: if you are interested in editing or updating these applications, refer to the README.
In the services folder, there are several scripts you can use to facilitate deployment, configuration, and testing of the sample web applications.
First, change the current directory to services:
$ cd dataflow-log-analytics/services
Next, deploy the Container Engine cluster with the sample web applications:
$ ./cluster.sh PROJECT-ID CLUSTER-NAME up
The script deploys a single-node Container Engine cluster, deploys the sample web applications to it, and exposes them as Kubernetes services.
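For example, with a hypothetical project ID and cluster name (replace both with your own values):
# Create the cluster and deploy the sample web applications
$ ./cluster.sh my-log-analysis-project logging-cluster up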
The next step is to configure Cloud Logging to export the web application logs to Google Cloud Storage. The following script creates a Cloud Storage bucket, configures the appropriate permissions, and sets up automated export of the web application logs to Cloud Storage. Note: BUCKET-NAME must not be the name of an existing Cloud Storage bucket.
$ ./logging.sh PROJECT-ID BUCKET-NAME batch up
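For example (the bucket name below is hypothetical; Cloud Storage bucket names are globally unique, so pick one that is not already taken):
# Create the bucket and set up the log export
$ ./logging.sh my-log-analysis-project my-log-analysis-logs batch up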
Now that the applications have been deployed and are logging through Cloud Logging, you can use the following script to generate requests against the applications:
$ ./load.sh REQUESTS CONCURRENCY
This script uses Apache Bench (ab) to generate load against the deployed web applications. REQUESTS controls how many requests are issued to each application, and CONCURRENCY controls how many concurrent requests are issued. The logs from the applications are sent to Cloud Storage in hourly batches, and it can take up to two hours before log entries start to appear. For more information, see the Cloud Logging documentation.
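For example, to send 1000 requests to each application with at most 10 concurrent requests (arbitrary example values):
$ ./load.sh 1000 10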
For information on examining the exported logs and their structure in Cloud Storage, see the Cloud Logging documentation.
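You can also browse the exported log files directly with gsutil once they appear; the object layout under the bucket is determined by Cloud Logging, so treat this listing as illustrative:
# Recursively list the log files exported into the bucket
$ gsutil ls -r gs://BUCKET-NAME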
The following diagram shows the structure and flow of the example Dataflow pipeline:
Before deploying the pipeline, create the BigQuery dataset where output from the Cloud Dataflow pipeline will be stored:
$ gcloud alpha bigquery datasets create DATASET-NAME
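For example, with a hypothetical dataset name (BigQuery dataset names may contain only letters, numbers, and underscores):
$ gcloud alpha bigquery datasets create logs_dataset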
First, change the current directory to dataflow:
$ cd dataflow-log-analytics/dataflow
Next, run the pipeline, replacing BUCKET-NAME with the same name you used for the logging setup:
$ ./pipeline.sh PROJECT-ID DATASET-NAME BUCKET-NAME run
This command builds the code for the Cloud Dataflow pipeline, uploads it to the specified staging area, and launches the job. To see all options available for this pipeline, run the following command:
$ ./pipeline.sh
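As a concrete example, reusing the hypothetical names from the earlier steps (replace them with your own):
# Build, stage, and launch the Dataflow job
$ ./pipeline.sh my-log-analysis-project logs_dataset my-log-analysis-logs run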
While the pipeline is running, you can see its status in the Google Developers Console. Navigate to Dataflow and then click the running job ID. You can see a graphical rendering of the pipeline and examine job logging output along with information about each pipeline stage. Here is an example screenshot of a running Cloud Dataflow job:
After the job has completed, you can see the output in the BigQuery console and compose and run queries against the data.
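For example, you can list the tables the pipeline created and inspect a few rows with the bq command-line tool installed earlier; TABLE-NAME below is a placeholder for one of the table names you see in the dataset:
# List the output tables in the dataset
$ bq ls DATASET-NAME
# Preview a few rows from one of the output tables (legacy SQL syntax)
$ bq query 'SELECT * FROM [DATASET-NAME.TABLE-NAME] LIMIT 10'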
To clean up and remove all resources used in this example:
- Delete the BigQuery dataset:
$ gcloud alpha bigquery datasets delete DATASET-NAME
- Deactivate the Cloud Logging exports. This step deletes the exports and the specified Cloud Storage bucket:
$ cd dataflow-log-analytics/services
$ ./logging.sh PROJECT-ID BUCKET-NAME batch down
- Delete the Container Engine cluster used to run the sample web applications:
$ cd dataflow-log-analytics/services
$ ./cluster.sh PROJECT-ID CLUSTER-NAME down