This tutorial demonstrates how to use Google Cloud Dataflow to analyze logs collected and exported by Google Cloud Logging. The tutorial highlights support for batch and streaming, multiple data sources, windowing, aggregations, and Google BigQuery output.
For details about how the tutorial works, see Processing Logs at Scale Using Cloud Dataflow on the Google Cloud Platform website.
- Java JDK (version 1.7 or greater)
- Maven (version 3 or greater)
- A Google Cloud Platform account
- Install and set up the Google Cloud SDK
After installing the Google Cloud SDK, run gcloud components update to install or update the following additional components (an example invocation follows the list):
- BigQuery Command Line Tool
- Cloud SDK Core Libraries
- gcloud Alpha Commands
- gcloud Beta Commands
- gcloud app Python Extensions
- kubectl
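For example, a typical way to bring the components up to date and add kubectl (component IDs can vary between Cloud SDK versions, so check gcloud components list if one is not found):
# Update already-installed components
$ gcloud components update
# Install the kubectl component if it is not present
$ gcloud components install kubectl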
Set your preferred zone and project:
$ gcloud config set compute/zone ZONE
$ gcloud config set project PROJECT-ID
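For example, using the us-central1-f zone and a hypothetical project ID (substitute your own values):
$ gcloud config set compute/zone us-central1-f
$ gcloud config set project my-log-analysis-project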
Ensure the following APIs are enabled in the Google Cloud Console. Navigate to API Manager and enable:
- BigQuery
- Google Cloud Dataflow
- Google Cloud Logging
- Google Cloud Pub/Sub
- Google Cloud Storage
- Google Container Engine
The services folder contains three simple applications built using Go and the Gin HTTP web framework. These applications generate the logs to be analyzed by the Dataflow pipeline. The applications have been packaged as Docker images and are available through Google Container Registry. Note: if you are interested in editing or updating these applications, refer to the README.
In the services folder, there are several scripts you can use to facilitate deployment, configuration, and testing of the sample web applications.
First, change the current directory to services:
$ cd dataflow-log-analytics/services
Next, deploy the Container Engine cluster with the sample web applications:
$ ./cluster.sh PROJECT-ID CLUSTER-NAME up
The script deploys a single-node Container Engine cluster, deploys the sample web applications to it, and exposes them as Kubernetes services.
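For example, with a hypothetical project ID and cluster name (replace both with your own values):
# Create the cluster and deploy the sample web applications
$ ./cluster.sh my-log-analysis-project logging-cluster up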
The next step is to configure Cloud Logging to export the web application logs to Google Cloud Storage. The following script creates a Cloud Storage bucket, configures the appropriate permissions, and sets up automated export of the web application logs to Cloud Storage. Note: BUCKET-NAME must not be the name of an existing Cloud Storage bucket.
$ ./logging.sh PROJECT-ID BUCKET-NAME batch up
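For example (the bucket name below is hypothetical; Cloud Storage bucket names are globally unique, so pick one that is not already taken):
# Create the bucket and set up the log export
$ ./logging.sh my-log-analysis-project my-log-analysis-logs batch up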
Now that the applications have been deployed and are logging through Cloud Logging, you can use the following script to generate requests against the applications:
$ ./load.sh REQUESTS CONCURRENCY
This script uses Apache Bench (ab) to generate load against the deployed web applications. REQUESTS controls how many requests are issued to each application, and CONCURRENCY controls how many concurrent requests are issued. The logs from the applications are sent to Cloud Storage in hourly batches, and it can take up to two hours before log entries start to appear. For more information, see the Cloud Logging documentation.
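For example, to send 1000 requests to each application with at most 10 concurrent requests (arbitrary example values):
$ ./load.sh 1000 10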
For information on examining the exported logs and their structure in Cloud Storage, see the Cloud Logging documentation.
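You can also browse the exported log files directly with gsutil once they appear; the object layout under the bucket is determined by Cloud Logging, so treat this listing as illustrative:
# Recursively list the log files exported into the bucket
$ gsutil ls -r gs://BUCKET-NAME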
The following diagram shows the structure and flow of the example Dataflow pipeline:
Before deploying the pipeline, create the BigQuery dataset where output from the Cloud Dataflow pipeline will be stored:
$ gcloud alpha bigquery datasets create DATASET-NAME
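For example, with a hypothetical dataset name (BigQuery dataset names may contain only letters, numbers, and underscores):
$ gcloud alpha bigquery datasets create logs_dataset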
First, change the current directory to dataflow:
$ cd dataflow-log-analytics/dataflow
Next, run the pipeline, replacing BUCKET-NAME with the same name you used for the logging setup:
$ ./pipeline.sh PROJECT-ID DATASET-NAME BUCKET-NAME run
This command builds the code for the Cloud Dataflow pipeline, uploads it to the specified staging area, and launches the job. To see all options available for this pipeline, run the following command:
$ ./pipeline.sh
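As a concrete example, reusing the hypothetical names from the earlier steps (replace them with your own):
# Build, stage, and launch the Dataflow job
$ ./pipeline.sh my-log-analysis-project logs_dataset my-log-analysis-logs run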
While the pipeline is running, you can see its status in the Google Developers Console. Navigate to Dataflow and then click the running job ID. You can see a graphical rendering of the pipeline and examine job logging output along with information about each pipeline stage. Here is an example screenshot of a running Cloud Dataflow job:
After the job has completed, you can see the output in the BigQuery console and compose and run queries against the data.
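For example, you can list the tables the pipeline created and inspect a few rows with the bq command-line tool installed earlier; TABLE-NAME below is a placeholder for one of the table names you see in the dataset:
# List the output tables in the dataset
$ bq ls DATASET-NAME
# Preview a few rows from one of the output tables (legacy SQL syntax)
$ bq query 'SELECT * FROM [DATASET-NAME.TABLE-NAME] LIMIT 10'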
To clean up and remove all resources used in this example:
- Delete the BigQuery dataset:
$ gcloud alpha bigquery datasets delete DATASET-NAME
- Deactivate the Cloud Logging exports. This step deletes the exports and the specified Cloud Storage bucket:
$ cd dataflow-log-analytics/services
$ ./logging.sh PROJECT-ID BUCKET-NAME batch down
- Delete the Container Engine cluster used to run the sample web applications:
$ cd dataflow-log-analytics/services
$ ./cluster.sh PROJECT-ID CLUSTER-NAME down