This is the sample code for the BigQuery Ingestion to Insights tutorial. The tutorial explains how to ingest highly normalized (OLTP database style) data into BigQuery using Cloud Dataflow. To understand this sample code, it is recommended that you review the [Google Cloud Dataflow programming model](https://cloud.google.com/dataflow/).
This tutorial relies on the MusicBrainz dataset.
Note that this tutorial assumes that you have Java 8, Maven, and the Google Cloud SDK installed. It also requires that you have already created a project, a Dataflow staging bucket, and a BigQuery dataset.
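As a minimal sketch (not part of the tutorial itself), the prerequisites can be created with the Cloud SDK as follows, where PROJECT, STAGING_BUCKET, and DATASET are placeholders for your own values:

```
# PROJECT, STAGING_BUCKET, and DATASET are placeholders for your own values.
gcloud config set project PROJECT     # select the project to work in
gsutil mb gs://STAGING_BUCKET         # create the Dataflow staging bucket
bq mk DATASET                         # create the BigQuery dataset
```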
The repository consists of the following Java classes:
- com.google.cloud.bqetl.BQETLSimple - simple pipeline for ingesting MusicBrainz artist recording data as a flat table of artists' recordings.
- com.google.cloud.bqetl.BQETLNested - revision of the simple pipeline that nests each artist's recordings as a repeated record inside the BigQuery table row that pertains to that artist (see the query sketch after this list).
- com.google.cloud.bqetl.mbdata.MusicBrainzDataObject - a general-purpose object to represent a row of MusicBrainz data
- com.google.cloud.bqetl.mbdata.MusicBrainzTransforms - the library of functions that implements the transforms used in the above pipelines and allows them to be reused
- com.google.cloud.bqetl.json.JSONReader - class for converting a JSON representation of a MusicBrainz row into a MusicBrainzDataObject
- com.google.cloud.bqetl.options.BQETLOptions - the options for the BQETL pipelines
- com.google.cloud.bqetl.JSONReaderTest - test for the JSONReader
- com.google.cloud.bqetl.MusicBrainzTransformsTest - tests the transforms library
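Once either pipeline has loaded a destination table, the effect of the nested (repeated record) layout can be inspected with a standard SQL query. The sketch below is an assumption, not part of the sample: the table and column names (DATASET.DESTINATION_TABLE, artist_name, artist_recordings, recording_name) are hypothetical and depend on the schema your run produces.

```
# Hypothetical table and column names; adjust to the schema your run produces.
bq query --use_legacy_sql=false \
  'SELECT artist_name, r.recording_name
   FROM DATASET.DESTINATION_TABLE AS t, UNNEST(t.artist_recordings) AS r
   LIMIT 10'
```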
The repository consists of the following scripts and resources:
- src/test/resources - data files for the test classes
- dataflow-staging-policy.json - a policy for expiring objects in the bucket used for staging the Dataflow jobs (see the gsutil example after this list)
- run-simple.example - example script for running the simple pipeline using maven
- run-nested.example - example script for running the nested pipeline using maven
- pom.xml - the Maven build file
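The expiration policy can be applied to the staging bucket with gsutil's lifecycle command (STAGING_BUCKET is again a placeholder for your own bucket):

```
# Apply the object-expiration policy in dataflow-staging-policy.json
# to the Dataflow staging bucket.
gsutil lifecycle set dataflow-staging-policy.json gs://STAGING_BUCKET
```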
To run the pipelines:

- Copy the files run-simple.example and run-nested.example to run-simple and run-nested respectively:

  ```
  cp run-simple.example run-simple
  cp run-nested.example run-nested
  ```

- Edit each file, replacing #STAGING_BUCKET, #PROJECT, and #DATASET with the values specific to your account.
- Edit each file, replacing #DESTINATION_TABLE with the table you want to load the denormalized data into. Note that to preserve the output of each run, you may want to use a different destination table for each script.
- Save the changes to each script.
- Run either script as desired.
- As an alternative to editing the scripts, you can simply copy and paste the command from each script into your shell, replacing the aforementioned placeholders with values specific to your project (see the sketch after this list).
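For reference, the command wrapped by the scripts looks roughly like the sketch below. Only --project, --stagingLocation, and --runner are standard Dataflow pipeline options; any further flags (for example, the destination table) are defined by BQETLOptions, so consult run-simple.example for the authoritative argument list.

```
# A hypothetical invocation; run-simple.example contains the real flags.
mvn compile exec:java \
  -Dexec.mainClass=com.google.cloud.bqetl.BQETLSimple \
  -Dexec.args="--project=PROJECT \
    --stagingLocation=gs://STAGING_BUCKET/staging \
    --runner=BlockingDataflowPipelineRunner"
```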
We welcome all usage-related questions on [Stack Overflow](https://stackoverflow.com/questions/tagged/google-cloud-dataflow) tagged with `google-cloud-dataflow`.
Please use the issue tracker on GitHub to report bugs or to raise comments or questions regarding SDK development.