Embulk::Input::Bigquery

This is an Embulk input plugin for BigQuery.

Installation

Install it yourself as:

$ embulk gem install embulk-input-bigquery

Configuration

Options

Query Options

This plugin uses the google-cloud gem (Google Cloud Client Library for Ruby) and queries data using the synchronous method. The optional configuration items follow those of the Google Cloud Client Library.

name	type	required?	default	description
max	integer	optional	`null`	The maximum number of rows of data to return per page of results. Setting this flag to a small value such as 1000 and then paging through results might improve reliability when the query result set is large. In addition to this limit, responses are also limited to 10 MB. By default, there is no maximum row count, and only the byte limit applies.
cache	boolean	optional	true	Whether to look for the result in the query cache. The query cache is a best-effort cache that will be flushed whenever tables in the query are modified. The default value is true. For more information, see query caching.
standard_sql	boolean	optional	true	Specifies whether to use BigQuery's standard SQL dialect for this query. If set to true, the query uses standard SQL rather than the legacy SQL dialect, and the values of `large_results` and `flatten` are ignored; the query runs as if `large_results` is true and `flatten` is false.
legacy_sql	boolean	optional	false	Specifies whether to use BigQuery's legacy SQL dialect for this query. If set to false, the query uses BigQuery's standard SQL, and the values of `large_results` and `flatten` are ignored; the query runs as if `large_results` is true and `flatten` is false.
location	string	optional	`null`	If your data is in a location other than the US or EU multi-region, you must specify the location. See also Dataset Locations | BigQuery | Google Cloud.
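
The defaults in the table above can be summarized in a short Ruby sketch. `query_options` and `DEFAULT_QUERY_OPTIONS` are hypothetical names used only for illustration; they are not part of the plugin or of the Google Cloud Client Library.

```ruby
# Hypothetical helper: merge user-supplied query options over the defaults
# listed in the options table. Not part of the plugin's API.
DEFAULT_QUERY_OPTIONS = {
  max: nil,            # no row limit per page; only the byte limit applies
  cache: true,         # look for the result in the query cache
  standard_sql: true,  # use the standard SQL dialect
  legacy_sql: false,   # do not use the legacy SQL dialect
  location: nil        # must be set outside the US/EU multi-regions
}.freeze

def query_options(config)
  unknown = config.keys - DEFAULT_QUERY_OPTIONS.keys
  raise ArgumentError, "unknown option(s): #{unknown.join(', ')}" unless unknown.empty?

  # Drop nil values so that unset options are simply not passed along.
  DEFAULT_QUERY_OPTIONS.merge(config).reject { |_key, value| value.nil? }
end

p query_options(max: 2000, location: 'asia-northeast1')
```

With an empty configuration, only `cache`, `standard_sql`, and `legacy_sql` remain, which matches the "optional" column of the table.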

Example

in:
  type: bigquery
  project: 'project-name'
  keyfile: '/home/hogehoge/bigquery-keyfile.json'
  sql: 'SELECT price,category_id FROM [ecsite.products] GROUP BY category_id'
  columns:
    - {name: price, type: long}
    - {name: category_id, type: string}
  max: 2000

  # If your data is in a location other than the US or EU multi-region, you must specify the location.
  # location: asia-northeast1
out:
  type: stdout

If the table name changes (for example, it includes a date suffix), use sql_erb together with erb_params:

in:
  type: bigquery
  project: 'project-name'
  keyfile: '/home/hogehoge/bigquery-keyfile.json'
  sql_erb: 'SELECT price,category_id FROM [ecsite.products_<%= params["date"].strftime("%Y%m")  %>] GROUP BY category_id'
  erb_params:
    date: "require 'date'; (Date.today - 1)"
  columns:
    - {name: price, type: long}
    - {name: category_id, type: long}
    - {name: month, type: timestamp, format: '%Y-%m', eval: 'require "time"; Time.parse(params["date"]).to_i'}
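
To make the template mechanism concrete, here is a minimal Ruby sketch of how an `sql_erb` string could be rendered from `erb_params`. It assumes that each `erb_params` value is evaluated as Ruby code and exposed to the template as `params`; `render_sql` is a hypothetical helper, and the plugin's actual implementation may differ.

```ruby
require 'erb'
require 'date'

# Hypothetical helper: evaluate each erb_params value as Ruby code, then
# render the sql_erb template with the results available as `params`.
def render_sql(sql_erb, erb_params)
  # The values come from the trusted config file, so they are eval'ed as-is.
  params = erb_params.transform_values { |code| eval(code) }
  ERB.new(sql_erb).result(binding)
end

template = 'SELECT price,category_id FROM [ecsite.products_<%= params["date"].strftime("%Y%m") %>] GROUP BY category_id'

# A fixed date is used here so that the result is reproducible.
puts render_sql(template, 'date' => "require 'date'; Date.new(2024, 3, 15)")
```

With the fixed date above, the table name in the rendered query becomes `ecsite.products_202403`.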

Authentication

JSON key of GCP's service account

You first need to create a service account (client ID), download its JSON key, and deploy the key together with Embulk.

in:
  type: bigquery
  project: project_name
  keyfile: /path/to/keyfile.json

You can also embed the contents of the JSON key file in config.yml.

in:
  type: bigquery
  project: project_name
  keyfile:
    content: |
      {
        "type": "service_account",
        "project_id": "example-project",
        "private_key_id": "1234567890ABCDEFG",
        "private_key": "**************************************",
        "client_email": "example-project@hogehoge.gserviceaccount.com",
        "client_id": "12345678901234567890",
        "auth_uri": "https://accounts.google.com/o/oauth2/auth",
        "token_uri": "https://accounts.google.com/o/oauth2/token",
        "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
        "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/hogehoge.gcp.iam.gserviceaccount.com"
      }
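
The two forms of `keyfile` (a path, or a mapping with an embedded `content` string) can be handled uniformly. The following is a minimal sketch under that assumption; `load_keyfile` is a hypothetical helper, not the plugin's own code.

```ruby
require 'json'

# Hypothetical helper: accept either a path to the key file or a hash with
# an embedded "content" string, and return the parsed JSON key.
def load_keyfile(keyfile)
  json = if keyfile.is_a?(Hash)
           keyfile.fetch('content')
         else
           File.read(keyfile)
         end
  JSON.parse(json)
end

embedded = { 'content' => '{"type": "service_account", "project_id": "example-project"}' }
key = load_keyfile(embedded)
puts key['project_id']
```

Passing a string such as `/path/to/keyfile.json` reads the same JSON from disk instead.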

Automatically determine column schema from query results

The column schema can be determined automatically from the query results if the columns definition is not given. Note that the plugin has to wait until the BigQuery query job completes to get the schema information.

in:
  type: bigquery
  project: project_name
  keyfile: /path/to/keyfile.json
  sql: 'SELECT price,category_id FROM [ecsite.products] GROUP BY category_id'
out:
  type: stdout
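
As an illustration of what determining the schema involves, the sketch below converts result field types into Embulk column types. The type mapping is an assumption made for this example, and `columns_from_fields` is a hypothetical helper; the plugin's actual mapping may differ.

```ruby
# Assumed mapping from BigQuery field types to Embulk column types;
# the plugin's actual mapping may differ.
TYPE_MAP = {
  'BOOLEAN' => :boolean, 'BOOL' => :boolean,
  'INTEGER' => :long, 'INT64' => :long,
  'FLOAT' => :double, 'FLOAT64' => :double,
  'STRING' => :string,
  'TIMESTAMP' => :timestamp
}.freeze

# Hypothetical helper: build a columns definition from result fields,
# given as [name, BigQuery type] pairs.
def columns_from_fields(fields)
  fields.map do |name, bq_type|
    type = TYPE_MAP.fetch(bq_type.upcase) do
      raise ArgumentError, "unsupported BigQuery type: #{bq_type}"
    end
    { name: name, type: type }
  end
end

p columns_from_fields([['price', 'INTEGER'], ['category_id', 'STRING']])
```

For the query above, this yields a `long` column for `price` and a `string` column for `category_id`, the same shape as a hand-written columns definition.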

Another Choice

embulk-input-bigquery runs queries against BigQuery, which incurs query costs. To save money, you can use the following procedure instead:

Export data from BigQuery to GCS in Avro format.
Use embulk-input-gcs and embulk-parser-avro to read the exported data from GCS.

Development

Run

embulk bundle install --path vendor/bundle
embulk run -X page_size=1 -b . -l trace example/example.yml

Release gem

Update the version in lib/embulk/input/bigquery/version.rb, then run:

$ bundle exec rake release

ChangeLog

CHANGELOG.md
