Releases: GoogleCloudDataproc/spark-bigquery-connector
0.17.0
New Features
- Structured streaming write is now supported (PR #201, thanks @varundhussa)
- Users now have the option to keep the data on GCS after writing to BigQuery (PR #202, thanks @leoneuwald)
- The data of a single date partition can now be overwritten (PR #211)
- Supporting `MATERIALIZED_VIEW` as table type (PR #192)
- Supporting columnar batch reads from Spark in the DataSource V2 implementation (PR #198). It is not ready for production use.
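As a rough illustration of the single-partition overwrite listed above, the sketch below wraps a DataFrame write in a small helper. The `datePartition` and `temporaryGcsBucket` option names come from the connector's README rather than from these notes, and the helper name and its arguments are hypothetical. Treat it as a sketch under those assumptions, not as the connector's reference usage.

```python
import re


def overwrite_date_partition(df, table, date_partition, temporary_gcs_bucket):
    """Overwrite one date partition of `table` with the rows of `df`.

    `df` is expected to be a Spark DataFrame. The option names used here
    ("datePartition", "temporaryGcsBucket") are assumptions based on the
    connector's README, not on the release notes.
    """
    # The partition is assumed to be given as YYYYMMDD, e.g. "20200601".
    if not re.fullmatch(r"\d{8}", date_partition):
        raise ValueError("date_partition must look like YYYYMMDD")

    (
        df.write.format("bigquery")
        .option("temporaryGcsBucket", temporary_gcs_bucket)
        .option("datePartition", date_partition)
        .mode("overwrite")  # replace the contents of the targeted partition
        .save(table)
    )
```

A call might look like `overwrite_date_partition(df, "dataset.events", "20200601", "my-temp-bucket")`, where the table and bucket names are placeholders.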
Bug Fixes
- Conditions on StructType fields are now handled by Spark and not by the connector, fixing issue #197
Dependency Updates
- BigQuery API has been upgraded to version 1.116.3
- BigQuery Storage API has been upgraded to version 1.0.0
- Netty has been upgraded to version 4.1.48.Final (Fixing issue #200)
0.16.1
New Features
- Apache Arrow is now the default read format. Based on our benchmarking, reads with Arrow are 40% faster than with Avro. (PR #180)
- Apache Avro has been added as a write intermediate format. Based on our testing, it shows performance improvements when the DataFrame is larger than 50GB (PR #163)
- Usage simplification: instead of using the mandatory `table` option, users can now use the built-in `path` parameter of `load()` and `save()`, so that a read becomes `df = spark.read.format("bigquery").load("source_table")` and a write becomes `df.write.format("bigquery").save("target_table")` (PR #176)
- An experimental implementation of the DataSource v2 API has been added. It is not ready for production use.
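The usage simplification above can be sketched as a pair of small helpers that contrast the older `table` option with the newer path argument. The helper names are hypothetical, and `spark` and `df` are assumed to be an existing SparkSession and DataFrame. Depending on the setup, a write may need further configuration that is not covered in these notes.

```python
def read_table_with_option(spark, table):
    """Older style: pass the table through the `table` option."""
    return spark.read.format("bigquery").option("table", table).load()


def read_table(spark, table):
    """Newer style: pass the table as the path argument of load()."""
    return spark.read.format("bigquery").load(table)


def write_table(df, table):
    """Newer style: pass the target table as the path argument of save()."""
    df.write.format("bigquery").save(table)
```

With these helpers, `read_table(spark, "source_table")` and `write_table(df, "target_table")` correspond to the one-liners quoted in the notes.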
Dependency Updates
- BigQuery API has been upgraded to version 1.116.1
- BigQuery Storage API has been upgraded to version 0.133.2-beta
- gRPC has been upgraded to version 1.29.0
- Guava has been upgraded to version 29.0-jre
0.15.1-beta
0.15.0-beta
- PR #150: Reading `DataFrame`s should be quicker, especially in interactive usage such as in notebooks
- PR #154: Upgraded to the BigQuery Storage v1 API
- PR #146: Authentication can now be done using an AccessToken, in addition to a Credentials file, Credentials, and the `GOOGLE_APPLICATION_CREDENTIALS` environment variable.
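To make the authentication alternatives above more concrete, here is a minimal sketch of a helper that builds the reader or writer options for one chosen method. The option names `gcpAccessToken`, `credentialsFile`, and `credentials` (a Base64-encoded JSON key) are assumptions taken from the connector's README, not from these notes, and the helper itself is hypothetical.

```python
import base64


def auth_options(access_token=None, credentials_file=None, credentials_json=None):
    """Return connector options for exactly one authentication method.

    Option names are assumptions based on the connector's README. An empty
    dict means: rely on GOOGLE_APPLICATION_CREDENTIALS or the environment's
    default credentials.
    """
    provided = [
        value
        for value in (access_token, credentials_file, credentials_json)
        if value is not None
    ]
    if len(provided) > 1:
        raise ValueError("choose a single authentication method")

    if access_token is not None:
        return {"gcpAccessToken": access_token}
    if credentials_file is not None:
        return {"credentialsFile": credentials_file}
    if credentials_json is not None:
        # The JSON key is passed Base64-encoded rather than as raw JSON.
        encoded = base64.b64encode(credentials_json.encode("utf-8"))
        return {"credentials": encoded.decode("ascii")}
    return {}
```

The resulting dictionary could then be applied with something like `spark.read.format("bigquery").options(**auth_options(access_token=token))`, where `token` is a placeholder for a token obtained elsewhere.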
0.14.0-beta
- Issue #96: Added Arrow as a supported format for reading from BigQuery
- Issue #130: Adding the field description to the schema metadata
- Issue #124: Fixing null values in ArrayType
- Issue #143: Allowing the setting of `SchemaUpdateOption`s when writing to BigQuery
- PR #148: Add support for writing clustered tables
- Upgrade version of google-cloud-bigquery library to 1.110.0
- Upgrade version of google-cloud-bigquerystorage library to 0.126.0-beta
0.13.1-beta
- Changed the parallelism parameter to `maxParallelism` in order to reflect the change in the underlying API (the old parameter has been deprecated)
- Upgrade version of google-cloud-bigquerystorage library to 0.122.0-beta.
- Issue #73: Optimized empty projection used for count() execution.
- Issue #121: Added the option to configure CreateDisposition when inserting data to BigQuery.
Notice: Version 0.13.0-beta also included an upgrade to version v1beta2 of the BigQuery Storage API. Due to issues discovered when it is used with custom API roles, that version has been deprecated and its use is not recommended. The 0.13.1-beta version of the connector uses version v1beta1 of the BigQuery Storage API, which was also used in the previous versions.
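The parameter rename and the CreateDisposition option listed for 0.13.1-beta can be illustrated with two small option helpers. Both functions are hypothetical. The `createDisposition` option name is an assumption based on the connector's README, while `CREATE_IF_NEEDED` and `CREATE_NEVER` are the values defined by the BigQuery API.

```python
VALID_CREATE_DISPOSITIONS = {"CREATE_IF_NEEDED", "CREATE_NEVER"}


def migrate_read_options(options):
    """Return a copy of `options` with `parallelism` renamed to `maxParallelism`.

    An explicitly set `maxParallelism` takes precedence over the old key.
    """
    migrated = dict(options)
    if "parallelism" in migrated:
        old_value = migrated.pop("parallelism")
        migrated.setdefault("maxParallelism", old_value)
    return migrated


def write_options(create_disposition="CREATE_IF_NEEDED"):
    """Build write options; the `createDisposition` name is an assumption."""
    if create_disposition not in VALID_CREATE_DISPOSITIONS:
        raise ValueError(f"unknown create disposition: {create_disposition}")
    return {"createDisposition": create_disposition}
```

For example, `migrate_read_options({"parallelism": "8"})` yields `{"maxParallelism": "8"}`, which can then be passed to the reader in place of the deprecated key.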
0.12.0-beta
- Issue #72: Moved the shaded jar name from classifier to a new artifact name. It is now easier to use the connector within Jupyter notebooks
- Issues #73, #87: Added better logging to help understand which columns and filters are requested by Spark, and which are passed down to BigQuery
- Issue #107: The connector will now alert when it is used with the wrong Scala version