diff --git a/CHANGES.md b/CHANGES.md
index ed30915fc..51db2421e 100644
--- a/CHANGES.md
+++ b/CHANGES.md
@@ -1,5 +1,17 @@
 # Release Notes

+## 0.17.0 - 2020-07-15
+* PR #201: [Structured streaming write](http://spark.apache.org/docs/2.4.5/structured-streaming-programming-guide.html#starting-streaming-queries)
+  is now supported (thanks @varundhussa)
+* PR #202: Users now have the option to keep the data on GCS after writing to BigQuery (thanks @leoneuwald)
+* PR #211: Enabled overwriting the data of a single date partition
+* PR #198: Supporting columnar batch reads from Spark in the DataSource V2 implementation. **It is not ready for production use.**
+* PR #192: Supporting `MATERIALIZED_VIEW` as a table type
+* Issue #197: Conditions on StructType fields are now handled by Spark and not the connector
+* BigQuery API has been upgraded to version 1.116.3
+* BigQuery Storage API has been upgraded to version 1.0.0
+* Netty has been upgraded to version 4.1.48.Final (fixing issue #200)
+
 ## 0.16.1 - 2020-06-11
 * PR #186: Fixed SparkBigQueryConnectorUserAgentProvider initialization bug

diff --git a/README.md b/README.md
index e3279952b..473a76175 100644
--- a/README.md
+++ b/README.md
@@ -76,8 +76,8 @@ repository. It can be used using the `--packages` option or the

 | Scala version | Connector Artifact |
 | --- | --- |
-| Scala 2.11 | `com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.16.1` |
-| Scala 2.12 | `com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.16.1` |
+| Scala 2.11 | `com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.17.0` |
+| Scala 2.12 | `com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.17.0` |

 ## Hello World Example

@@ -136,7 +136,10 @@ df.write
   .save("dataset.table")
 ```

-When writing a streaming DataFrame to BigQuery, each batch is written in the same manner as a non-streaming DataFrame. Note that a HDFS compatible [checkpoint location](http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing) (eg: path/to/HDFS/dir or gs://checkpointBucket/checkpointDir) must be specified.
+When streaming a DataFrame to BigQuery, each batch is written in the same manner as a non-streaming DataFrame.
+Note that an HDFS-compatible
+[checkpoint location](http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing)
+(e.g. `path/to/HDFS/dir` or `gs://checkpoint-bucket/checkpointDir`) must be specified.

 ```
 df.writeStream
@@ -146,7 +149,7 @@ df.writeStream
   .option("table", "dataset.table")
 ```

-*Important:* The connector does not configure the GCS connector, in order to avoid conflict with another GCS connector, if exists. In order to use the write capabilities of the connector, please configure the GCS connector on your cluster as explained [here](https://github.com/GoogleCloudPlatform/bigdata-interop/tree/master/gcs).
+**Important:** The connector does not configure the GCS connector, in order to avoid conflicts with another GCS connector, if one exists. In order to use the write capabilities of the connector, please configure the GCS connector on your cluster as explained [here](https://github.com/GoogleCloudPlatform/bigdata-interop/tree/master/gcs).

 ### Properties

@@ -220,7 +223,7 @@ The API Supports a number of options to configure the read
     <td>Read</td>
   </tr>
   <tr valign="top">
-    <td><code>viewMaterializationProject</code></td>
+    <td><code>materializationProject</code></td>
     <td>The project id where the materialized view is going to be created<br/>
       (Optional. Defaults to view's project id)
     </td>
@@ -228,7 +231,7 @@ The API Supports a number of options to configure the read
     <td>Read</td>
   </tr>
   <tr valign="top">
-    <td><code>viewMaterializationDataset</code></td>
+    <td><code>materializationDataset</code></td>
     <td>The dataset where the materialized view is going to be created<br/>
       (Optional. Defaults to view's dataset)
     </td>
@@ -311,7 +314,6 @@ The API Supports a number of options to configure the read
     <td>Write</td>
   </tr>
   <tr valign="top">
-
     <td><code>partitionField</code></td>
@@ -385,6 +386,11 @@ The API Supports a number of options to configure the read
   </tr>
 </table>

+Options can also be set outside of the code, using the `--conf` parameter of `spark-submit` or the `--properties` parameter
+of `gcloud dataproc jobs submit spark`. In order to use this, prepend the prefix `spark.datasource.bigquery.` to any of
+the options; for example, `spark.conf.set("temporaryGcsBucket", "some-bucket")` can also be set as
+`--conf spark.datasource.bigquery.temporaryGcsBucket=some-bucket`.
+
 ### Data types

 With the exception of `DATETIME` and `TIME` all BigQuery data types directed map into the corresponding Spark SQL data type. Here are all of the mappings:
@@ -579,7 +585,7 @@ using the following code:
 ```python
 from pyspark.sql import SparkSession
 spark = SparkSession.builder\
-  .config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.16.1")\
+  .config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.17.0")\
   .getOrCreate()
 df = spark.read.format("bigquery")\
   .load("dataset.table")
@@ -588,7 +594,7 @@ df = spark.read.format("bigquery")\
 **Scala:**
 ```python
 val spark = SparkSession.builder
-  .config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.16.1")
+  .config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.17.0")
   .getOrCreate()
 val df = spark.read.format("bigquery")
   .load("dataset.table")
@@ -596,7 +602,7 @@ val df = spark.read.format("bigquery")
 ```

 In case Spark cluster is using Scala 2.12 (it's optional for Spark 2.4.x, mandatory in 3.0.x), then the relevant package is
-com.google.cloud.spark:spark-bigquery-with-dependencies_**2.12**:0.16.1. In
+com.google.cloud.spark:spark-bigquery-with-dependencies_**2.12**:0.17.0. In
 order to know which Scala version is used, please run the following code:

 **Python:**
@@ -620,14 +626,14 @@ To include the connector in your project:
   <dependency>
     <groupId>com.google.cloud.spark</groupId>
     <artifactId>spark-bigquery-with-dependencies_${scala.version}</artifactId>
-    <version>0.16.1</version>
+    <version>0.17.0</version>
   </dependency>
 ```

 ### SBT

 ```sbt
-libraryDependencies += "com.google.cloud.spark" %% "spark-bigquery-with-dependencies" % "0.16.1"
+libraryDependencies += "com.google.cloud.spark" %% "spark-bigquery-with-dependencies" % "0.17.0"
 ```

 ## Building the Connector
diff --git a/build.sbt b/build.sbt
index 6c21803e0..6c4dc4b20 100644
--- a/build.sbt
+++ b/build.sbt
@@ -24,7 +24,7 @@ lazy val nettyTcnativeVersion = "2.0.29.Final"

 lazy val commonSettings = Seq(
   organization := "com.google.cloud.spark",
-  version := "0.16.2-SNAPSHOT",
+  version := "0.17.0",
   scalaVersion := scala211Version,
   crossScalaVersions := Seq(scala211Version, scala212Version)
 )
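
The structured streaming write added in PR #201 and documented in the README hunk above can be exercised end to end with a short PySpark job. The following is a minimal sketch, not taken from the repository: the `rate` source is used only to keep the example self-contained, and the bucket, checkpoint path, and table name are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-to-bigquery").getOrCreate()

# Any streaming source works here; the built-in rate source keeps the sketch self-contained.
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

query = (stream_df.writeStream
    .format("bigquery")
    .option("temporaryGcsBucket", "some-bucket")                            # staging bucket for the write (placeholder)
    .option("checkpointLocation", "gs://checkpoint-bucket/checkpointDir")   # HDFS-compatible checkpoint location
    .option("table", "dataset.table")                                       # placeholder target table
    .start())

query.awaitTermination()
```

As the README's "Important" note says, the GCS connector must already be configured on the cluster for this write path to work.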
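The new paragraph on setting options outside of the code describes three equivalent places to put the same connector option. Here is a small sketch of the `temporaryGcsBucket` setting in each place; bucket and table names are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bigquery-options").getOrCreate()
df = spark.createDataFrame([("hello", 1), ("world", 2)], ["word", "word_count"])

# 1. Per-write option:
(df.write
    .format("bigquery")
    .option("temporaryGcsBucket", "some-bucket")
    .save("dataset.table"))

# 2. Session-wide, from code (the form quoted in the README change above):
spark.conf.set("temporaryGcsBucket", "some-bucket")

# 3. Outside the code, with the spark.datasource.bigquery. prefix, e.g.:
#    spark-submit --conf spark.datasource.bigquery.temporaryGcsBucket=some-bucket ...
```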
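PR #211's single-date-partition overwrite is only mentioned in the changelog; the diff itself does not show the option that drives it. Assuming the write option is named `datePartition` and takes a `YYYYMMDD` value — an assumption, not confirmed by this diff — a hedged sketch could look like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-overwrite").getOrCreate()
df = spark.createDataFrame([("2020-07-15", 42)], ["event_date", "value"])

# Overwrite only the 2020-07-15 partition of a date-partitioned table.
(df.write
    .format("bigquery")
    .mode("overwrite")
    .option("temporaryGcsBucket", "some-bucket")   # placeholder bucket
    .option("datePartition", "20200715")           # assumed option name and YYYYMMDD format
    .save("dataset.table"))                        # placeholder target table
```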