From 5a9cafd797392e0568f8d86863b7f6b4756d0446 Mon Sep 17 00:00:00 2001
From: Adrian Velonis
Date: Wed, 27 Nov 2024 11:59:28 -0600
Subject: [PATCH] PD-5304: Databricks corrections

---
 ...-set-unset-extended-table-properties.flsnp |   2 +-
 .../change-type-tbl-properties.flsnp          |   2 +-
 ...ks-cluster-columns-partition-columns.flsnp |   6 +
 .../database-databricks-change-types.flsnp    |   2 +-
 Content/change-types/create-table.html        |  30 ++--
 .../databricks/alter-cluster.html             |  19 ++-
 .../databricks/analyze-table.html             |   4 +-
 .../nested-tags/cluster-columns.html          | 118 ---------------
 .../extended-table-properties.html            | 142 +++++++++++++++++-
 .../databricks/optimize-table.html            |   4 +-
 Project/TOCs/TOC.fltoc                        |   3 -
 11 files changed, 179 insertions(+), 153 deletions(-)
 create mode 100644 Content/Z_Resources/Snippets/note/change-type-databricks-cluster-columns-partition-columns.flsnp
 delete mode 100644 Content/change-types/databricks/nested-tags/cluster-columns.html

diff --git a/Content/Z_Resources/Snippets/def/attributes/change-types/databricks/change-type-set-unset-extended-table-properties.flsnp b/Content/Z_Resources/Snippets/def/attributes/change-types/databricks/change-type-set-unset-extended-table-properties.flsnp
index 8ebe645c6..68fc8235c 100644
--- a/Content/Z_Resources/Snippets/def/attributes/change-types/databricks/change-type-set-unset-extended-table-properties.flsnp
+++ b/Content/Z_Resources/Snippets/def/attributes/change-types/databricks/change-type-set-unset-extended-table-properties.flsnp
@@ -5,7 +5,7 @@

Optional.

-

Specifies additional properties about the table. You can use this to specify new properties or replace existing ones.

+

Specifies additional properties. You can use this to specify new properties or replace existing ones.

setExtendedTableProperties has the following nested attributes:

Change Types that accept Databricks attributes or sub-tags:

diff --git a/Content/change-types/create-table.html b/Content/change-types/create-table.html
index 439289e32..8bed9ffea 100644
--- a/Content/change-types/create-table.html
+++ b/Content/change-types/create-table.html
@@ -144,18 +144,10 @@

Nested tags

all yes - - clusterColumns - - Creates a clustered table. -   - databricks - no - extendedTableProperties - Specifies additional properties on a table you're creating. + Specifies additional properties on a table you're creating, such as whether to create clustered or partitioned columns.   databricks no @@ -202,6 +194,7 @@

Examples

       author: your.name
       changes:
         - createTable:
+            tableName: test_table_complex_types
             columns:
               - column:
                   name: my_arrs
@@ -215,7 +208,11 @@

Examples

               - column:
                   name: my_struct
                   type: 'STRUCT<FIELD1: STRING NOT NULL, FIELD2: INT>'
-            tableName: test_table_complex_types
+            extendedTableProperties:
+              clusterColumns: my_arrs, my_arrbi
+              tableFormat: delta
+              tableLocation: s3://databricks-external-folder/test_table_properties
+              tblProperties: 'this.is.my.key'=12,'this.is.my.key2'=true

General example:

{
@@ -256,6 +253,7 @@ 

Examples

         "changes": [
           {
             "createTable": {
+              "tableName": "test_table_complex_types",
               "columns": [
                 {
                   "column": {
@@ -282,7 +280,12 @@

Examples

                   }
                 }
               ],
-              "tableName": "test_table_complex_types"
+              "extendedTableProperties": {
+                "clusterColumns": "my_arrs, my_arrbi",
+                "tableFormat": "delta",
+                "tableLocation": "s3://databricks-external-folder/test_table_properties",
+                "tblProperties": "'this.is.my.key'=12,'this.is.my.key2'=true"
+              }
             }
           }
         ]
@@ -313,6 +316,11 @@

Examples

             <column name="my_arrbi" type="ARRAY&lt;BIGINT&gt;" />
             <column name="my_map" type="MAP&lt;STRING, BIGINT&gt;" />
             <column name="my_struct" type="STRUCT&lt;FIELD1: STRING NOT NULL, FIELD2: INT&gt;" />
+
+            <databricks:extendedTableProperties clusterColumns="my_arrs, my_arrbi"
+                                                tableFormat="delta"
+                                                tableLocation="s3://databricks-external-folder/test_table_properties"
+                                                tblProperties="'this.is.my.key'=12,'this.is.my.key2'=true"/>
         </createTable>
     </changeSet>

diff --git a/Content/change-types/databricks/alter-cluster.html b/Content/change-types/databricks/alter-cluster.html
index a94a073f6..235e1354f 100644
--- a/Content/change-types/databricks/alter-cluster.html
+++ b/Content/change-types/databricks/alter-cluster.html
@@ -9,9 +9,9 @@

alterCluster

alterCluster is a Change Type in the Databricks extension that alters a cluster on a table.

-

To create a cluster, see clusterColumns.

+

To create a new table with a cluster, see extendedTableProperties.

Uses

-

Clustered columns can help optimize performance for some database queries. If you have previously created a table with one or more clustered columns, you can modify which columns are clustered using alterCluster. Specify which columns to cluster using clusterBy.

+

Clustered columns can help optimize performance for some database queries. If you have previously created a table with one or more clustered columns, you can modify which columns are clustered using alterCluster. Specify which columns to de-cluster by using clusterBy. You can also specify a new column to override existing clustering logic.

Changing which columns are clustered can be useful if your data changes significantly or if you begin using different filters to query your data. Better clustering can improve the read efficiency of the new queries.

Databricks does not allow you to drop tables containing clustered columns. You can use alterCluster to remove clustering and then drop the table.

For more information, see Use liquid clustering for Delta tables and ALTER TABLE.
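
The remove-then-drop sequence described above could be sketched as the following changelog. This is an illustration only, not part of the patch: the `tableName` attribute on `alterCluster`, the YAML layout of `clusterBy`, the table name, and the changeset ids are all assumptions.

```yaml
databaseChangeLog:
  - changeSet:
      id: 2                      # hypothetical id
      author: your.name
      changes:
        # Step 1: turn off clustering for the table.
        # clusterBy only accepts none: true.
        - alterCluster:
            tableName: test_table_clustered_new   # assumed attribute and table name
            clusterBy:
              none: "true"
  - changeSet:
      id: 3                      # hypothetical id
      author: your.name
      changes:
        # Step 2: drop the table now that clustering is removed.
        - dropTable:
            tableName: test_table_clustered_new
```

Splitting the two steps into separate changesets is a design choice in this sketch, so that a failed drop does not leave the clustering change unrecorded.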

@@ -48,7 +48,7 @@

clusterBy

Optional.

-

Specifies how to cluster the table. Use this to remove clustering from a column.

+

Specifies how to cluster the table. Use this only to remove clustering from a column, not to add clustering.

clusterBy has the following nested attributes:

  • none (Boolean) (required): if true, turns off clustering for the table being altered. If false, throws an error.
  • @@ -56,7 +56,7 @@

    clusterBy

    columns/column

    Optional.

    -

    An array of column objects that describes columns in the table. The column order does not matter.

    +

    An array of column objects that describes columns in the table. The column order does not matter. Use this to overwrite an existing CLUSTER BY SQL clause or add clustering to a column.

    column has the following nested attributes:

    • name (string) (required): the name of the column to alter.
    • @@ -139,9 +139,16 @@

      Examples

      </databaseChangeLog>
-

Troubleshooting

+

clusterBy parsing error

If you set clusterBy to none=false, Liquibase throws this error:

Unexpected error running Liquibase: Error parsing line 13 column 49 of generated.xml: cvc-enumeration-valid: Value 'false' is not facet-valid with respect to enumeration '[true]'. It must be a value from the enumeration.
+

The purpose of clusterBy is to remove clustering from a column, so it only accepts none=true.

+

If you were trying to add a clustered column to an existing table, use alterCluster and specify the column in columns. In this case, you must omit clusterBy.

+

If you were trying to create a new table with clustering, you must use createTable to specify extendedTableProperties. Then, you can use clusterColumns to specify the columns you want to cluster.
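
As a rough illustration of the first fix above (adding clustering to an existing table), a changeset might look like the sketch below. The `tableName` attribute on `alterCluster`, the table and column names, and the changeset id are assumptions, not taken from this patch.

```yaml
databaseChangeLog:
  - changeSet:
      id: 4                      # hypothetical id
      author: your.name
      changes:
        - alterCluster:
            tableName: test_table_properties   # assumed attribute and table name
            # List the columns to cluster by; this replaces any existing clustering.
            columns:
              - column:
                  name: some_column
            # clusterBy is intentionally omitted: it is only for removing clustering.
```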

+

clusterBy and columns both null

+

If you specify neither clusterBy nor columns, Liquibase throws this error:

Alter Cluster change require list of columns or element 'ClusterBy', please add at least one option.
+

If you were trying to change the name of your table, you must use renameTable instead.

+

Related links

\ No newline at end of file
diff --git a/Content/change-types/databricks/analyze-table.html b/Content/change-types/databricks/analyze-table.html
index 61bb01e6f..c547ec9ed 100644
--- a/Content/change-types/databricks/analyze-table.html
+++ b/Content/change-types/databricks/analyze-table.html
@@ -42,7 +42,7 @@

Available analyzeColumns String - Name of the column(s) to analyze. Specify multiple columns in a comma-separated list. + Name of the column(s) to analyze. Separate multiple values using commas. Optional @@ -99,7 +99,7 @@

Examples

Related links

diff --git a/Content/change-types/databricks/nested-tags/cluster-columns.html b/Content/change-types/databricks/nested-tags/cluster-columns.html deleted file mode 100644 index b56b7e236..000000000 --- a/Content/change-types/databricks/nested-tags/cluster-columns.html +++ /dev/null @@ -1,118 +0,0 @@ - - - <MadCap:variable name="Heading.Level1" /> - - - - - -

clusterColumns -

-

clusterColumns is a tag in the Databricks extension that lets you create a clustered table. It is a sub-tag of the createTable Change Type.

-

To modify an existing cluster, see alterCluster.

-

Uses

-

When you use Liquibase with Databricks, you create tables using the createTable Change Type. Within that Change Type, you can specify clusterColumns to create clustered columns on the table.

-

Clustered columns are an alternative to partitioned columns. They can help optimize performance for some database queries. For example, if you often filter a large table for a small subset of data, storing that data in clustered columns can improve the efficiency of read operations. Databricks recommends using clustering for all new tables you create.

-

Databricks does not allow you to drop tables containing clustered columns. You must use alterCluster to remove clustering before you can drop the table.

-

For more information and use-cases, see Use liquid clustering for Delta tables.

-

Run clusterColumns

- - -

Examples

-
- -
databaseChangeLog:
-  - changeSet:
-      id: 1
-      author: your.name
-      changes:
-        - createTable:
-            tableName: test_table_clustered_new
-            columns:
-              - column:
-                  name: test_id
-                  type: int
-              - column:
-                  name: test_new
-                  type: int
-            clusterColumns: test_id, test_new
-      rollback:
-        dropTable:
-          tableName: test_table_clustered_new
-
-
{
-  "databaseChangeLog": [
-    {
-      "changeSet": {
-        "id": "1",
-        "author": "your.name",
-        "changes": [
-          {
-            "createTable": {
-              "tableName": "test_table_clustered_new",
-              "columns": [
-                {
-                  "column": {
-                    "name": "test_id",
-                    "type": "int"
-                  }
-                },
-                {
-                  "column": {
-                    "name": "test_new",
-                    "type": "int"
-                  }
-                }
-              ],
-              "clusterColumns": "test_id,test_new"
-            }
-          }
-        ],
-        "rollback": [
-          {
-            "dropTable": {
-              "tableName": "test_table_clustered_new"
-            }
-          }
-        ]
-      }
-    }
-  ]
-}
-
-

-
-    <changeSet id="1" author="your.name">
-        <createTable tableName="test_table_clustered_new">
-            <column name="test_id" type="int" />
-            <column name="test_new" type="int"/>
-            <databricks:clusterColumns>test_id,test_new</databricks:clusterColumns>
-        </createTable>
-
-        <rollback>
-            <dropTable tableName="test_table_clustered_new"/>
-        </rollback>
-    </changeSet>
-
-</databaseChangeLog>
-
-
- -

Related links

- - - \ No newline at end of file diff --git a/Content/change-types/databricks/nested-tags/extended-table-properties.html b/Content/change-types/databricks/nested-tags/extended-table-properties.html index 5c51ada4e..57495d6d1 100644 --- a/Content/change-types/databricks/nested-tags/extended-table-properties.html +++ b/Content/change-types/databricks/nested-tags/extended-table-properties.html @@ -32,6 +32,17 @@

Available clusterColumns + + String + +

The columns to cluster. Clusters are an alternative to partitions. They can help optimize performance for some database queries, such as reads that filter a large table for a small subset of data. Databricks recommends using clustering for all new tables you create. For more information, see Use liquid clustering for Delta tables and CLUSTER BY clause (TABLE).

+ +

Databricks does not allow you to drop tables containing clustered columns. You must use alterCluster to remove clustering before you can drop the table.

+ + Optional + tableFormat @@ -59,7 +70,10 @@

Available partitionColumns String - The columns to partition. The column order does not matter. Using partitions can speed up table queries and data manipulation. Partitions are an alternative to clusters. For more information, see Partitions and When to partition tables on Databricks. + +

The columns to partition. The column order does not matter. Using partitions can speed up table queries and data manipulation. Partitions are an alternative to clusters. For more information, see Partitions and When to partition tables on Databricks.

+ + Optional @@ -74,7 +88,8 @@

Examples

  • XML example
  • -
    databaseChangeLog:
    +            
    +

    With clustered columns:

    databaseChangeLog:
       - changeSet:
           id: 1
           author: your.name
    @@ -83,11 +98,39 @@ 

    Examples

                 tableName: test_table_properties
                 columns:
                   - column:
    -                  name: test_id
    +                  name: id
                       type: int
                       constraints:
                         primaryKey: true
                         nullable: false
    +              - column:
    +                  name: some_column
    +                  type: int
    +            extendedTableProperties:
    +              clusterColumns: id, some_column
    +              tableFormat: delta
    +              tableLocation: s3://databricks-external-folder/test_table_properties
    +              tblProperties: 'this.is.my.key'=12,'this.is.my.key2'=true
    +      rollback:
    +        dropTable:
    +          tableName: test_table_properties
    +

    With partitioned columns:

    databaseChangeLog:
    +  - changeSet:
    +      id: 1
    +      author: your.name
    +      changes:
    +        - createTable:
    +            tableName: test_table_properties
    +            columns:
    +              - column:
    +                  name: id
    +                  type: int
    +                  constraints:
    +                    primaryKey: true
    +                    nullable: false
    +              - column:
    +                  name: some_column
    +                  type: int
                 extendedTableProperties:
                   partitionColumns: id, some_column
                   tableFormat: delta
    @@ -97,7 +140,56 @@ 

    Examples

    dropTable: tableName: test_table_properties
    -
    {
    +            
    +

    With clustered columns:

    {
    +  "databaseChangeLog": [
    +    {
    +      "changeSet": {
    +        "id": "1",
    +        "author": "your.name",
    +        "changes": [
    +          {
    +            "createTable": {
    +              "tableName": "test_table_properties",
    +              "columns": [
    +                {
    +                  "column": {
    +                    "name": "id",
    +                    "type": "int",
    +                    "constraints": {
    +                      "primaryKey": "true",
    +                      "nullable": "false"
    +                    }
    +                  }
    +                },
    +                {
    +                  "column": {
    +                    "name": "some_column",
    +                    "type": "int"
    +                  }
    +                }
    +              ],
    +              "extendedTableProperties": {
    +                "clusterColumns": "id, some_column",
    +                "tableFormat": "delta",
    +                "tableLocation": "s3://databricks-external-folder/test_table_properties",
    +                "tblProperties": "'this.is.my.key'=12,'this.is.my.key2'=true"
    +              }
    +            }
    +          }
    +        ],
    +        "rollback": [
    +          {
    +            "dropTable": {
    +              "tableName": "test_table_properties"
    +            }
    +          }
    +        ]
    +      }
    +    }
    +  ]
    +}
    +

    With partitioned columns:

    {
       "databaseChangeLog": [
         {
           "changeSet": {
    @@ -110,13 +202,19 @@ 

    Examples

                   "columns": [
                     {
                       "column": {
    -                    "name": "test_id",
    +                    "name": "id",
                         "type": "int",
                         "constraints": {
                           "primaryKey": "true",
                           "nullable": "false"
                         }
                       }
    +                },
    +                {
    +                  "column": {
    +                    "name": "some_column",
    +                    "type": "int"
    +                  }
                     }
                   ],
                   "extendedTableProperties": {
    @@ -140,13 +238,36 @@

    Examples

    ] }
    -
    
    +            
    +

    With clustered columns:

    
    +
    +    <changeSet id="1" author="your.name">
    +        <createTable tableName="test_table_properties">
    +            <column name="id" type="int" >
    +                <constraints primaryKey="true" nullable="false"/>
    +            </column>
    +            <column name="some_column" type="int"/>
    +
    +            <databricks:extendedTableProperties clusterColumns="id, some_column"
    +                                                tableFormat="delta"
    +                                                tableLocation="s3://databricks-external-folder/test_table_properties"
    +                                                tblProperties="'this.is.my.key'=12,'this.is.my.key2'=true"/>
    +        </createTable>
    +
    +        <rollback>
    +            <dropTable tableName="test_table_properties"/>
    +        </rollback>
    +    </changeSet>
    +
    +</databaseChangeLog>
    +

    With partitioned columns:

    
     
         <changeSet id="1" author="your.name">
             <createTable tableName="test_table_properties">
    -            <column name="test_id" type="int" >
    +            <column name="id" type="int" >
                     <constraints primaryKey="true" nullable="false"/>
                 </column>
    +            <column name="some_column" type="int"/>
     
                 <databricks:extendedTableProperties partitionColumns="id, some_column"
                                                     tableFormat="delta"
    @@ -162,6 +283,13 @@ 

    Examples

    </databaseChangeLog>
    +

    Troubleshooting

    +

    Clustered and partitioned columns collision

    +

    If you specify values for both clusterColumns and partitionColumns on the same table, Liquibase throws this error:

    Databricks does not support CLUSTER columns AND PARTITION BY columns, please pick one.
    +

    Instead, you must specify either clusterColumns or partitionColumns, but not both.

    +

    Extended table properties double initialization

    +

    It is technically possible to specify some Databricks-specific attributes directly in createTable instead of in extendedTableProperties. This is not a best practice. If you specify Databricks-specific attributes in both places, Liquibase throws this error:

    Double initialization of extended table properties is not allowed. Please avoid using both EXT createTable attributes and Databricks specific extendedTableProperties element. Element databricks:extendedTableProperties is preferred way to set databricks specific configurations.
    +

    Instead, you must specify all Databricks-specific attributes in extendedTableProperties.

    Related links

diff --git a/Content/change-types/databricks/optimize-table.html b/Content/change-types/databricks/optimize-table.html
index 2a1ca3c3c..8add4c407 100644
--- a/Content/change-types/databricks/optimize-table.html
+++ b/Content/change-types/databricks/optimize-table.html
@@ -44,7 +44,7 @@

      Available The effectiveness of the locality decreases with each additional column.

      Optional @@ -102,7 +102,7 @@

      Examples

      Related links

diff --git a/Project/TOCs/TOC.fltoc b/Project/TOCs/TOC.fltoc
index 081c74416..65d1ae2ce 100644
--- a/Project/TOCs/TOC.fltoc
+++ b/Project/TOCs/TOC.fltoc
@@ -1302,9 +1302,6 @@
-