Snowflake supports the following types of joins:
- inner join (only return matched rows)
- outer join (= left/right/full-outer join. return all rows, including those without a match)
- cross join (return Cartesian product, rarely used)
- natural join (= inner join without on clause, so it implicitly use same-named cols from two tables, can be risky to use)
If you have UNIQUE, PRIMARY KEY, FOREIGN KEY constraints defined on tables, and your data does comply with these constraints, then if you specify rely
on these constraints, sf can use them to eliminate unnecessary joins (where the data in the joined table is not actually used in what you select in the SELECT, or, even if used, could get the same result without a join).
A subquery is a query within another query.
The only subquery that allows a LIMIT/FETCH clause is an uncorrelated scalar subquery. But, because an uncorrelated scalar subquery returns only 1 row, the LIMIT clause is basically meaningless.
Many types of data are best represented as a hierarchy, such as an organization tree. Employees are usually organized in a hierarchy, with a company President at the top of the hierarchy. Another example of a hierarchy is a "parts explosion". For example, a car contains an engine; an engine contains a fuel pump; and a fuel pump contains a hose.
You can store hierarchical data in:
- A hierarchy of tables.
- A single table with column(s) representing the hierarchy,e.g. indicating each employee's direct manager.
- semi-structured data
If the number of levels is unknown/changing, so that it is not possible to create a hierarchy with a known number of tables, then the hierarchical data can be stored in one table. But if the data at different levels doesn't share the same record structure, then storing all in one table might not be practical.
Ways to query them in relational table(s):
- if num of levels known, use joins
- if num of levels unknown, use recursive CTE or CONNECT BY
CONNECT BY only allow for self-joins. Recursive CTE is more flexible, allows any joins.
You can use recursive CTEs and CONNECT BY on a single table that contains multiple trees, but you can only query one tree at a time, and that tree must be contiguous.
Common table expressions, a named subquery defined in a WITH clause. CTEs increase modularity and simplify maintenance.
If CTE has same name with a db object, then the CTE name takes precedence. But try not to use duplicate names to avoid confusion.
Constructing a recursive CTE incorrectly can cause an infinite loop.
read, skipped. Queries see Chapter 0.
To identify sequences of rows that match a specific pattern, use the MATCH_RECOGNIZE subclause of the FROM clause.
You cannot use the MATCH_RECOGNIZE clause in a recursive common table expression (CTE).
Used to generate unique numbers across sessions and statements, including concurrent statements. They can be used to generate values for a primary key or any column that requires a unique value.
- The generated numbers consistently increase in value, or decrease, but are not necessarily contiguous.
- All values generated by a sequence are globally unique as long as the sign of the step does not change. Concurrent queries never observe the same value, and values within a single query are always distinct.
- Snowflake may calculate the next value for a sequence as soon as the current sequence number is used, rather than waiting until the next sequence number is requested. So an
ALTER SEQUENCE ... SET INCREMENT ...
command may not affect the next operation that uses it.
When a query is executed, the result is cached for 24hrs (result cache).
This can be used for retrieval optimization to avoid re-generating query results when query & the original data did not change. You can also use this cache to do post-processing.
Query results are reused if all of the following conditions are met:
- The new query exactly matches the old query, in terms of syntax.
- The query does not include functions like CURRENT_TIMESTAMP() etc. CURRENT_DATE() works though...
- The query does not include UDFs or external functions.
- The underlying table data contributing to the query result has not changed.
- The result cache of the old query is still there.
- The role accessing the cache has the required privileges. For a SELECT query, the role must have the necessary access privileges for all the underlying tables. For a SHOW query, the role must match the role that generated the cache.
- Any config options that affect the results have not changed.
- The table's micro-partitions have not changed (not reclustered/consolidated).
By default, result cache is enabled, but can be overridden at the account/user/session level using the USE_CACHED_RESULT
session param.
Post-processing Query Results use cases:
- You are developing a complex query step-by-step, you want to add a new layer on top of the previous query and run the new query without recalculating the prev results.
- The previous query was a SHOW/DESCRIBE/CALL statement, or a SP, which returns results in a form that are not easy to reuse.
Post-processing can be performed using the RESULT_SCAN
table function, which returns the results of a previous query as a "table".
Ways to get the num of rows with distinct values:
- count(distinct ...)
- HyperLogLog functions such as approx_count_distinct for large inputs, can do trillions+ of rows
- for hierarchical aggregations, you can improve performance by bitmaps or producing arrays with distinct vals.
MinHash for efficiently estimating the similarity between two or more data sets.
The Space-Saving Algorithm is an efficient way of estimating frequent values in data sets.
Aggregate functions include approx_top_k, etc.
The t-Digest algorithm is an efficient way of estimating percentile values in data sets. Aggregate functions include approx_percentile, etc.
Provides execution details for a query - a graphical representation of the main components of the processing plan, statistics for each component, details and statistics for the overall query.
Helps to understand the performance/behavior of a particular query - spot mistakes in the query, identify performance bottlenecks, and improvement opportunities.
Operators are the functional building blocks of a query. Operator types:
- data access and generation operators: TableScan, ValuesClause, Generator, ExternalScan, InternalObject.
- data processing operators: Filter, Join, Aggregate, GroupingSets, WindowFunction, Sort, SortWithLimit, Flatten, JoinFilter, UnionAll, ExternalFunction.
- DML operators: Insert, Delete, Update, Merge, Unload
- metadata operators: DDL and Transaction Commands, Table Creation Command, Query Result Reuse, Metadata-based Result
- miscellaneous operators: Result
Execution time can be broken down into:
- Processing: time spent on data processing by the CPU.
- Local Disk IO: time when the processing was blocked by local disk access.
- Remote Disk IO: time when the processing was blocked by remote disk access.
- Network Communication: time when the processing was waiting for the network data transfer.
- Synchronization: various synchronization activities between participating processes.
- Initialization: time spent setting up the query processing.
Statistics:
- IO:
- Scan progress: % of data scanned for a given table
- Bytes scanned
- Percentage scanned from cache: from the local disk cache
- Bytes written
- Bytes written to result
- Bytes read from result
- External bytes scanned: bytes read from an external object, e.g. a stage
- DML:
- Number of rows inserted/updated/deleted/unloaded
- Pruning:
- Partitions scanned
- Partitions total: total num of partitions in the table
- Spilling: disk usage where intermediate results do not fit in memory
- Bytes spilled to local storage: to local disk
- Bytes spilled to remote storage: to remote disk
- Network: network communication
- Bytes sent over the network
- External Functions: calls to external functions
Common query problems identified by query profile:
- exploding joins: did not provide join condition, resulting in a cartesian product
- UNION without ALL: UNION does de-dup. Use UNION ALL when do not need to de-dup
- queries too large to fit in memory: See data spilling to local disk then remote disk. Need to use a larger wh, or, process the data in smaller batches.
- inefficient pruning: "partitions scanned" almost equal to "partitions total".
Recommend to cancel a statement using either:
- the interface of the app in which the query is running (e.g. Worksheet in the SnowSight)
- or, the cancellation API of the Snowflake ODBC/JDBC driver
However, sometimes using sql is necessary.
In query history, some queries do not have warehouse size, this is because they used cache, and never required a warehouse. If a client shows Go, it means this query started from SnowSight, because SnowSight was written in the Go language. If a query was initiated in the SnowSQL, it will show Python, because SnowSQL is written in python. If from Tableau, it will show JDBC.
Query tag - if you are from IT team, and need to debug/optimize queries, you can tag these queries using alter session, and put ticket/Jira number in it, so the prod department know where these additional compute cost came from.