Conversation
```go
rangeSQL := fmt.Sprintf(
	"SELECT min(%[1]s) as `min`, max(%[1]s) as `max`, %[2]s as `watermark` FROM %[3]s %[4]s",
```
---
This is not an efficient query, even when running on the partition column.
---
An optimization can be done where we check whether this is the partition column of the table and read min/max directly from the partition metadata.
Given this is an often-executed query, I think it can be done in a follow-up. @begelundmuller thoughts?
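For context, the proposed metadata shortcut could look roughly like this on BigQuery, where `INFORMATION_SCHEMA.PARTITIONS` is a metadata view rather than a table scan. This is only a sketch under stated assumptions: the dataset/table names are hypothetical, and a real implementation would need to handle quoting, ingestion-time partitioning, and the non-partitioned fallback:

```go
package main

import "fmt"

// buildPartitionRangeSQL sketches the proposed optimization: when the range
// column is the table's partition column, read min/max from partition
// metadata instead of scanning the table. BigQuery's
// INFORMATION_SCHEMA.PARTITIONS view is metadata-only, so this avoids a
// full-column scan.
func buildPartitionRangeSQL(dataset, table string) string {
	return fmt.Sprintf(
		"SELECT MIN(partition_id) AS `min`, MAX(partition_id) AS `max` "+
			"FROM `%s.INFORMATION_SCHEMA.PARTITIONS` "+
			"WHERE table_name = '%s' "+
			"AND partition_id IS NOT NULL AND partition_id != '__NULL__'",
		dataset, table,
	)
}

func main() {
	fmt.Println(buildPartitionRangeSQL("mydataset", "events"))
}
```

Date/time partition IDs sort lexicographically (e.g. `20240131`), so string min/max lines up with the time range; integer-range partitions would need separate handling.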
---
If the optimization can be done in a fast/cheap/safe way, then yeah it sounds good to me
---
It can be fast, but to ensure that we do not query information_schema again and again, we need to cache the fact that this is the table's partition column, which requires some changes. Will take it up separately.
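The caching described above could be as simple as a memoized lookup. A rough, hypothetical sketch (the injected `lookup` function stands in for the real information_schema query):

```go
package main

import (
	"fmt"
	"sync"
)

// partitionColCache memoizes each table's partition column so that
// information_schema is queried at most once per table.
type partitionColCache struct {
	mu   sync.Mutex
	cols map[string]string
	// lookup performs the information_schema query; injected so it can be
	// stubbed out in tests.
	lookup func(table string) (string, error)
}

func (c *partitionColCache) partitionColumn(table string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if col, ok := c.cols[table]; ok {
		return col, nil // cache hit: no information_schema round trip
	}
	col, err := c.lookup(table)
	if err != nil {
		return "", err
	}
	c.cols[table] = col
	return col, nil
}

func main() {
	calls := 0
	cache := &partitionColCache{
		cols: map[string]string{},
		lookup: func(table string) (string, error) {
			calls++ // counts how often we would hit information_schema
			return "event_date", nil
		},
	}
	cache.partitionColumn("events")
	col, _ := cache.partitionColumn("events")
	fmt.Println(col, calls)
}
```

A production version would also want invalidation on schema changes, but the memoization shape is the same.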
```diff
@@ -180,33 +181,157 @@ func (q *TableHead) generalExport(ctx context.Context, rt *runtime.Runtime, inst
 }

 func (q *TableHead) buildTableHeadSQL(ctx context.Context, olap drivers.OLAPStore) (string, error) {
```
---
It seems like there's a huge complexity increase in this function. Two questions:

- We don't run `TableHead` very often, so is it necessary to optimize it so hard? In general, I would assume people who connect a BI tool to a data warehouse are fine with a `SELECT * FROM tbl LIMIT 100` query being run.
- If it really is necessary, is it possible to combine it into one nested query and push it into the dialect somehow?
---
- It is used in data preview. On a 100 TB table this can cost a user 600 dollars. This can be a silent "trap" for a user, given BigQuery returns results very fast (as reported by users running such queries on big tables). I agree that users should not use bytes-processed-based pricing when connecting to a BI tool, but we should not leave such traps for users. For example, I found this issue in Superset, where the reporter refused to use Superset with BigQuery until these kinds of queries were removed: Select * Limit is DANGEROUS in BigQuery apache/superset#17299
- For partition pruning, the filter has to be a static filter; using a dynamic filter is not allowed.

If you are worried about dialect-specific complexity in runtime/queries, then we can take one of the following approaches:

- Disable data preview for BigQuery in the UI and return an error in the API.
- Use the preview table API, which is free: https://docs.cloud.google.com/bigquery/docs/samples/bigquery-browse-table#bigquery_browse_table-go

Both approaches make this more optimised, given we don't have to scan even one partition (which can still be big).
---
That makes sense. Yeah, I'm just a little worried about the driver-specificity in `TableHead`, especially given we are not adding many new OLAP drivers.

I don't think we should disable previews, but it would just be nice if we could push this into the driver somehow. I'm good with any of these:

- Rewrite `SELECT * FROM tbl LIMIT n` into preview API calls inside `OLAPStore.Query` itself (similar to the code we have here: rill/runtime/drivers/bigquery/warehouse.go, lines 38 to 39 in 93b278f)
- Add a `Head` function on the `OLAPStore` interface (other drivers can implement it using a normal `SELECT *`)
- Add to the `drivers.Dialect` somehow (will become clean with Naman's refactors)
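As an illustration of the first option, the rewrite would need to recognize head-style queries before dispatching them to the preview API. A toy sketch of that detection under stated assumptions (a real implementation would use a SQL parser rather than a regexp, and the function names here are hypothetical):

```go
package main

import (
	"fmt"
	"regexp"
)

// headQueryRE is a rough pattern for "SELECT * FROM tbl LIMIT n" queries that
// could be served by a free preview/tabledata.list call instead of a billable
// scan. It deliberately does not handle quoted identifiers or whitespace
// inside names; a parser would be needed for that.
var headQueryRE = regexp.MustCompile(`(?i)^\s*SELECT\s+\*\s+FROM\s+([\w.]+)\s+LIMIT\s+(\d+)\s*$`)

// rewriteAsPreview reports whether sql is a plain head query, returning the
// table name and limit so the caller can route it to the preview API.
func rewriteAsPreview(sql string) (table string, limit string, ok bool) {
	m := headQueryRE.FindStringSubmatch(sql)
	if m == nil {
		return "", "", false
	}
	return m[1], m[2], true
}

func main() {
	tbl, n, ok := rewriteAsPreview("SELECT * FROM mydataset.events LIMIT 100")
	fmt.Println(tbl, n, ok)
	_, _, ok = rewriteAsPreview("SELECT id FROM mydataset.events LIMIT 100")
	fmt.Println(ok)
}
```

Anything the pattern rejects (projections, filters, joins) would fall through to the normal query path.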
---
I implemented the 2nd option. It leads to some duplicate code but seemed the cleanest/safest.
closes https://linear.app/rilldata/issue/PLAT-450/metrics-views-on-bigquery
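For illustration, the shape of the second option: a `Head`-style method with a generic SQL fallback that non-BigQuery drivers can share, while BigQuery overrides it with the free preview API. The names here are hypothetical and simplified relative to the actual `OLAPStore` interface:

```go
package main

import "fmt"

// headStore sketches the Head addition: each driver returns the SQL (or an
// API-backed plan) for previewing the first n rows of a table.
type headStore interface {
	HeadSQL(table string, limit int) string
}

// genericStore is the shared fallback other drivers can embed: a plain
// SELECT *, which is cheap everywhere except bytes-scanned-priced warehouses.
type genericStore struct{}

func (genericStore) HeadSQL(table string, limit int) string {
	return fmt.Sprintf("SELECT * FROM %s LIMIT %d", table, limit)
}

func main() {
	var s headStore = genericStore{}
	fmt.Println(s.HeadSQL("events", 100))
}
```

A BigQuery implementation would satisfy the same interface but route to the preview (tabledata.list) path instead of emitting SQL.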
Added TODOs to be done in follow-ups:

- Convert `civil.Date` to `time.Time` in the rill driver and handle it wherever required

Checklist: