
feat: Bigquery as OLAP engine#9161

Open
k-anshul wants to merge 16 commits into main from bigquery_olap

Conversation

@k-anshul (Member) commented Apr 1, 2026

closes https://linear.app/rilldata/issue/PLAT-450/metrics-views-on-bigquery

Added

TODOs to be done in follow-ups:

  • Exports are broken
  • Remove the conversion of civil.Date to time.Time in the Rill driver and handle it wherever required

Checklist:

  • Covered by tests
  • Ran it and it works as intended
  • Reviewed the diff before requesting a review
  • Checked for unhandled edge cases
  • Linked the issues it closes
  • Checked if the docs need to be updated. If so, create a separate Linear DOCS issue
  • Intend to cherry-pick into the release branch
  • I'm proud of this work!

@k-anshul self-assigned this Apr 1, 2026
}

rangeSQL := fmt.Sprintf(
"SELECT min(%[1]s) as `min`, max(%[1]s) as `max`, %[2]s as `watermark` FROM %[3]s %[4]s",
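For readers unfamiliar with Go's indexed format verbs: `%[1]s` reuses the first argument in both min() and max(), `%[2]s` is the watermark expression, `%[3]s` the table, and `%[4]s` an optional filter clause. A minimal sketch with illustrative argument names (not the PR's surrounding code):

```go
package main

import "fmt"

// buildRangeSQL mirrors the rangeSQL construction above. The indexed verb
// %[1]s lets the time column appear in both min() and max() while being
// passed only once. Argument names here are illustrative.
func buildRangeSQL(timeCol, watermark, table, whereClause string) string {
	return fmt.Sprintf(
		"SELECT min(%[1]s) as `min`, max(%[1]s) as `max`, %[2]s as `watermark` FROM %[3]s %[4]s",
		timeCol, watermark, table, whereClause)
}

func main() {
	fmt.Println(buildRangeSQL("event_time", "max(event_time)", "`proj.ds.events`", ""))
}
```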
Member Author


This is not an efficient query, even when run on the partition column.

Member Author


An optimization can be done where we check whether this is the table's partition column and read the min/max directly from the partition metadata.
Given this is a frequently executed query, I think it can be done in a follow-up. @begelundmuller thoughts?
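For concreteness, a sketch of what the metadata-based variant could look like, reading min/max from INFORMATION_SCHEMA.PARTITIONS instead of scanning the table. The helper name and exact SQL shape are assumptions, not the PR's code:

```go
package main

import "fmt"

// buildPartitionRangeSQL sketches the proposed optimization: when the time
// column is the table's partition column, the min/max can be read from
// BigQuery's INFORMATION_SCHEMA.PARTITIONS metadata view, avoiding a
// full-table scan. Note that partition_id is a string (e.g. "20260401"),
// so MIN/MAX here is a lexicographic comparison, which only works because
// same-granularity partition IDs share a fixed-length format.
// The helper name and SQL are illustrative, not from the PR.
func buildPartitionRangeSQL(dataset, table string) string {
	return fmt.Sprintf(
		"SELECT MIN(partition_id) AS `min`, MAX(partition_id) AS `max` "+
			"FROM `%s.INFORMATION_SCHEMA.PARTITIONS` "+
			"WHERE table_name = '%s' AND partition_id != '__NULL__'",
		dataset, table)
}

func main() {
	fmt.Println(buildPartitionRangeSQL("mydataset", "events"))
}
```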

Contributor


If the optimization can be done in a fast/cheap/safe way, then yeah it sounds good to me

Member Author


It can be fast, but to ensure we do not query information_schema again and again, we need to cache the fact that this is the table's partition column, so it requires some changes. Will take it up separately.

@k-anshul k-anshul requested a review from begelundmuller April 2, 2026 13:08
@@ -180,33 +181,157 @@ func (q *TableHead) generalExport(ctx context.Context, rt *runtime.Runtime, inst
}

func (q *TableHead) buildTableHeadSQL(ctx context.Context, olap drivers.OLAPStore) (string, error) {
Contributor


It seems like there's a huge complexity increase in this function. Two questions:

  1. We don't run TableHead very often, so is it necessary to optimize it so hard? In general, I would assume people who connect a BI tool to a data warehouse are fine with a SELECT * FROM tbl LIMIT 100 query being run.
  2. If it really is necessary, is it possible to combine it into one nested query and push it into the dialect somehow?

Member Author

@k-anshul commented Apr 6, 2026


  1. It is used in the data preview. On a 100 TB table this can cost a user 600 dollars. It is a silent "trap" for the user, given BigQuery returns results very fast (as reported by users running such queries on big tables).
    I agree that users should not use bytes-processed pricing when connecting a BI tool, but we should not leave such traps for users.
    For example, I found an issue in Superset where the reporter refused to use Superset with BigQuery until this kind of query was removed: Select * Limit is DANGEROUS in BigQuery apache/superset#17299
  2. For partition pruning the filter has to be static; a dynamic filter is not allowed.

If you are worried about dialect specific complexity in runtime/queries then we can take one of the following approaches:

  1. Disable data preview for BigQuery in UI and return an error in the API.
  2. Use preview table API which is free : https://docs.cloud.google.com/bigquery/docs/samples/bigquery-browse-table#bigquery_browse_table-go

Both approaches are more optimised, given we don't have to scan even one partition (which can still be big).

Contributor


That makes sense. Yeah I'm just a little worried about the driver-specificity in TableHead, especially given we are not adding many new OLAP drivers.

I don't think we should disable previews, but it would just be nice if we could push this into the driver somehow. I'm good with any of these:

  1. Rewrite SELECT * FROM tbl LIMIT n into preview API calls inside OLAPStore.Query itself (similar to the code we have here):
     // Regex to parse BigQuery SELECT ALL statement: SELECT * FROM `project_id.dataset.table`
     var selectQueryRegex = regexp.MustCompile(
     )
  2. Add a Head function on the OLAPStore interface (other drivers can implement it using a normal SELECT *)
  3. Add it to drivers.Dialect somehow (will become clean with Naman's refactors)
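The regex pattern itself is elided in the snippet above. Purely as an illustration of option 1, a pattern along these lines could detect rewritable preview queries; this regex is an assumption, not the PR's actual one:

```go
package main

import (
	"fmt"
	"regexp"
)

// selectAllRegex is an illustrative pattern (not the PR's actual regex) for
// recognising a full-table preview query of the form
// SELECT * FROM `project.dataset.table` [LIMIT n], which the driver could
// route to the free tabledata.list preview API instead of a billed scan.
var selectAllRegex = regexp.MustCompile(
	"(?i)^\\s*SELECT\\s+\\*\\s+FROM\\s+`([^`]+)`\\s*(?:LIMIT\\s+(\\d+))?\\s*$")

func main() {
	m := selectAllRegex.FindStringSubmatch("SELECT * FROM `proj.ds.events` LIMIT 100")
	if m != nil {
		fmt.Println("table:", m[1], "limit:", m[2])
	}
}
```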

Member Author


I implemented the 2nd option. It leads to some duplicated code but seemed the cleanest/safest.
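In shape, option 2 looks roughly like this: a Head method on the store interface that BigQuery can back with the free preview API while other drivers fall back to a plain SELECT *. The interface and types here are illustrative, not the PR's actual signatures:

```go
package main

import "fmt"

// Row is an illustrative stand-in for a driver result row.
type Row map[string]any

// HeadStore sketches option 2: drivers expose a Head method so previews
// avoid a full scan where the backend supports it (BigQuery via
// tabledata.list); others can implement it with a plain SELECT * LIMIT n.
type HeadStore interface {
	Head(table string, limit int) ([]Row, error)
}

// memStore is a toy implementation used to demonstrate the contract.
type memStore struct{ rows []Row }

func (m *memStore) Head(table string, limit int) ([]Row, error) {
	if limit > len(m.rows) {
		limit = len(m.rows)
	}
	return m.rows[:limit], nil
}

func main() {
	var s HeadStore = &memStore{rows: []Row{{"id": 1}, {"id": 2}, {"id": 3}}}
	rows, _ := s.Head("events", 2)
	fmt.Println(len(rows))
}
```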
