logicoffee/aqa
analytics-query-analyzer

Analyze analytics SQL to extract referenced tables/columns and time bounds.

Support

BigQuery and Redshift are currently supported.

Install

Base install:

pip install analytics-query-analyzer

BigQuery extras (only needed for build_schema; quoted so the brackets survive shells like zsh):

pip install 'analytics-query-analyzer[bigquery]'

Redshift extras (only needed for build_schema):

pip install 'analytics-query-analyzer[redshift]'

Usage

analyze

Extract table/column references from a query.

Schema format follows sqlglot conventions, with nested fields represented as nested dicts.

from analytics_query_analyzer import analyze
from sqlglot import dialects

schema = {
    "production": {
        "shop": {
            "orders": {
                "id": "int64",
                "ordered_at": "datetime",
                "user_id": "int64",
                "payment": {
                    "amount": "int64",
                    "method": "string",
                },
            },
        },
    },
}

sql = """
select
    id,
    user_id,
    payment.amount
from
    shop.orders
"""

references = analyze(dialects.BigQuery, sql, schema, "production")
print(references)
# [
#   {
#     "database": "production",
#     "schema": "shop",
#     "table": "orders",
#     "column": "id",
#   },
#   {
#     "database": "production",
#     "schema": "shop",
#     "table": "orders",
#     "column": "payment.amount",
#   },
#   {
#     "database": "production",
#     "schema": "shop",
#     "table": "orders",
#     "column": "user_id",
#   },
# ]
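The returned list is flat, one entry per column. If a per-table view is more convenient, it can be grouped with plain Python; a minimal sketch, assuming the output shape shown above:

```python
from collections import defaultdict

# The reference list as documented above (normally the return value of analyze).
references = [
    {"database": "production", "schema": "shop", "table": "orders", "column": "id"},
    {"database": "production", "schema": "shop", "table": "orders", "column": "payment.amount"},
    {"database": "production", "schema": "shop", "table": "orders", "column": "user_id"},
]

# Group column names under their fully qualified table name.
columns_by_table = defaultdict(list)
for ref in references:
    key = f'{ref["database"]}.{ref["schema"]}.{ref["table"]}'
    columns_by_table[key].append(ref["column"])

print(dict(columns_by_table))
# {'production.shop.orders': ['id', 'payment.amount', 'user_id']}
```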

analyze with time bounds

Extract time bounds from filters by enabling with_timebounds.

from analytics_query_analyzer import analyze
from sqlglot import dialects

schema = {
    "production": {
        "shop": {
            "orders": {
                "id": "int64",
                "ordered_at": "datetime",
                "user_id": "int64",
            },
        },
    },
}

sql = """
select
    *
from
    shop.orders
where
    ordered_at >= "2025-01-01"
    and ordered_at < "2026-01-01"
"""

timebounds = analyze(
    dialects.BigQuery,
    sql,
    schema,
    "production",
    with_timebounds=True,
)
print(timebounds)
# [
#   {
#     "database": "production",
#     "schema": "shop",
#     "table": "orders",
#     "column": "id",
#     "lower": None,
#     "upper": None,
#   },
#   {
#     "database": "production",
#     "schema": "shop",
#     "table": "orders",
#     "column": "ordered_at",
#     "lower": "2025-01-01",
#     "upper": "2026-01-01",
#   },
#   {
#     "database": "production",
#     "schema": "shop",
#     "table": "orders",
#     "column": "user_id",
#     "lower": None,
#     "upper": None,
#   },
# ]
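Every referenced column appears in the result, with None for bounds the filters do not pin down. To find which columns are actually constrained on both sides, a small filter over the documented output shape is enough (a sketch; the list below stands in for a real analyze result):

```python
# Time-bound entries as documented above (normally returned by analyze).
timebounds = [
    {"database": "production", "schema": "shop", "table": "orders",
     "column": "id", "lower": None, "upper": None},
    {"database": "production", "schema": "shop", "table": "orders",
     "column": "ordered_at", "lower": "2025-01-01", "upper": "2026-01-01"},
    {"database": "production", "schema": "shop", "table": "orders",
     "column": "user_id", "lower": None, "upper": None},
]

# Keep only columns whose filters pin both a lower and an upper bound.
bounded = [tb["column"] for tb in timebounds
           if tb["lower"] is not None and tb["upper"] is not None]
print(bounded)
# ['ordered_at']
```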

To make current_date() deterministic, pass a provider:

timebounds = analyze(
    dialects.BigQuery,
    "select * from shop.orders where ordered_at >= current_date()",
    schema,
    "production",
    with_timebounds=True,
    current_date_provider=lambda: "2026-01-01",
)

build_schema

Fetch a schema dictionary from BigQuery or Redshift.

from analytics_query_analyzer import build_schema
from sqlglot import dialects

schema = build_schema(dialects.BigQuery, "my_project", "my_schema", "my_table")
print(schema)

  • Authentication uses Application Default Credentials (ADC).
  • When table is omitted, it scans all tables in the dataset.
  • When both dataset and table are omitted, it scans all datasets in the project.
  • The returned schema can be passed directly to analyze.
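Because the returned dict nests database → schema → table → columns, plain traversal works for inventory tasks. A sketch that counts columns per table, using a hand-written schema in place of a real build_schema result:

```python
# Hand-written stand-in for a build_schema result (same nested shape).
schema = {
    "production": {
        "shop": {
            "orders": {"id": "int64", "ordered_at": "datetime", "user_id": "int64"},
            "users": {"id": "int64", "name": "string"},
        },
    },
}

# Walk database -> schema -> table and count each table's columns.
column_counts = {
    f"{db}.{sch}.{tbl}": len(cols)
    for db, schemas in schema.items()
    for sch, tables in schemas.items()
    for tbl, cols in tables.items()
}
print(column_counts)
# {'production.shop.orders': 3, 'production.shop.users': 2}
```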

Fetching from Redshift is also supported:

from analytics_query_analyzer import build_schema
from sqlglot import dialects

schema = build_schema(dialects.Redshift, "my_database", "public", "orders")
print(schema)

Redshift authentication supports two modes:

  • Password: set REDSHIFT_HOST, REDSHIFT_USER, and REDSHIFT_PASSWORD (REDSHIFT_PORT is optional).
  • IAM: set REDSHIFT_CLUSTER_IDENTIFIER, REDSHIFT_REGION, and REDSHIFT_DB_USER (REDSHIFT_HOST and REDSHIFT_PORT may also be set).

To select a non-default AWS profile, set AWS_PROFILE.
