Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: simplify generated select distinct expressions #9905

Closed
1 task done
corleyma opened this issue Aug 22, 2024 · 1 comment · Fixed by #9923
Closed
1 task done

feat: simplify generated select distinct expressions #9905

corleyma opened this issue Aug 22, 2024 · 1 comment · Fixed by #9923
Assignees
Labels
feature Features or general enhancements
Milestone

Comments

@corleyma
Copy link

Is your feature request related to a problem?

Originally discussed here.

When compiling expressions that use .distinct() to SQL, it seems like ibis always generates a subquery selection.

E.g., consider the following ibis program.

import ibis

# mock tables
table_a = ibis.table(
    [
        ('id', 'int64'),
        ('event_id', 'int64'),
        ('event_code', 'string'),
        ('event_date', 'date'),
        ('event_timestamp', 'timestamp'),
        ('event_type', 'int64'),
    ],
    name='table_a'
)

table_b = ibis.table(
    [
        ('id', 'int64'),
        ('location_id', 'int64'),
        ('external_id', 'string'),
    ],
    name='table_b'
)

# Create the expression
expr  = (
    table_a.left_join(table_b, table_a.id == table_b.id)
    .filter(table_a.event_type == 12345)
    .select(
        [
            table_a.id,
            table_b.location_id,
            table_b.external_id,
            table_a.event_id,
            table_a.event_code,
            table_a.event_date,
            table_a.event_timestamp,
        ]
    )
    .distinct()
)

ibis.to_sql(expr)

This will produce something like:

SELECT DISTINCT
  *
FROM (
  SELECT
    "t4"."id",
    "t4"."location_id",
    "t4"."external_id",
    "t4"."event_id",
    "t4"."event_code",
    "t4"."event_date",
    "t4"."event_timestamp"
  FROM (
    SELECT
      "t2"."id",
      "t2"."event_id",
      "t2"."event_code",
      "t2"."event_date",
      "t2"."event_timestamp",
      "t2"."event_type",
      "t3"."id" AS "id_right",
      "t3"."location_id",
      "t3"."external_id"
    FROM "table_a" AS "t2"
    LEFT OUTER JOIN "table_b" AS "t3"
      ON "t2"."id" = "t3"."id"
  ) AS "t4"
  WHERE
    "t4"."event_type" = 12345
) AS "t5"

but ideally we'd get something like:

SELECT DISTINCT
  "t4"."id",
  "t4"."location_id",
  "t4"."external_id",
  "t4"."event_id",
  "t4"."event_code",
  "t4"."event_date",
  "t4"."event_timestamp"
FROM (
  SELECT
    "t2"."id",
    "t2"."event_id",
    "t2"."event_code",
    "t2"."event_date",
    "t2"."event_timestamp",
    "t2"."event_type",
    "t3"."id" AS "id_right",
    "t3"."location_id",
    "t3"."external_id"
  FROM "table_a" AS "t2"
  LEFT OUTER JOIN "table_b" AS "t3"
    ON "t2"."id" = "t3"."id"
) AS "t4"
WHERE
  "t4"."event_type" = 12345
AS "t5"

What is the motivation behind your request?

Analysts say the SQL generated by ibis is sometimes difficult to read, which can make it difficult to debug ibis logic and, importantly, also makes it difficult to communicate about what ibis is doing to audiences that primarily understand SQL.

This example with distinct() is one that came up frequently in my discussions. So, the goal of this feature is primarily to improve the legibility of SQL generated by ibis to humans. It's also possible that it could also improve performance for some backends, though I imagine most would optimize this away when generating their internal execution plan.

Describe the solution you'd like

per @cpcloud here:

This would probably have to be an optimization in our expression rewriting system.

Alternatively, if simplifying distinct expressions generically is too complex, allowing a .select() expression to take a distinct=True argument and generating directly a SELECT DISTINCT would also be an acceptable solution.

What version of ibis are you running?

9.3.0

What backend(s) are you using, if any?

this problem is backend-independent

Code of Conduct

  • I agree to follow this project's Code of Conduct
@corleyma corleyma added the feature Features or general enhancements label Aug 22, 2024
@jcrist
Copy link
Member

jcrist commented Aug 22, 2024

Thanks for opening this! I have a branch somewhere that added this optimization, I'll try and push it up tomorrow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Features or general enhancements
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants