CLI: add tables/find commands #4075

Open
MonkeyCanCode wants to merge 18 commits into apache:main from MonkeyCanCode:cli_summary_subcommand_v2

@MonkeyCanCode commented Mar 28, 2026

This is phase two of "CLI: Add summarize subcommand". Based on great feedback from @flyrain and the community on the ML, this PR adds the following:

  1. find command to locate identifiers via fuzzy search
  2. tables command to handle basic Iceberg table operations (get/list/summarize/non-purge delete)

Also, a blank line is now added between sections in the output of the summarize subcommands introduced in phase one, for readability.

While working on this, I noticed that our test suites for the CLI are a bit messy. I will create a follow-up PR to clean them up and add the missing ones (the missing test cases are currently tracked via #4017).

Here are a couple of sample outputs:

Find command

# fuzzy search for all entities across all catalogs
➜  polaris git:(cli_summary_subcommand_v2) ./polaris --profile dev find user
Searching for 'user'...
[Global]
  Principal:           quickstart_user
  Principal:           readonly_user
  Principal:           dev_user
  Principal Role:      quickstart_user_role
  Principal Role:      readonly_user_role
  Principal Role:      dev_user_role

[Catalog: quickstart_catalog]
  Table:               dev_namespace.sub_namespace.user
  View:                dev_namespace.sub_namespace.user_view

Found 8 matches (3 Principals, 3 Principal Roles, 1 Table, 1 View).

# fuzzy search for all entities within a single catalog
➜  polaris git:(cli_summary_subcommand_v2) ./polaris --profile dev find dev --catalog quickstart_catalog
Searching for 'dev'...
[Catalog: quickstart_catalog]
  Catalog Role:        dev_catalog_role
  Namespace:           dev_namespace

Found 2 matches (1 Catalog Role, 1 Namespace).

# fuzzy search for entity catalog role within a single catalog
➜  polaris git:(cli_summary_subcommand_v2) ./polaris --profile dev find dev --catalog quickstart_catalog --type catalog-role
Searching for 'dev'...
[Catalog: quickstart_catalog]
  Catalog Role:        dev_catalog_role

Found 1 matches (1 Catalog Role).

Tables command

# list tables
➜  polaris git:(cli_summary_subcommand_v2) ✗ ./polaris --profile dev tables list --catalog quickstart_catalog --namespace dev_namespace.sub_namespace
{"namespace": ["dev_namespace", "sub_namespace"], "name": "user"}

# get full table metadata
➜  polaris git:(cli_summary_subcommand_v2) ✗ ./polaris --profile dev tables get user --catalog quickstart_catalog --namespace dev_namespace.sub_namespace
{"metadata-location": "file:/var/tmp/quickstart_catalog/dev_namespace/sub_namespace/user/metadata/00002-fa1347d8-c14a-4af7-974d-2e80bc0a5866.metadata.json", "metadata": {"format-version": 3, "table-uuid": "35836a86-bf3a-43df-a6a4-ace9e5c8fb22", "location": "file:///var/tmp/quickstart_catalog/dev_namespace/sub_namespace/user", "last-updated-ms": 1774722865518, "next-row-id": 1, "properties": {"owner": "yong", "created-at": "2026-03-28T18:34:23.090216Z", "write.distribution-mode": "range", "write.parquet.compression-codec": "zstd"}, "schemas": [{"type": "struct", "fields": [{"id": 1, "name": "id", "type": "long", "required": true, "doc": "Row ID"}, {"id": 2, "name": "user", "type": {"type": "struct", "fields": [{"id": 9, "name": "user_id", "type": "string", "required": false}, {"id": 10, "name": "name", "type": "string", "required": false}, {"id": 11, "name": "address", "type": {"type": "struct", "fields": [{"id": 12, "name": "street", "type": "string", "required": false}, {"id": 13, "name": "city", "type": "string", "required": false}, {"id": 14, "name": "country", "type": "string", "required": false}]}, "required": false}]}, "required": true, "doc": "User info"}, {"id": 3, "name": "tags", "type": {"type": "list", "element-id": 15, "element": "string", "element-required": false}, "required": false, "doc": "tags"}, {"id": 4, "name": "attributes", "type": {"type": "map", "key-id": 16, "key": "string", "value-id": 17, "value": "string", "value-required": false}, "required": false, "doc": "User attributes"}, {"id": 5, "name": "events", "type": {"type": "list", "element-id": 18, "element": {"type": "struct", "fields": [{"id": 19, "name": "event_type", "type": "string", "required": false}, {"id": 20, "name": "event_time", "type": "timestamptz", "required": false}, {"id": 21, "name": "metadata", "type": {"type": "map", "key-id": 22, "key": "string", "value-id": 23, "value": "string", "value-required": false}, "required": false}]}, "element-required": false}, "required": 
false, "doc": "User event history"}, {"id": 6, "name": "event_data", "type": "variant", "required": false, "doc": "User event data"}, {"id": 7, "name": "category", "type": "string", "required": true, "doc": "Event category"}, {"id": 8, "name": "created_at", "type": "timestamptz", "required": true, "doc": "Event creation time"}]}], "current-schema-id": 0, "last-column-id": 23, "partition-specs": [{"fields": [{"field-id": 1000, "source-id": 8, "name": "created_at_day", "transform": "day"}, {"field-id": 1001, "source-id": 7, "name": "category", "transform": "identity"}]}], "default-spec-id": 0, "last-partition-id": 1001, "sort-orders": [{"fields": []}, {"fields": [{"source-id": 8, "transform": "identity", "direction": "desc", "null-order": "nulls-last"}, {"source-id": 1, "transform": "identity", "direction": "asc", "null-order": "nulls-first"}]}], "default-sort-order-id": 1, "snapshots": [{"snapshot-id": 201003753560339990, "sequence-number": 1, "timestamp-ms": 1774722865518, "manifest-list": "file:/var/tmp/quickstart_catalog/dev_namespace/sub_namespace/user/metadata/snap-201003753560339990-1-e0dcc235-e5a1-454a-a303-6a1c8fa22525.avro", "first-row-id": 0, "summary": {"operation": "append", "spark.app.id": "local-1774722859049", "added-data-files": "1", "added-records": "1", "added-files-size": "5600", "changed-partition-count": "1", "total-records": "1", "total-files-size": "5600", "total-data-files": "1", "total-delete-files": "0", "total-position-deletes": "0", "total-equality-deletes": "0", "engine-version": "4.0.2", "app-id": "local-1774722859049", "engine-name": "spark", "iceberg-version": "Apache Iceberg 1.10.1 (commit ccb8bc435062171e64bc8b7e5f56e6aed9c5b934)"}, "schema-id": 0}], "refs": {"main": {"type": "branch", "snapshot-id": 201003753560339990}}, "current-snapshot-id": 201003753560339990, "last-sequence-number": 1, "snapshot-log": [{"snapshot-id": 201003753560339990, "timestamp-ms": 1774722865518}], "metadata-log": [{"metadata-file": 
"file:/var/tmp/quickstart_catalog/dev_namespace/sub_namespace/user/metadata/00000-9cac3cd7-7dbd-4355-be3d-2d3da33d3158.metadata.json", "timestamp-ms": 1774722863092}, {"metadata-file": "file:/var/tmp/quickstart_catalog/dev_namespace/sub_namespace/user/metadata/00001-ef4623e9-286d-4859-9aa6-e90e968b8b12.metadata.json", "timestamp-ms": 1774722863221}], "statistics": [], "partition-statistics": []}}

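For scripting, the JSON printed by tables get can be read directly. The following is a minimal sketch, assuming the output has been captured into a string. The summarize_metadata helper and its field names are hypothetical and not part of the CLI, and the sample is trimmed from the output above.

```python
import json
from datetime import datetime, timezone


def summarize_metadata(raw: str) -> dict:
    """Pick a few headline fields out of a `tables get` JSON document.

    Hypothetical helper for illustration; not the PR's implementation.
    """
    metadata = json.loads(raw)["metadata"]
    snapshots = metadata.get("snapshots", [])
    current_id = metadata.get("current-snapshot-id")
    # Locate the current snapshot, if the table has one.
    current = next((s for s in snapshots if s["snapshot-id"] == current_id), None)
    # last-updated-ms is epoch milliseconds; integer division avoids float rounding.
    last_updated = datetime.fromtimestamp(
        metadata["last-updated-ms"] // 1000, tz=timezone.utc
    )
    return {
        "location": metadata["location"],
        "format_version": metadata["format-version"],
        "snapshots": len(snapshots),
        "current_snapshot_id": current_id,
        "last_updated": last_updated.strftime("%Y-%m-%d %H:%M:%S UTC"),
        "total_records": current["summary"].get("total-records") if current else None,
    }


# Trimmed copy of the sample output above.
SAMPLE = """
{"metadata": {
  "format-version": 3,
  "location": "file:///var/tmp/quickstart_catalog/dev_namespace/sub_namespace/user",
  "last-updated-ms": 1774722865518,
  "current-snapshot-id": 201003753560339990,
  "snapshots": [
    {"snapshot-id": 201003753560339990,
     "summary": {"total-records": "1", "total-data-files": "1", "total-files-size": "5600"}}
  ]
}}
"""

print(summarize_metadata(SAMPLE))
```

Note that the snapshot summary values are strings in the JSON, so the sketch returns total_records as "1" rather than an integer.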
# table summarize
➜  polaris git:(cli_summary_subcommand_v2) ✗ ./polaris --profile dev tables summarize user --catalog quickstart_catalog --namespace dev_namespace.sub_namespace
Table: dev_namespace.sub_namespace.user
--------------------------------------------------------------------------------
Metadata
  Location:                      file:///var/tmp/quickstart_catalog/dev_namespace/sub_namespace/user
  Format Version:                3
  Snapshots:                     1
  Current Snapshot ID:           201003753560339990
  Last Updated:                  2026-03-28 18:34:25 UTC

Statistics
  Total Records:                 1
  Total Data Files:              1
  Total Files Size:              5600

Schema
  +----+------------+-------------------------------------------------------------------------------------------------+----------+---------------------+
  | ID | Field Name | Type                                                                                            | Required | Comment             |
  +----+------------+-------------------------------------------------------------------------------------------------+----------+---------------------+
  | 1  | id         | long                                                                                            | *        | Row ID              |
  | 2  | user       | struct<user_id:string, name:string, address:struct<street:string, city:string, country:string>> | *        | User info           |
  | 3  | tags       | list<string>                                                                                    |          | tags                |
  | 4  | attributes | map<string, string>                                                                             |          | User attributes     |
  | 5  | events     | list<struct<event_type:string, event_time:timestamptz, metadata:map<string, string>>>           |          | User event history  |
  | 6  | event_data | variant                                                                                         |          | User event data     |
  | 7  | category   | string                                                                                          | *        | Event category      |
  | 8  | created_at | timestamptz                                                                                     | *        | Event creation time |
  +----+------------+-------------------------------------------------------------------------------------------------+----------+---------------------+

Partitioning
  +-----------+----------------+-----------+
  | Source ID | Field Name     | Transform |
  +-----------+----------------+-----------+
  | 8         | created_at_day | day       |
  | 7         | category       | identity  |
  +-----------+----------------+-----------+

Sort order
  +-----------+-----------+-------------+-----------+
  | Source ID | Transform | Null Order  | Direction |
  +-----------+-----------+-------------+-----------+
  | 8         | identity  | nulls-last  | desc      |
  | 1         | identity  | nulls-first | asc       |
  +-----------+-----------+-------------+-----------+

Effective policies
  - orphan-file-policy (Inherited from dev_namespace)
  - snapshot-expiry-policy (Inherited from dev_namespace)
--------------------------------------------------------------------------------
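The Type column in the Schema section flattens the nested JSON types from tables get into a single string. One way this could be done is a small recursive formatter. This is only a sketch under that assumption: render_type is a hypothetical name, and the PR's actual code may differ.

```python
def render_type(t) -> str:
    """Render an Iceberg JSON type as a compact string.

    Primitive types are plain strings in the JSON; struct, list and map
    types are objects with a "type" discriminator.
    """
    if isinstance(t, str):
        return t
    kind = t["type"]
    if kind == "struct":
        fields = ", ".join(
            f'{f["name"]}:{render_type(f["type"])}' for f in t["fields"]
        )
        return f"struct<{fields}>"
    if kind == "list":
        return f'list<{render_type(t["element"])}>'
    if kind == "map":
        return f'map<{render_type(t["key"])}, {render_type(t["value"])}>'
    # Unknown nested kind: fall back to its name.
    return str(kind)


# The "attributes" column from the sample table.
attributes = {
    "type": "map",
    "key-id": 16,
    "key": "string",
    "value-id": 17,
    "value": "string",
    "value-required": False,
}
print(render_type(attributes))  # map<string, string>
```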

Setup instructions used for the above

# setup
## bootstrap
./polaris --profile dev setup apply site/content/guides/assets/polaris/reference-setup-config.yaml

## create sample table with complex types and sort order etc.
CREATE TABLE IF NOT EXISTS dev_namespace.sub_namespace.user (
    id BIGINT NOT NULL COMMENT 'Row ID',
    user STRUCT<user_id: STRING, name: STRING, address: STRUCT<street: STRING, city: STRING, country: STRING>> NOT NULL COMMENT 'User info',
    tags ARRAY<STRING> COMMENT 'tags',
    attributes MAP<STRING, STRING> COMMENT 'User attributes',
    events ARRAY<STRUCT<event_type: STRING, event_time: TIMESTAMP, metadata: MAP<STRING, STRING>>> COMMENT 'User event history',
    event_data VARIANT COMMENT 'User event data',
    category STRING NOT NULL COMMENT 'Event category',
    created_at TIMESTAMP NOT NULL COMMENT 'Event creation time'
)
USING iceberg
PARTITIONED BY (days(created_at), category)
TBLPROPERTIES ('format-version' = '3');

ALTER TABLE dev_namespace.sub_namespace.user WRITE ORDERED BY (created_at DESC, id);

INSERT INTO dev_namespace.sub_namespace.user VALUES (
  1,
  named_struct(
    'user_id', 'u1',
    'name', 'xxx',
    'address', named_struct('street', 'xxx', 'city', 'xxx', 'country', 'xx')
  ),
  array('tag1', 'tag2'),
  map('key1', 'value1'),
  array(
    named_struct(
      'event_type', 'x',
      'event_time', timestamp '2026-03-24 12:00:00',
      'metadata', map('k', 'v')
    )
  ),
  parse_json('{"dynamic_field": 123, "nested": {"a": true}}'),
  'xxx',
  timestamp '2026-03-24 12:00:00'
);

CREATE VIEW IF NOT EXISTS dev_namespace.sub_namespace.user_view AS SELECT * FROM dev_namespace.sub_namespace.user;

Checklist

  • 🛡️ Don't disclose security issues! (contact security@apache.org)
  • 🔗 Clearly explained why the changes are needed, or linked related issues: Fixes #
  • 🧪 Added/updated tests with good coverage, or manually tested (and explained how)
  • 💡 Added comments for complex logic
  • 🧾 Updated CHANGELOG.md (if needed)
  • 📚 Updated documentation in site/content/in-dev/unreleased (if needed)

@dimas-b left a comment

Nice new tools, @MonkeyCanCode 👍

Just one comment about matching logic 😅

# Subsequence match: enabled for length > 2
if query_len > 2:
    iterator = iter(t)
    if all(char in iterator for char in q):
Contributor

This will match q: "max", t: "mixed bag of exceptions", right? Is that intended?
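The snippet under review relies on the fact that `char in iterator` advances the iterator until it finds the character or runs out, so each query character must appear after the previous one. A self-contained sketch of that check follows; the function name is_subsequence is an assumption for illustration.

```python
def is_subsequence(q: str, t: str) -> bool:
    """Return True if the characters of q appear in t in order, gaps allowed."""
    iterator = iter(t)
    # Each membership test consumes the iterator up to and including the match,
    # so later characters can only match later positions in t.
    return all(char in iterator for char in q)


print(is_subsequence("max", "mixed bag of exceptions"))  # True
print(is_subsequence("xam", "mixed bag of exceptions"))  # False
```

The first call matches because "m" is found in "mixed", "a" in "bag", and "x" in "exceptions"; the second fails because no "m" follows the "a" in "bag".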

Contributor Author

Yes. As with fuzzy search in general, we don't know the total length of the target, so users can narrow the results by providing more characters. I set a 3-character minimum before fuzzy search kicks in, so that a user typing 'a' does not get back everything containing the letter 'a'.

Contributor

I'm fine with this logic if it works for you :) just wanted to make sure the behaviour was intentional :)

Contributor Author

@MonkeyCanCode Apr 1, 2026

I am more than happy to take feedback on how to handle this better, and on whether a minimum of 4 characters is too high a bar for triggering a fuzzy search. This requirement came from @flyrain; any thoughts on this route?

Contributor

@dimas-b Apr 1, 2026

TBH, I'm not sure if there are any (realistic) cases that will get a match by this rule, but not get a match by the SequenceMatcher (below) 🤔 Do you have any examples like that?

Contributor Author

This was added to avoid false-positive noise. For example, if we allowed SequenceMatcher for queries of any length, the single letter a would match anything containing the letter a.
So what I had in mind was the following:

  • len 1: only exact or prefix match
  • len 2: add substring match (q in t)
  • len 3: add subsequence match
  • len 4+: similarity ratio check via SequenceMatcher

When I tested this earlier with my setup, allowing similarity search at len 3 was too noisy, so I added the subsequence match here instead. It is not strictly necessary if slightly noisy output is acceptable.
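The tiers listed above can be sketched as a single function. This is an illustration only, not the PR's code: the name tiered_match, the lowercasing, and the 0.6 ratio threshold are assumptions, since the thread does not state the actual threshold.

```python
from difflib import SequenceMatcher


def tiered_match(query: str, target: str, ratio_threshold: float = 0.6) -> bool:
    """Apply progressively looser rules as the query gets longer.

    ratio_threshold is a hypothetical value chosen for this sketch.
    """
    q, t = query.lower(), target.lower()
    n = len(q)
    if n == 0:
        return False
    # len 1+: exact or prefix match.
    if q == t or t.startswith(q):
        return True
    # len 2+: substring match.
    if n >= 2 and q in t:
        return True
    # len 3+: subsequence match (characters in order, gaps allowed).
    if n >= 3:
        iterator = iter(t)
        if all(char in iterator for char in q):
            return True
    # len 4+: similarity ratio via SequenceMatcher.
    if n >= 4:
        return SequenceMatcher(None, q, t).ratio() >= ratio_threshold
    return False


print(tiered_match("a", "alpha"))         # True: prefix
print(tiered_match("a", "beta"))          # False: single letter, no prefix
print(tiered_match("ev", "dev_namespace"))  # True: substring
```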

Contributor

In my personal opinion, matching max to mixed bag of exceptions (the subsequence rule) is noise too 😅 TBH, I do not see "logic" behind this rule 😅

I'd use SequenceMatcher immediately if exact substring matches do not yield True, but use different thresholds depending on the query string size to reduce noise.

However, like I said, I do not feel strongly about this.
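The alternative suggested here could look roughly like the sketch below: try an exact substring match first, then fall back to SequenceMatcher with a stricter threshold for shorter queries. The threshold values and the names min_ratio and fuzzy_match are hypothetical, picked only to illustrate the idea.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds: short queries must be a closer match to count.
SHORT_QUERY_MAX_LEN = 3
SHORT_QUERY_RATIO = 0.8
DEFAULT_RATIO = 0.6


def min_ratio(query_len: int) -> float:
    """Return the minimum similarity ratio required for a query of this length."""
    return SHORT_QUERY_RATIO if query_len <= SHORT_QUERY_MAX_LEN else DEFAULT_RATIO


def fuzzy_match(query: str, target: str) -> bool:
    q, t = query.lower(), target.lower()
    if not q:
        return False
    # Exact substring match wins immediately.
    if q in t:
        return True
    # Otherwise compare similarity against a length-dependent threshold.
    return SequenceMatcher(None, q, t).ratio() >= min_ratio(len(q))


print(fuzzy_match("dev", "dev_namespace"))            # True: substring
print(fuzzy_match("max", "mixed bag of exceptions"))  # False: ratio far below 0.8
```

Because the ratio is 2*M/(len(q)+len(t)), a short query scores low against a long target, so "max" no longer matches "mixed bag of exceptions" under these assumed thresholds.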

@github-project-automation github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board Apr 1, 2026