Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
201 changes: 201 additions & 0 deletions Database-Features/zone-map-usage-optimizer-question.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,201 @@
# Why did engine decide not to use the zone map?

* even though zone map is forced and
* the WHERE clause can be used to prune data?

## Question

Even if you force zone map usage and our WHERE clause is prunable, engine might still avoid using the zone map:

We have given the following DDL:

```sql
-- data model preparation
CREATE SCHEMA TEST;

CREATE OR REPLACE TABLE TEST.T1 (time_zones timestamp, sid int);
ALTER TABLE TEST.T1 DISTRIBUTE BY sid;
```

We insert data as follows:

```sql
INSERT INTO
TEST.T1
SELECT
ADD_DAYS(TIMESTAMP '2025-06-30 00:00:00', - round(RANGE_VALUE mod 4*365,0)) as mydate
, RANGE_VALUE
FROM
VALUES BETWEEN 1 AND 1000000*200*NPROC()+1;
```

We also include the following data to better explain the engine’s decisions:

```sql
INSERT INTO TEST.T1 VALUES
('2024-01-01 00:00:00',105 ),
('2024-01-01 00:00:00',20717),
('2024-01-01 00:00:00',5),
('2025-04-18 00:00:00',111),
('2025-06-02 00:00:00',5),
('2025-06-03 00:00:00',105),
('2025-06-04 00:00:00',20717),
('2025-06-05 00:00:00',5);
```

Next we add zone map enforcement:

```sql
ENFORCE ZONEMAP ON TEST.T1 (time_zones);
```

and add an index to the SID, which in real use cases can be created by join queries:

```sql
ENFORCE LOCAL INDEX ON TEST.T1 (sid);
```

Let's run the following statement to see if the zonemap was used.

```sql
SELECT DISTINCT sid
FROM TEST.T1
WHERE time_zones between '2021-06-01 00:00:00' and '2025-06-06 00:00:00';
```

### Profiling of the execution plan without zone map usage

|PART_ID|PART_NAME|PART_INFO|OBJECT_SCHEMA|OBJECT_NAME|REMARKS|DURATION|
|---|---|---|---|---|---|---|
|1|COMPILE / EXECUTE| - | - | - | - |0.203|
|2|SCAN| - |TEST|T1| - |0.484|
|3|GROUP BY|on TEMPORARY table| - |tmp_subselect0| - |$${\color{red}0.491}$$|

Regarding [Exasol Zone maps documentation](https://docs.exasol.com/db/latest/performance/zonemaps.htm) in profiling, an operation that utilized the zone records will show WITH ZONEMAP in the PART_INFO field. Why we do not see zone map usage?

## Explanation

The engine computes the data based on the execution plan created by the compiler. During filter stages, the Engine decides on the use of zone maps.

> [!IMPORTANT]
> The engine may decide not to use the zone map because it estimates that another access path or index is more efficient.

Let's break down why this can happen in Exasol:

Let us do the following test, and lets remove the index:

```sql
DROP LOCAL INDEX ON TEST.T1 (sid);
```

> [!CAUTION]
> Only perform such manual indexing operations based on the advice and guidance of Exasol Support.

We run the query again:

```sql
SELECT DISTINCT sid
FROM TEST.T1
WHERE time_zones between '2021-06-01 00:00:00' and '2025-06-06 00:00:00';
```

### Profiling of the execution plan with zone map usage

|PART_ID|PART_NAME|PART_INFO|OBJECT_SCHEMA|OBJECT_NAME|REMARKS|DURATION|
|---|---|---|---|---|---|---|
|1|COMPILE / EXECUTE| - | - | - | - |0.667|
|2|SCAN|$${\color{red}WITH \space ZONEMAP}$$|TEST|T1|T1(TIME_ZONES)|0.117|
|3|GROUP BY|on TEMPORARY table| - |tmp_subselect0|T1(SID)|$${\color{red}5.271}$$|

> [!NOTE]
> From this example we see that this execution speeds up the SCAN from 0.484 sec to 0.117 sec but slows down the GROUP BY from 0.491 sec to 5.271 sec.

In order to explain the problem sufficiently and simply, we have simplified the execution somewhat and use the additional data sets.

### Explanation of the execution plan without zone map usage

Let us go the first execution:

The SCAN uses the index on SID to identify the best ROWID order for the GROUP BY. So the data block layout after
the SCAN is:

|ROWID|TIMES_SEGMENT_ID|TIMES_VALUE|SID_VALUE|
|---|---|---|---|
|46|1|'2020-01-01'|5|
|65|6|'2025-06-05'|5|
|7348|5|'2025-06-02'|5|
|23|1|'2020-01-01'|105|
|3|5|'2025-06-03'|105|
|1234|5|'2025-04-18'|111|
|12|1|'2020-01-01'|20717|
|128|5|'2025-06-04'|20717|

The input is ordered as above. This means, engine reads the first entries and sees TIMES_SEGMENT_ID 1, 6, 5, 1, ...
Accessing the metadata of a segment has an overhead and also consumes memory. So our engine only accesses the metadata if consecutive rows in the data block have the same SEGMENT_ID.

> [!NOTE]
> Due to that mixed up TIMES_SEGMENT_IDs, our database engine does not use zone maps for the SCAN although zone maps are available.

Now let us continue with the example.
After the PIPE FILTER filter stage down the data block layout is:

|ROWID|TIMES_SEGMENT_ID|TIMES_VALUE|SID|
|---|---|---|---|
|65|6|'2025-06-05'|$${\color{green}5}$$|
|7348|5|'2025-06-02'|$${\color{green}5}$$|
|3|5|'2025-06-03'|105|
|128|5|'2025-06-04'|20717|

> [!IMPORTANT]
> This is the perfect input for the GROUP BY and considerable speeds up the following GROUP BY computation.

$\textsf{\color{green}{Reason: GROUP BY execution is faster if entries with the same key are executed together.}}$

### Explanation of the execution plan with zone map usage

Here the SCAN scans the column TIMES and executes the filter. The scan part leads to the following data block layout:

|ROWID|TIMES_SEGMENT_ID|TIMES_VALUE|SID_VALUE|
|---|---|---|---|
|23|1|'2020-01-01'|105|
|12|1|'2020-01-01'|20717|
|46|1|'2020-01-01'|5|
|1234|5|'2025-04-18'|111|
|7348|5|'2025-06-02'|5|
|3|5|'2025-06-03'|105|
|128|5|'2025-06-04'|20717|
|65|6|'2025-06-05'|5|

Let us assume the zone maps for the TIMES_SEGEMENT_IDs (1,2,3,4) have min = '2020-01-01' and max = '2020-01-01'.
This means the filter part of the scan can remove the segments 1,2,3,4 with zone maps and only needs to filter the rest.
Thus the data block layout for the next stage after the filter part of FILTERSCAN is

|ROWID|TIMES_SEGMENT_ID|TIMES_VALUE|SID|
|---|---|---|---|
|7348|5|'2025-06-02'|$${\color{red}5}$$|
|3|5|'2025-06-03'|105|
|128|5|'2025-06-04'|20717|
|65|6|'2025-06-05'|$${\color{red}5}$$|

Now we do a GROUP BY POSTPROCESSING. Here the issue starts because GROUP BY is a time consuming
operation because the data is not already correctly pre-sorted by the scan.

> [!IMPORTANT]
> This is not the perfect input for the GROUP BY.

$\textsf{\color{red}{Reason: GROUP BY execution is slower if entries with the same key are not executed together.}}$

> [!NOTE]
> In general, GROUP BY is more time consuming than a SCAN.

## References

* [Exasol Zone Maps documentation](https://docs.exasol.com/db/latest/performance/zonemaps.htm)
* [Profiling](https://docs.exasol.com/db/latest/database_concepts/profiling.htm)

## Authors

* [Peggy Schmidt-Mittenzwei](https://github.com/PeggySchmidtMittenzwei)
* [Georg Dotzler](https://github.com/narmion)

*We appreciate your input! Share your knowledge by contributing to the Knowledge Base directly in [GitHub](https://github.com/exasol/public-knowledgebase).*