Discuss offline/batch optimizing scenarios using Spark or Flink #1868

majin1102 · 2023-08-22T02:48:50Z

majin1102
Aug 22, 2023
Collaborator

Hi community,

I'm writing a proposal to introduce offline optimizing in Amoro, and I'd love to discuss the related scenarios with the community before I complete the doc. Anyone who has a need for this is welcome to join the discussion or get involved in the design and implementation.

There are three scenarios that drive me to introduce offline optimizing in Amoro:

1. The first time an Iceberg table is loaded, especially after FlinkCDC has been writing to an Iceberg V2 table for many days, it triggers a very long optimizing process that occupies a large amount of resources in the optimizer group. In the short term, the table's quota occupancy surges to an unreasonable value, and other tables' tasks are postponed due to resource competition. Independent offline optimizing avoids occupying the optimizer group's resources and can exceed the quota limit for a short time.
2. In the Mixed Hive format, Hive snapshots are mainly promoted through Full optimizing, which has certain timeliness requirements. The resources allowed by the quota may not be sufficient to complete Full optimizing within the expected time. The scenario also applies to tagging an Iceberg table every day. For more discussion, see (I don't like this discussion title) Support Time Travel of Arctic Mixed Hive Table #833.
3. For a table of any format, we do not want a batch optimizing task to conflict with a continuous (inline/online) optimizing task. Although Iceberg can detect conflicts, a conflict causes unexpected retries and wasted resources. In addition, we hope that batch-executed optimizing tasks and ongoing tasks can be naturally connected. The ideal way is to manage them centrally in a system like Amoro.
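To make the quota surge in the first scenario concrete, here is a minimal sketch. It assumes quota occupancy is the thread-time a table actually used in a window divided by the thread-time its quota allows. That definition, the function name, and all the numbers are illustrative assumptions, not Amoro's actual implementation.

```python
def quota_occupancy(task_seconds, window_seconds, group_threads, quota):
    """Thread-time a table used, relative to the thread-time its quota allows.

    Assumed definition (hypothetical, for illustration only):
      allowed = window length * threads in the optimizer group * quota fraction
      occupancy = total task execution time / allowed
    """
    allowed = window_seconds * group_threads * quota
    return sum(task_seconds) / allowed


# A steadily optimized table: ten 60-second tasks in one hour.
steady = quota_occupancy([60] * 10, 3600, group_threads=10, quota=0.1)

# A newly loaded table with days of backlog: 8 of the 10 threads busy for the whole hour.
backlog = quota_occupancy([3600] * 8, 3600, group_threads=10, quota=0.1)
```

Under these assumptions the steady table sits at about 0.17 of its quota, while the backlogged table reaches 8.0, which is the kind of surge that pushes other tables' tasks back.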

Aireed · 2023-08-24T07:03:52Z

Aireed
Aug 24, 2023
Collaborator

1 spark. In the customer scenario (Mixed Hive format), it is essential to ingest data into the ODS layer. To ensure data timeliness, downstream tasks that rely on the ODS layer need to read data through MOR (merge-on-read). If a Spark node is introduced into the workflow for optimizing, downstream tasks can read the data directly as regular Hive tables. This saves repeated MOR resource consumption and keeps data consistent across multiple tasks (it avoids MOR reads returning different data because of task scheduling delays).

0 replies

baiyangtx · 2023-08-25T09:28:16Z

baiyangtx
Aug 25, 2023
Collaborator

In Amoro's current optimizing design, Full Optimizing is a relatively special type: it takes a long time to execute, which contradicts the design goal of maintaining good query performance during continuous writes through continuous optimizing.

I believe the essence of the optimizing scheduling problem is resources. Minor/Major Optimization occupies resources continuously, and the resources it needs are predictable. Full Optimization runs infrequently but occupies a large amount of resources each time it runs. Executing Full with the resource configuration of Minor/Major may result in a long execution time, and if Full exceeds the quota limit and occupies more resources during execution, other tables' optimization will be affected.

Therefore, Full Optimization needs to be separated from the current Optimization Group. The possible solutions are:

1. Use an independent Optimization Group.
2. Dynamically expand resources: AMS applies for resources before executing a Full type Optimization.

The disadvantage of Solution 1 is that these resources sit idle whenever Full Optimization is not running.
The disadvantage of Solution 2 is that it does not suit the external optimizing container, which is currently the most commonly used.
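The idle-resource trade-off between the two solutions can be sketched with simple arithmetic. The model below is an assumption made for illustration: a fixed number of cores, a fixed duration per Full run, and an optional startup overhead for dynamically requested resources. None of the names or numbers come from Amoro.

```python
def reserved_core_hours(cores, period_hours, runs, run_hours, dedicated, startup_hours=0.0):
    """Core-hours held for Full Optimization over one period.

    dedicated=True  models Solution 1: the group holds its cores for the whole period.
    dedicated=False models Solution 2: cores are held only while a run
                    (plus its startup time) is active.
    """
    if dedicated:
        return cores * period_hours
    return cores * runs * (run_hours + startup_hours)


def idle_core_hours(cores, period_hours, runs, run_hours, dedicated, startup_hours=0.0):
    """Reserved core-hours that do no optimizing work."""
    used = cores * runs * run_hours
    reserved = reserved_core_hours(cores, period_hours, runs, run_hours, dedicated, startup_hours)
    return reserved - used


# Hypothetical workload: 32 cores, one 2-hour Full run per day.
solution_1_idle = idle_core_hours(32, 24, 1, 2, dedicated=True)
solution_2_idle = idle_core_hours(32, 24, 1, 2, dedicated=False, startup_hours=0.25)
```

With these made-up numbers, the dedicated group leaves 704 core-hours idle per day, while dynamic expansion wastes only the 8 core-hours spent on startup. The less often Full runs, the wider that gap becomes.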

0 replies

shidayang · 2023-08-25T10:05:48Z

shidayang
Aug 25, 2023
Collaborator

My understanding is that there may be two issues here. Can we discuss them separately?
The first is to better utilize resources by temporarily launching an optimizer with a large amount of resources to perform the merge, and then releasing the optimizer after the merge is complete. I think this can be achieved with either our current optimizer service or by introducing Spark.
The second issue is to introduce Spark to perform the merge. This might be considered to better reuse Iceberg's merge code and be more user-friendly for non-Spark users.

0 replies

HuangFru · 2023-08-28T02:35:10Z

HuangFru
Aug 28, 2023
Collaborator

Meanwhile, data needs to be sorted to improve query performance, and using different sorting rules for different scenarios can improve it effectively. In the context of Full optimizing, performing optimizing and data sorting together in a batch job is a good solution. The relevant discussion can be found here: #1360.

Data sorting, as an operation, also involves rewriting files and can be accomplished together with optimizing in batch mode. This approach aligns with the characteristics of low execution frequency and high resource consumption. We should discuss it together.
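For reference, plain Iceberg tables already support rewriting and sorting in one batch pass through Iceberg's `rewrite_data_files` Spark procedure. The sketch below only builds the `CALL` statement. The helper name, the catalog name, and the table name are hypothetical, and this shows Iceberg's existing procedure, not the offline optimizing design being proposed here.

```python
def rewrite_data_files_sql(catalog, table, sort_order=None):
    """Build a CALL statement for Iceberg's rewrite_data_files Spark procedure.

    With sort_order set, the rewrite uses the 'sort' strategy, so compaction
    and sorting happen in the same batch job. Inputs are inserted as-is;
    this illustrative helper does no escaping.
    """
    args = [f"table => '{table}'"]
    if sort_order:
        args.append("strategy => 'sort'")
        args.append(f"sort_order => '{sort_order}'")
    return f"CALL {catalog}.system.rewrite_data_files({', '.join(args)})"


# Hypothetical catalog and table names.
sql = rewrite_data_files_sql("my_catalog", "db.events", "event_time ASC NULLS LAST")
```

The resulting string could be passed to `spark.sql(...)` on a session configured with an Iceberg catalog and the Iceberg SQL extensions.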

0 replies

huyuanfeng2018 · 2023-09-04T12:53:53Z

huyuanfeng2018
Sep 4, 2023
Collaborator

Should the question of when the Full task is invoked also be included in the discussion? It may be necessary to provide scheduling, or even a manual trigger.
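As a starting point for that discussion, a trigger rule combining a schedule with a manual override could look like the sketch below. The function, its parameters, and the choice to run immediately when a table has never had a Full run are all assumptions for illustration, not existing Amoro behavior.

```python
from datetime import datetime, timedelta


def should_run_full(now, last_full, interval, manual_request=False):
    """Decide whether a Full optimizing task should be invoked.

    manual_request: a user explicitly asked for a run, which overrides the schedule.
    last_full:      time of the previous Full run, or None if there has been none
                    (assumed here to mean "run now").
    interval:       minimum time between scheduled runs.
    """
    if manual_request:
        return True
    if last_full is None:
        return True
    return now - last_full >= interval


now = datetime(2023, 9, 4, 12, 0)
daily = timedelta(days=1)

due = should_run_full(now, datetime(2023, 9, 3, 11, 0), daily)
not_due = should_run_full(now, datetime(2023, 9, 4, 6, 0), daily)
forced = should_run_full(now, datetime(2023, 9, 4, 6, 0), daily, manual_request=True)
```

Here `due` and `forced` are true and `not_due` is false. Whether the schedule should be a fixed interval, a cron-like expression, or something driven by table state is left open.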

0 replies
