Replies: 5 comments
-
In Amoro's current optimization design, Full Optimizing is a relatively special type: it takes a long time to execute, which contradicts the design goal of maintaining good query performance during continuous writes through continuous optimization. I believe the essence of the optimizing scheduling problem is resources. Minor/Major Optimization occupies resources continuously, and the resources it needs are predictable. Full Optimization runs infrequently but needs a large amount of resources while it executes. Running Full with the resource configuration used for Minor/Major may make Full take a long time, and if Full exceeds the quota limit and occupies more resources during execution, other tables' optimization will be affected. Therefore, Full Optimization needs to be separated from the current Optimization Group, and the solutions are:
The disadvantage of Solution 1 is that when Full Optimization is not executed, these resources are idle.
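To make the separation idea concrete, here is a minimal sketch, assuming a dedicated group reserved for Full tasks next to the group that serves Minor/Major. The class names, group names, and thread counts are hypothetical illustrations and not Amoro's API.

```python
from dataclasses import dataclass, field
from enum import Enum


class OptimizingType(Enum):
    MINOR = "minor"
    MAJOR = "major"
    FULL = "full"


@dataclass
class OptimizerGroup:
    # Hypothetical model of a resource group; not an Amoro class.
    name: str
    threads: int                      # worker threads reserved for this group
    queue: list = field(default_factory=list)


def route(task_type: OptimizingType,
          continuous: OptimizerGroup,
          full: OptimizerGroup) -> OptimizerGroup:
    """Send Full tasks to the dedicated group, everything else to the continuous group."""
    return full if task_type is OptimizingType.FULL else continuous


# Assumed sizes: a small always-busy group and a larger group used only by Full.
continuous_group = OptimizerGroup("continuous", threads=8)
full_group = OptimizerGroup("full-dedicated", threads=32)

pending = [
    ("db.orders", OptimizingType.MINOR),
    ("db.orders", OptimizingType.FULL),
    ("db.users", OptimizingType.MAJOR),
]
for table, task_type in pending:
    route(task_type, continuous_group, full_group).queue.append((table, task_type))
```

With this routing, a Full task never competes with Minor/Major tasks for the continuous group's threads. The trade-off is the one mentioned above: whenever `full_group.queue` is empty, its threads sit idle.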
-
My understanding is that there may be two issues here. Can we discuss them separately?
-
Meanwhile, there is also a need to sort data in order to improve query performance, and applying different sorting rules to different scenarios can improve it effectively. In the context of full optimizing, doing the optimizing and the data sorting together in a batch job is a good solution; the relevant discussion can be found here: #1360. Data sorting also rewrites files, so it can be done together with optimizing in batch mode, and it has the same characteristics of low execution frequency and high resource consumption. We should discuss the two together.
-
Should the question of when the full task is invoked also be included in the discussion? It may be necessary to provide scheduling, or even manual triggering.
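As a starting point for that question, here is a minimal sketch of a trigger rule that combines a fixed schedule with a manual override. The function name, its parameters, and the choice to treat a table that has never been fully optimized as due are all assumptions for illustration, not existing Amoro behavior.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_trigger_full(now: datetime,
                        last_full: Optional[datetime],
                        interval: timedelta,
                        manual_request: bool = False) -> bool:
    """Decide whether a full optimizing run should start now (hypothetical rule)."""
    if manual_request:
        # A manual trigger always wins over the schedule.
        return True
    if last_full is None:
        # Assumption: a table with no recorded full run is considered due.
        return True
    # Scheduled trigger: run once the configured interval has elapsed.
    return now - last_full >= interval


now = datetime(2023, 6, 2, 0, 0)
daily = timedelta(days=1)
due = should_trigger_full(now, datetime(2023, 6, 1, 0, 0), daily)
not_due = should_trigger_full(now, datetime(2023, 6, 1, 12, 0), daily)
forced = should_trigger_full(now, datetime(2023, 6, 1, 12, 0), daily, manual_request=True)
```

Keeping the decision in one pure function makes it easy to swap the interval rule for something else later, such as a cron expression or a file-count threshold.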
-
Hi community,
I'm writing a proposal to introduce offline optimizing for Amoro, and I'd love to discuss the related scenarios with the community before I complete the doc. Anyone who has a need is welcome to join this discussion or get involved in the design and implementation.
There are three scenarios that drive me to introduce offline optimizing in Amoro:
1. The first time an Iceberg table is loaded, especially an Iceberg V2 table that has been written to for many days with FlinkCDC, a very long optimizing process is triggered that occupies a large amount of resources in the optimizer group. In the short term, the quota occupancy of this table surges to an unreasonable value, causing other tables' tasks to be postponed because of resource competition. An independent offline optimizing process avoids occupying the optimizer group's resources and can exceed the quota limit for a short time.
2. In the Mixed Hive format, Hive snapshots are mainly promoted through Full optimizing, which has certain timeliness requirements. The resources allowed by the quota may not be sufficient to complete full optimizing within the expected time. The same scenario applies to tagging an Iceberg table every day. For more discussion, see (I don't like this discussion title) Support Time Travel of Arctic Mixed Hive Table #833.
3. For a table of any format, we do not expect a batch optimizing task to conflict with a continuous optimizing task (inline/online). Although Iceberg can detect conflicts, a conflict still causes unexpected retries and wasted resources. In addition, we hope that batch-executed optimizing tasks and ongoing tasks can be connected naturally. The ideal way is to manage them centrally in a system like Amoro.
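To illustrate what central management could mean for the last scenario, here is a minimal sketch of a per-table coordinator that admits only one optimizing mode per table at a time, so a batch run and a continuous run never commit against each other. The class, method names, and mode strings are hypothetical; this is a sketch of the idea under those assumptions, not Amoro's implementation.

```python
import threading
from typing import Dict


class TableOptimizingCoordinator:
    """Hypothetical registry that lets one optimizing mode own a table at a time."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._running: Dict[str, str] = {}   # table name -> mode that owns it

    def try_acquire(self, table: str, mode: str) -> bool:
        """Return True if `mode` may start on `table`, False if another run owns it."""
        with self._lock:
            if table in self._running:
                return False
            self._running[table] = mode
            return True

    def release(self, table: str, mode: str) -> None:
        """Free the table, but only if `mode` is the current owner."""
        with self._lock:
            if self._running.get(table) == mode:
                del self._running[table]


coordinator = TableOptimizingCoordinator()
batch_started = coordinator.try_acquire("db.orders", "batch")
continuous_blocked = not coordinator.try_acquire("db.orders", "continuous")
other_table_ok = coordinator.try_acquire("db.users", "continuous")

coordinator.release("db.orders", "batch")
continuous_resumed = coordinator.try_acquire("db.orders", "continuous")
```

Rejecting the second run up front replaces a commit-time conflict and retry with a cheap check. Once the batch run releases the table, the continuous run picks it up again, which is the natural hand-off described above.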