-
Thanks for driving this discussion. For the first part: in my scenarios it works well for most tables, but there are still some problems in abnormal situations.
For the second part: if the total size of the input delete files is controllable, out-of-memory situations can be effectively avoided.
-
I have an idea 💡 for scenarios with a lot of eq-delete files: when the total eq-delete record count is greater than (maybe) 1.5 times the total data record count, we write the data files' primary keys into the StructLikeMap instead. When reading an eq delete, we check whether its key exists in the data; if it does not, we can ignore it directly (currently every eq delete is written to the eq-delete StructLikeMap), which can greatly reduce the overflow (spill) operations for eq deletes. In this way the size of the StructLikeMap is bounded by the data files, so the memory usage is also controllable, depending on the size of the data files. WDYT?
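A minimal sketch of that inversion, assuming Iceberg's StructLikeSet utility (a set of primary keys is enough for the illustration, though the comment mentions StructLikeMap); the class and method names here are hypothetical, not the project's actual code:

```java
import org.apache.iceberg.StructLike;
import org.apache.iceberg.types.Types;
import org.apache.iceberg.util.StructLikeSet;

class EqDeleteFilterSketch {
  // Primary keys that actually appear in the data files; its size is bounded
  // by the data record count, not by the (larger) eq-delete record count.
  private final StructLikeSet dataKeys;

  EqDeleteFilterSketch(Types.StructType keyType) {
    this.dataKeys = StructLikeSet.create(keyType);
  }

  // Called once per data record while scanning the data files.
  void indexDataKey(StructLike key) {
    dataKeys.add(key);
  }

  // An eq-delete key that never appears in the data files cannot delete
  // anything, so it can be dropped instead of being indexed or spilled.
  boolean isRelevant(StructLike eqDeleteKey) {
    return dataKeys.contains(eqDeleteKey);
  }
}
```

With this, only the relevant eq-delete keys would need to be kept (or spilled), and the in-memory footprint is decided by the data files rather than by the delete files.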
-
Iceberg's optimizing task requires indexing the delete data in memory first. If a table has too much delete data, the optimizer may run out of memory, so we introduced rocksdb to solve the problem of too many delete files.
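For context, here is a minimal sketch, assuming the RocksDB Java client (org.rocksdb), of what spilling the delete index to disk instead of the heap can look like; this is an illustration, not the project's actual implementation, and key serialization is left out:

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

class RocksDbDeleteIndexSketch implements AutoCloseable {
  static { RocksDB.loadLibrary(); }

  private static final byte[] MARKER = new byte[0];
  private final Options options = new Options().setCreateIfMissing(true);
  private final RocksDB db;

  RocksDbDeleteIndexSketch(String path) throws RocksDBException {
    this.db = RocksDB.open(options, path);
  }

  // Record a deleted key; the index lives on disk, so its size is no longer
  // limited by the optimizer's heap.
  void addDeleteKey(byte[] serializedKey) throws RocksDBException {
    db.put(serializedKey, MARKER);
  }

  // Check whether a data record's (serialized) key has been deleted.
  boolean isDeleted(byte[] serializedKey) throws RocksDBException {
    return db.get(serializedKey) != null;
  }

  @Override
  public void close() {
    db.close();
    options.close();
  }
}
```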
This discussion focuses on two parts:
The first part is to collect everyone's feedback on the effectiveness of using rocksdb to solve the problem of large delete files in Iceberg tables, and on any issues encountered.
The second part is to explore other ways to optimize besides introducing rocksdb. One approach is to iterate optimization from historical snapshots, which prevents reading too much delete data at once and causing an OOM; a sketch follows below.
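As a rough sketch of that second approach, assuming the Iceberg Table API: walk the snapshots in commit order and optimize one snapshot's worth of state at a time, so that each pass only has to index the delete files visible at that snapshot, and earlier passes can compact deletes before later ones run. planAndOptimize() is a hypothetical hook into the optimizing task, not a real API:

```java
import java.io.IOException;
import java.io.UncheckedIOException;

import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

class IncrementalOptimizeSketch {

  void optimize(Table table) {
    // Process historical snapshots one at a time instead of the whole current
    // table state, so the delete data indexed per pass stays bounded.
    for (Snapshot snapshot : table.snapshots()) {
      try (CloseableIterable<FileScanTask> tasks =
               table.newScan().useSnapshot(snapshot.snapshotId()).planFiles()) {
        for (FileScanTask task : tasks) {
          planAndOptimize(task);
        }
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }
  }

  private void planAndOptimize(FileScanTask task) {
    // Hypothetical: submit this slice of work to the optimizer.
  }
}
```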