Fix Hive FileIO closing with FileIOTracker #11782

tom-s-powell · 2024-12-13T17:08:07Z

Fixes #11783. HiveCatalog currently uses a single FileIO instance that is passed to each newly instantiated HiveTableOperations/HiveViewOperations. This issue is FileIOTracker tracks each operation and results in the underlying FileIO being closed even when it is then going to be reused for future newTableOps calls.

This change will instead initialize a new FileIO instance so that they are 1:1 with HiveTableOperations/HiveViewOperations, thus safe to close when the operations go out of scope.

Fokko · 2024-12-23T00:25:12Z

Thanks for working on this @tom-s-powell. I was running into the same issue as you, see #11858.

While testing your solution, I noticed that my logs are flooded with:

24/12/23 00:23:14 WARN S3FileIO: Unclosed S3FileIO instance created by:
	org.apache.iceberg.aws.s3.S3FileIO.initialize(S3FileIO.java:371)
	org.apache.iceberg.CatalogUtil.loadFileIO(CatalogUtil.java:369)
	org.apache.iceberg.hive.HiveCatalog.lambda$initialize$1(HiveCatalog.java:126)
	org.apache.iceberg.hive.HiveCatalog.newViewOps(HiveCatalog.java:678)
	org.apache.iceberg.view.BaseMetastoreViewCatalog.loadView(BaseMetastoreViewCatalog.java:53)
	org.apache.iceberg.catalog.ViewCatalog.viewExists(ViewCatalog.java:65)
	org.apache.iceberg.hive.HiveCatalog$ViewAwareTableBuilder.createOrReplaceTransaction(HiveCatalog.java:837)
	org.apache.iceberg.CachingCatalog$CachingTableBuilder.createOrReplaceTransaction(CachingCatalog.java:300)
	org.apache.iceberg.spark.SparkCatalog.stageCreateOrReplace(SparkCatalog.java:301)
	org.apache.spark.sql.connector.catalog.StagingTableCatalog.stageCreateOrReplace(StagingTableCatalog.java:191)
	org.apache.spark.sql.execution.datasources.v2.AtomicReplaceTableAsSelectExec.run(WriteToDataSourceV2Exec.scala:207)
	org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
	org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
	org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
	org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:107)
	org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125)
	org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201)
	org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:108)
	org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
	org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66)
	org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:107)
	org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
	org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:461)
	org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76)
	org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:461)
	org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32)
	org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
	org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
	org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
	org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
	org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:437)
	org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:98)
	org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:85)
	org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:83)
	org.apache.spark.sql.Dataset.<init>(Dataset.scala:220)
	org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
	org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
	org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
	org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:638)
	org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
	org.apache.spark.sql.SparkSession.sql(SparkSession.scala:629)
	org.apache.spark.sql.SparkSession.sql(SparkSession.scala:659)
	jdk.internal.reflect.GeneratedMethodAccessor95.invoke(Unknown Source)
	java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	java.base/java.lang.reflect.Method.invoke(Method.java:566)
	py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	py4j.Gateway.invoke(Gateway.java:282)
	py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	py4j.commands.CallCommand.execute(CallCommand.java:79)
	py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	java.base/java.lang.Thread.run(Thread.java:829)
24/12/23 00:23:15 WARN S3FileIO: Unclosed S3FileIO instance created by:
	org.apache.iceberg.aws.s3.S3FileIO.initialize(S3FileIO.java:371)
	org.apache.iceberg.CatalogUtil.loadFileIO(CatalogUtil.java:369)
	org.apache.iceberg.hive.HiveCatalog.lambda$initialize$1(HiveCatalog.java:126)
	org.apache.iceberg.hive.HiveCatalog.newViewOps(HiveCatalog.java:678)
	org.apache.iceberg.view.BaseMetastoreViewCatalog.loadView(BaseMetastoreViewCatalog.java:53)
	org.apache.iceberg.catalog.ViewCatalog.viewExists(ViewCatalog.java:65)
	org.apache.iceberg.hive.HiveCatalog$ViewAwareTableBuilder.createOrReplaceTransaction(HiveCatalog.java:837)
	org.apache.iceberg.CachingCatalog$CachingTableBuilder.createOrReplaceTransaction(CachingCatalog.java:300)
	org.apache.iceberg.spark.SparkCatalog.stageCreateOrReplace(SparkCatalog.java:301)
	org.apache.spark.sql.connector.catalog.StagingTableCatalog.stageCreateOrReplace(StagingTableCatalog.java:191)
	org.apache.spark.sql.execution.datasources.v2.AtomicReplaceTableAsSelectExec.run(WriteToDataSourceV2Exec.scala:207)
	org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
	org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
	org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
	org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:107)
	org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125)
	org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201)
	org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:108)
	org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
	org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66)
	org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:107)
	org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
	org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:461)
	org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76)
	org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:461)
	org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32)
	org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
	org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
	org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
	org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
	org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:437)
	org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:98)
	org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:85)
	org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:83)
	org.apache.spark.sql.Dataset.<init>(Dataset.scala:220)
	org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
	org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
	org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
	org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:638)
	org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
	org.apache.spark.sql.SparkSession.sql(SparkSession.scala:629)
	org.apache.spark.sql.SparkSession.sql(SparkSession.scala:659)
	jdk.internal.reflect.GeneratedMethodAccessor95.invoke(Unknown Source)
	java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	java.base/java.lang.reflect.Method.invoke(Method.java:566)
	py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	py4j.Gateway.invoke(Gateway.java:282)
	py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	py4j.commands.CallCommand.execute(CallCommand.java:79)
	py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	java.base/java.lang.Thread.run(Thread.java:829)
24/12/23 00:23:15 WARN S3FileIO: Unclosed S3FileIO instance created by:
	org.apache.iceberg.aws.s3.S3FileIO.initialize(S3FileIO.java:371)
	org.apache.iceberg.CatalogUtil.loadFileIO(CatalogUtil.java:369)
	org.apache.iceberg.hive.HiveCatalog.lambda$initialize$1(HiveCatalog.java:126)
	org.apache.iceberg.hive.HiveCatalog.newViewOps(HiveCatalog.java:678)
	org.apache.iceberg.view.BaseMetastoreViewCatalog.loadView(BaseMetastoreViewCatalog.java:53)
	org.apache.iceberg.catalog.ViewCatalog.viewExists(ViewCatalog.java:65)
	org.apache.iceberg.hive.HiveCatalog$ViewAwareTableBuilder.createOrReplaceTransaction(HiveCatalog.java:837)
	org.apache.iceberg.CachingCatalog$CachingTableBuilder.createOrReplaceTransaction(CachingCatalog.java:300)
	org.apache.iceberg.spark.SparkCatalog.stageCreateOrReplace(SparkCatalog.java:301)
	org.apache.spark.sql.connector.catalog.StagingTableCatalog.stageCreateOrReplace(StagingTableCatalog.java:191)
	org.apache.spark.sql.execution.datasources.v2.AtomicReplaceTableExec.run(ReplaceTableExec.scala:75)
	org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
	org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
	org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
	org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:107)
	org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125)
	org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201)
	org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:108)
	org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
	org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66)
	org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:107)
	org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
	org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:461)
	org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76)
	org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:461)
	org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32)
	org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
	org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
	org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
	org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
	org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:437)
	org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:98)
	org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:85)
	org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:83)
	org.apache.spark.sql.Dataset.<init>(Dataset.scala:220)
	org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
	org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
	org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
	org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:638)
	org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
	org.apache.spark.sql.SparkSession.sql(SparkSession.scala:629)
	org.apache.spark.sql.SparkSession.sql(SparkSession.scala:659)
	jdk.internal.reflect.GeneratedMethodAccessor95.invoke(Unknown Source)
	java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	java.base/java.lang.reflect.Method.invoke(Method.java:566)
	py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	py4j.Gateway.invoke(Gateway.java:282)
	py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	py4j.commands.CallCommand.execute(CallCommand.java:79)
	py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	java.base/java.lang.Thread.run(Thread.java:829)

Did you try your fix in the wild? (I must admit it is not very easy to test).

github-actions · 2025-01-23T00:14:41Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

github-actions · 2025-01-30T00:15:10Z

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

Fix Hive FileIO closing with FileIOTracker.

3b89e68

github-actions bot added the hive label Dec 13, 2024

remove unused

5943fed

github-actions bot added the stale label Jan 23, 2025

github-actions bot closed this Jan 30, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix Hive FileIO closing with FileIOTracker #11782

Fix Hive FileIO closing with FileIOTracker #11782

Uh oh!

tom-s-powell commented Dec 13, 2024 •

edited

Loading

Uh oh!

Fokko commented Dec 23, 2024 •

edited

Loading

Uh oh!

github-actions bot commented Jan 23, 2025

Uh oh!

github-actions bot commented Jan 30, 2025

Uh oh!

Uh oh!

Fix Hive FileIO closing with FileIOTracker #11782

Fix Hive FileIO closing with FileIOTracker #11782

Uh oh!

Conversation

tom-s-powell commented Dec 13, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Fokko commented Dec 23, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Jan 23, 2025

Uh oh!

github-actions bot commented Jan 30, 2025

Uh oh!

Uh oh!

tom-s-powell commented Dec 13, 2024 •

edited

Loading

Fokko commented Dec 23, 2024 •

edited

Loading