Receiving TimerAlreadyCanceledException in TwoPhaseHASCO when running MLPlan #259

fmohr opened this issue Jul 6, 2021 · 1 comment
fmohr commented Jul 6, 2021

I am observing this error when running MLPlan in cluster experiments:

	Error message: Timer already cancelled.
	Error trace:
		java.util.Timer.sched(Timer.java:397)
		java.util.Timer.scheduleAtFixedRate(Timer.java:328)
		ai.libs.jaicore.concurrent.TrackableTimer.scheduleAtFixedRate(TrackableTimer.java:135)
		ai.libs.hasco.twophase.TwoPhaseHASCO.nextWithException(TwoPhaseHASCO.java:195)
		ai.libs.jaicore.basic.algorithm.AOptimizer.call(AOptimizer.java:134)
		ai.libs.jaicore.components.optimizingfactory.OptimizingFactory.nextWithException(OptimizingFactory.java:63)
		ai.libs.jaicore.components.optimizingfactory.OptimizingFactory.call(OptimizingFactory.java:80)
		ai.libs.mlplan.core.MLPlan.nextWithException(MLPlan.java:258)
		ai.libs.mlplan.core.MLPlan.call(MLPlan.java:291)
		naiveautoml.experiments.NaiveAutoMLExperimentRunner.evaluate(NaiveAutoMLExperimentRunner.java:217)
		ai.libs.jaicore.experiments.ExperimentRunner.conductExperiment(ExperimentRunner.java:217)
		ai.libs.jaicore.experiments.ExperimentRunner.lambda$randomlyConductExperiments$0(ExperimentRunner.java:104)
		java.lang.Thread.run(Thread.java:748)

Logs show that this stack trace is immediately followed by an indication of memory overflow:

java.lang.OutOfMemoryError: Java heap space

One dataset where this occurred was the DNA dataset (https://www.openml.org/d/40670), run with 24 GB of memory.
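For context, the message comes from the JDK itself: java.util.Timer.sched throws IllegalStateException("Timer already cancelled.") whenever a task is scheduled after cancel() has been called, which matches the top frames of the trace above. A plausible (but unconfirmed) chain is that the timer was cancelled, e.g. during cleanup after the memory overflow, while TwoPhaseHASCO still tried to schedule its task. A minimal standalone reproduction of just the JDK behavior:

```java
import java.util.Timer;
import java.util.TimerTask;

public class TimerCancelDemo {
    public static void main(String[] args) {
        Timer timer = new Timer();
        timer.cancel(); // e.g. cancelled by some cleanup/shutdown path
        try {
            // Any subsequent scheduling attempt fails inside Timer.sched
            timer.scheduleAtFixedRate(new TimerTask() {
                @Override public void run() { /* never runs */ }
            }, 0, 1000);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // prints "Timer already cancelled."
        }
    }
}
```

This only illustrates the JDK mechanism; it does not show which component cancelled the TrackableTimer in the failing runs.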

The following message directly preceding the exception suggests that the error occurred when training a BayesNet:

2021-06-01 17:22:03.846 [ORGraphSearch-worker-1] INFO executor - Fitting the learner (class: ai.libs.mlplan.core.TimeTrackingLearnerWrapper) ai.libs.mlplan.core.TimeTrackingLearnerWrapper -
2021-06-01 17:23:03.691 [Global Timer] INFO InterruptionTimerTask - Executing interruption task 1293092700 with descriptor "Timeout for timed computation with thread Thread[ORGraphSearch-wo
2021-06-01 17:23:03.693 [Global Timer] INFO Interrupter - Interrupting Thread[ORGraphSearch-worker-1,5,main] on behalf of Thread[Global Timer,10,main] with reason InterruptionTimerTask [thr
2021-06-01 17:23:03.694 [Global Timer] INFO Interrupter - Interrupt accomplished. Interrupt flag of Thread[ORGraphSearch-worker-1,5,main]: true
2021-06-01 17:23:03.833 [Global Timer] INFO InterruptionTimerTask - Executing interruption task 1024325039 with descriptor "Timeout for timed computation with thread Thread[ORGraphSearch-wo
2021-06-01 17:23:03.834 [Global Timer] INFO Interrupter - Interrupting Thread[ORGraphSearch-worker-1,5,main] on behalf of Thread[Global Timer,10,main] with reason InterruptionTimerTask [thr
2021-06-01 17:23:03.835 [Global Timer] INFO Interrupter - Interrupt accomplished. Interrupt flag of Thread[ORGraphSearch-worker-1,5,main]: true

The question is really whether this can be avoided without spawning external processes.

fmohr added the question label Jul 6, 2021

fmohr commented Jul 6, 2021

Thinking more about this, I believe there is really no solution to this problem except spawning a new process. The problem with new processes, though, is that one needs to reserve a good deal of memory for each of them to avoid problems, which can easily become a waste of resources.

Probably the best solution is to introduce an option that allows running ML-Plan in a separate-process mode when there is an anticipated risk of memory overflows.

More generally, it would be nice to extend the process project of AILibs with the ability to execute objects that implement both Callable&lt;T extends Serializable&gt; and Serializable in a separate process with specific resource limits. One could then have a general executor for such operations that serializes the object to be executed and launches a new JVM, which deserializes the object, calls it, and serializes the resulting T into an output file that the original process can deserialize again.
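The serialize/deserialize-and-call core of that idea can be sketched as below. All names here (AddTask, serialize, deserializeAndCall) are hypothetical and purely illustrative; in the actual design the byte payload would cross a process boundary, i.e. be written to a file and read back by a fresh JVM launched via ProcessBuilder with its own -Xmx limit, rather than staying in-process as in this demo:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.concurrent.Callable;

public class SerializedCallDemo {

    // A task that is both Callable and Serializable, as the proposal requires.
    static class AddTask implements Callable<Integer>, Serializable {
        private static final long serialVersionUID = 1L;
        private final int a, b;
        AddTask(int a, int b) { this.a = a; this.b = b; }
        @Override public Integer call() { return a + b; }
    }

    // Parent-side step: turn the task into a byte payload.
    static byte[] serialize(Serializable obj) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        }
        return bos.toByteArray();
    }

    // Child-side step: reconstruct the task and execute it.
    @SuppressWarnings("unchecked")
    static <T> T deserializeAndCall(byte[] bytes) throws Exception {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            Callable<T> task = (Callable<T>) ois.readObject();
            return task.call();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] payload = serialize(new AddTask(19, 23));
        // In the proposed executor, `payload` would instead be handed to a
        // fresh JVM (e.g. via ProcessBuilder with a strict -Xmx), which would
        // run deserializeAndCall and write the serialized result to a file.
        Integer result = SerializedCallDemo.<Integer>deserializeAndCall(payload);
        System.out.println(result); // prints 42
    }
}
```

One design consequence worth noting: because the child JVM gets its own heap, an OutOfMemoryError there can only kill the child, so the parent's timers and threads stay intact, which is exactly what the in-process failure above cannot guarantee.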
