Fix Map/Parallel States checking container_runner#status! of finished states #296

agrare · 2024-11-21T18:07:34Z

If a state has already finished, Map/Parallel states still re-check whether it is ready, which causes us to check the status of a container that has already been removed.

This leads to: kubeclient-4.12.0/lib/kubeclient/common.rb:130:in `rescue in handle_exception': pods "floe-sleep-f9149969" not found (Kubeclient::ResourceNotFoundError)

$ exe/floe --kubernetes --container-runner-options=namespace=manageiq examples/parallel.asl
I, [2024-11-21T13:06:50.967364 #295618]  INFO -- : Checking 1 workflows...
I, [2024-11-21T13:06:50.967503 #295618]  INFO -- : Running state: [Parallel:FunWithMath] with input [{}]...
I, [2024-11-21T13:06:50.967751 #295618]  INFO -- : Running state: [Task:Add] with input [{}]...
I, [2024-11-21T13:06:51.053196 #295618]  INFO -- : Running state: [Task:Subtract] with input [{}]...
I, [2024-11-21T13:07:03.315566 #295618]  INFO -- : Running state: [Task:Add] with input [{}]...Complete workflow - output: [{}]
/home/grare/adam/.gem/ruby/3.3.0/gems/kubeclient-4.12.0/lib/kubeclient/common.rb:130:in `rescue in handle_exception': pods "floe-sleep-2de8e6e1" not found (Kubeclient::ResourceNotFoundError)
	from /home/grare/adam/.gem/ruby/3.3.0/gems/kubeclient-4.12.0/lib/kubeclient/common.rb:120:in `handle_exception'
	from /home/grare/adam/.gem/ruby/3.3.0/gems/kubeclient-4.12.0/lib/kubeclient/common.rb:368:in `get_entity'
	from /home/grare/adam/.gem/ruby/3.3.0/gems/kubeclient-4.12.0/lib/kubeclient/common.rb:244:in `block (2 levels) in define_entity_methods'
	from /home/grare/adam/src/manageiq/floe/lib/floe/container_runner/kubernetes.rb:157:in `pod_info'
	from /home/grare/adam/src/manageiq/floe/lib/floe/container_runner/kubernetes.rb:69:in `status!'
	from /home/grare/adam/src/manageiq/floe/lib/floe/workflow/states/task.rb:66:in `running?'
	from /home/grare/adam/src/manageiq/floe/lib/floe/workflow/state.rb:90:in `ready?'
	from /home/grare/adam/src/manageiq/floe/lib/floe/workflow_base.rb:46:in `step_nonblock_ready?'
	from /home/grare/adam/src/manageiq/floe/lib/floe/workflow/states/parallel.rb:50:in `block in step_nonblock!'
	from /home/grare/adam/src/manageiq/floe/lib/floe/workflow/states/parallel.rb:49:in `each'
	from /home/grare/adam/src/manageiq/floe/lib/floe/workflow/states/parallel.rb:49:in `step_nonblock!'
	from /home/grare/adam/src/manageiq/floe/lib/floe/workflow/states/child_workflow_mixin.rb:10:in `run_nonblock!'
	from /home/grare/adam/src/manageiq/floe/lib/floe/workflow.rb:111:in `step_nonblock'
	from /home/grare/adam/src/manageiq/floe/lib/floe/workflow.rb:103:in `run_nonblock'
	from /home/grare/adam/src/manageiq/floe/lib/floe/workflow.rb:37:in `each'
	from /home/grare/adam/src/manageiq/floe/lib/floe/workflow.rb:37:in `block in wait'
	from <internal:kernel>:187:in `loop'
	from /home/grare/adam/src/manageiq/floe/lib/floe/workflow.rb:33:in `wait'
	from /home/grare/adam/src/manageiq/floe/lib/floe/cli.rb:29:in `run'
	from exe/floe:7:in `<main>'

This was only seen on Kubernetes/OpenShift because the Docker and Podman runners were swallowing the exception and returning nil.
Since I don't think we should check the status of a container that we have knowingly deleted, checking container status and getting an error back should raise a Floe::ExecutionError.
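
To illustrate the error-handling idea above, here is a minimal sketch of translating a "not found" error from a client into a domain-level execution error rather than letting it propagate raw or swallowing it. Everything in it is a hypothetical stand-in: `Sketch::ExecutionError` plays the role of Floe::ExecutionError, `Sketch::NotFoundError` plays the role of Kubeclient::ResourceNotFoundError, and `FakeClient` and `Runner` are not Floe's or kubeclient's actual classes.

```ruby
# Hypothetical stand-ins for illustration only; none of these are Floe or kubeclient classes.
module Sketch
  class ExecutionError < StandardError; end # plays the role of Floe::ExecutionError
  class NotFoundError < StandardError; end  # plays the role of Kubeclient::ResourceNotFoundError

  # A fake client backed by an in-memory Hash of pod name => pod info.
  class FakeClient
    def initialize(pods)
      @pods = pods
    end

    def get_pod(name)
      @pods.fetch(name) { raise NotFoundError, "pods \"#{name}\" not found" }
    end
  end

  class Runner
    def initialize(client)
      @client = client
    end

    # Returns the pod info, or raises ExecutionError if the pod no longer exists.
    def status!(pod_name)
      @client.get_pod(pod_name)
    rescue NotFoundError => err
      raise ExecutionError, "Failed to get status for pod #{pod_name}: #{err.message}"
    end
  end
end
```

With this shape, a caller that asks for the status of a deleted container gets one well-defined error type regardless of which backend is in use, instead of a backend-specific exception on one runner and nil on another.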

agrare · 2024-11-21T18:12:10Z

lib/floe/workflow/states/task.rb

@@ -61,7 +61,8 @@ def finish(context)
        end

        def running?(context)
-          return true if waiting?(context)
+          return true  if waiting?(context)
+          return false if finished?(context)


This is the fix for this issue. I'd also like to make sure we don't check running? for a sub-workflow that we know to be finished, but this is also good insurance at the Task level.
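
The effect of the guard can be sketched in isolation. This is a simplified illustration, not the actual Task state: `SketchTask` is hypothetical, the context is assumed to be a plain Hash, the `"WaitUntil"` and `"FinishedTime"` keys are assumed markers for waiting and finished states, and a counter stands in for the container_runner#status! call.

```ruby
# Hypothetical simplification of a task state; the context is a plain Hash here.
class SketchTask
  attr_reader :status_checks

  def initialize
    @status_checks = 0 # counts how often the (stand-in) container status is polled
  end

  def waiting?(context)
    context.key?("WaitUntil") # assumed marker for a state waiting on a retry/timer
  end

  def finished?(context)
    context.key?("FinishedTime") # assumed marker for a state that has completed
  end

  def running?(context)
    return true  if waiting?(context)
    return false if finished?(context)

    # Only reached for states that are neither waiting nor finished.
    @status_checks += 1 # stands in for the container_runner#status! call
    true
  end
end
```

Because the finished check returns early, a parent Map/Parallel state can ask a finished task whether it is running any number of times without the removed container ever being polled.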

Fryguy · 2024-11-21T19:34:52Z

lib/floe/container_runner/docker.rb

@@ -182,8 +182,8 @@ def docker_event_status_to_event(status)

      def inspect_container(container_id)
        JSON.parse(docker!("inspect", container_id).output).first
-      rescue
-        nil
+      rescue AwesomeSpawn::CommandResultError => err


Not for this PR, but it's weird to me that the docker! method fails and we catch AwesomeSpawn::CommandResultError here, as that requires the caller to know the implementation of the callee. Would it make more sense to have docker! do the rescue and raise something?

That being said, it's all self contained in the same file, so probably not a big deal.
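
One possible shape for that suggestion, as a rough sketch only: the low-level helper rescues the command failure and raises a runner-level error, so callers such as inspect_container never name the underlying library's exception. All names here are hypothetical: `DockerSketch::CommandError` stands in for AwesomeSpawn::CommandResultError, `DockerSketch::RunnerError` is an invented domain error, and the command is injected as a block instead of shelling out.

```ruby
require "json"

# Hypothetical sketch; these are not Floe's or AwesomeSpawn's actual classes.
module DockerSketch
  class CommandError < StandardError; end # plays the role of AwesomeSpawn::CommandResultError
  class RunnerError < StandardError; end  # invented domain-level error

  class Runner
    # The block stands in for actually running the docker CLI and returning its output.
    def initialize(&command)
      @command = command
    end

    # The caller no longer needs to know which exception the command layer raises.
    def inspect_container(container_id)
      JSON.parse(docker!("inspect", container_id)).first
    end

    private

    # The only place that knows about the command layer's exception type.
    def docker!(*args)
      @command.call(*args)
    rescue CommandError => err
      raise RunnerError, "docker #{args.join(" ")} failed: #{err.message}"
    end
  end
end
```

The trade-off is one more error class in exchange for keeping knowledge of the command layer in a single method.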

agrare added 3 commits November 21, 2024 13:03

Raise an exception if the container can't be found

88d5f58

Add State#started?,#finished? helpers

5d16097

Don't call container_runner#status! if state#finished?

162147f

agrare added the bug Something isn't working label Nov 21, 2024

agrare requested a review from Fryguy as a code owner November 21, 2024 18:07

agrare requested review from kbrock and removed request for Fryguy November 21, 2024 18:08

agrare commented Nov 21, 2024

agrare mentioned this pull request Nov 21, 2024

Fix child workflow mixin tight loop #297

Merged

kbrock merged commit 589d253 into ManageIQ:master Nov 21, 2024
5 checks passed

agrare deleted the fix_map_parallel_checking_finished_states branch November 21, 2024 19:14

Fryguy reviewed Nov 21, 2024

agrare added a commit that referenced this pull request Nov 21, 2024

Release v0.15.1

bd0d1cf

Fixed
- Fix Map/Parallel States checking container_runner#status! of finished states (#296)
- Fix child workflow mixin tight loop (#297)
