Map state add tolerated failure #282

agrare · 2024-10-07T19:10:07Z

Adds checking of a Map state's child iterations for failure and, based on ToleratedFailureCount/ToleratedFailurePercentage, marks the Map state as failed or successful.

kbrock · 2024-10-08T15:48:30Z

lib/floe/workflow/states/task.rb

+require_relative "input_output_mixin"
+require_relative "non_terminal_mixin"
+require_relative "retry_catch_mixin"


Feels like these belong with the other requires in the global floe.rb

Yes, it is there as well but keeping those alphabetic means this fails to resolve the constant.
Alternatively, I can move all of the _mixin requires above the others in floe.rb

I like making the require not alphabetical.

I like the require being in the place that needs it and not trying to solve a dependency graph in floe.rb 😆

huh.

We have been putting all requires up front in floe.rb.
(I did not just say: we have been doing it this way and we have to continue doing that)

Do we want to get away from that?

wondering if InputOutputMixin and the others do not belong in /states/.
That way we can just require these and then all states and not have to deal with orders.

Does look like these 3 mixins are used by only states, so alternatively we can just move the mixins up front.

require_relative "floe/workflow/state"
require_relative "floe/workflow/states/input_output_mixin"
require_relative "floe/workflow/states/non_terminal_mixin"
# require_relative "floe/workflow/states/*_mixin"
require_relative "floe/workflow/states/choice"
require_relative "floe/workflow/states/fail"
require_relative "floe/workflow/states/map"
require_relative "floe/workflow/states/parallel"
require_relative "floe/workflow/states/pass"
require_relative "floe/workflow/states/succeed"
require_relative "floe/workflow/states/task"
require_relative "floe/workflow/states/wait"

We have been putting all requires up front in floe.rb.
(I did not just say: we have been doing it this way and we have to continue doing that)

No, I don't think we need to stop doing that. I think of require 'floe' as "just require everything", but if a particular class needs something specific it should require it. That way a caller could require 'floe/workflow/context' without having to bring the entire gem into memory.
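To illustrate the "require it where it is needed" position, here is a minimal sketch. The file names demo_mixin.rb and demo_task.rb and the DemoMixin/DemoTask constants are hypothetical stand-ins, not files from floe. The point is only that a file which requires its own mixin can be loaded on its own, regardless of the order of requires elsewhere.

```ruby
require "tmpdir"

# Hypothetical two-file layout written to a temp dir so the sketch is runnable.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, "demo_mixin.rb"), <<~RUBY)
    module DemoMixin
      def retryable?
        true
      end
    end
  RUBY

  # The class file pulls in the mixin it includes, so load order in a
  # top-level "require everything" file no longer matters for this constant.
  File.write(File.join(dir, "demo_task.rb"), <<~RUBY)
    require_relative "demo_mixin"

    class DemoTask
      include DemoMixin
    end
  RUBY

  # Requiring only the class file is enough to resolve DemoMixin.
  require File.join(dir, "demo_task")
end
```

Without the require_relative line in demo_task.rb, loading that file first would raise NameError when the class body reaches the include.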

kbrock · 2024-10-08T15:54:21Z

lib/floe/workflow/states/task.rb

-          retrier = find_retrier(error["Error"]) if error
-          return if retrier.nil?
-
-          # If a different retrier is hit reset the context
-          if !context["State"].key?("RetryCount") || context["State"]["Retrier"] != retrier.error_equals
-            context["State"]["RetryCount"] = 0
-            context["State"]["Retrier"]    = retrier.error_equals
-          end
-
-          context["State"]["RetryCount"] += 1
-
-          return if context["State"]["RetryCount"] > retrier.max_attempts
-


Always wanted the retrier / catcher to act just like a State.

You ask - do you have a retrier/catcher for me?
And if you do, then you run_nonblock it just like you do for the current state.

ignore ^ - I can play with this later
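For readers following the retry bookkeeping in the removed lines above, a rough standalone sketch is shown here. The Retrier struct and the retry? helper are hypothetical names for illustration only; the real code works on context["State"] and a retrier object from floe, which this sketch replaces with a plain Hash and a Struct.

```ruby
# Hypothetical stand-in for a retrier: the error names it matches and its limit.
Retrier = Struct.new(:error_equals, :max_attempts)

# Returns true when another attempt is allowed, mirroring the diff above:
# reset the counter when a different retrier is hit, then increment and
# compare against max_attempts.
def retry?(state, retrier)
  if !state.key?("RetryCount") || state["Retrier"] != retrier.error_equals
    state["RetryCount"] = 0
    state["Retrier"]    = retrier.error_equals
  end

  state["RetryCount"] += 1
  state["RetryCount"] <= retrier.max_attempts
end

state   = {}
timeout = Retrier.new(["States.Timeout"], 2)
retry?(state, timeout) # first attempt, RetryCount becomes 1
retry?(state, timeout) # second attempt, RetryCount becomes 2
```

A third call with the same retrier would push RetryCount past max_attempts and return false, while a call with a different retrier would reset the count first.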

kbrock · 2024-10-08T15:56:17Z

lib/floe/workflow/states/retry_catch_mixin.rb

+          catcher = find_catcher(error["Error"]) if error
+          return if catcher.nil?
+
+          context.next_state = catcher.next
+          context.output     = catcher.result_path.set(context.input, error)
+          logger.info("Running state: [#{long_name}] with input [#{context.json_input}]...CatchError - next state: [#{context.next_state}] output: [#{context.json_output}]")


catcher and retrier always seemed like the same thing.

you try and match it, and if it matches, then you set the next_state / output

They are similar for sure, not sure the same thing though

ignore ^

I can play with this refactor later
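The "try to match it, and if it matches, set the next_state / output" idea can be sketched as follows. Catcher and apply_catcher here are hypothetical, simplified names, not floe's classes, and the sketch assumes the output is simply the error itself rather than going through a result path as the diff does.

```ruby
# Hypothetical catcher: a list of error names plus the state to go to on a match.
Catcher = Struct.new(:error_equals, :next_state, keyword_init: true) do
  # "States.ALL" is the wildcard error name in the states language.
  def match?(error_name)
    error_equals.include?("States.ALL") || error_equals.include?(error_name)
  end
end

# Returns nil when nothing matches; otherwise the next state and output.
# Assumption: the output is the error hash itself.
def apply_catcher(catchers, error)
  catcher = catchers.detect { |c| c.match?(error["Error"]) }
  return nil if catcher.nil?

  {"next_state" => catcher.next_state, "output" => error}
end

catchers = [
  Catcher.new(error_equals: ["States.Timeout"], next_state: "HandleTimeout"),
  Catcher.new(error_equals: ["States.ALL"],     next_state: "HandleAnything")
]
apply_catcher(catchers, {"Error" => "States.Timeout"})
```

A retrier could share the match? step, but it tracks attempt counts instead of choosing a next state, which is the difference raised in the reply above.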

agrare · 2024-10-08T16:06:49Z

@kbrock I don't want to get too bogged down by retrier / catcher changes, is there a way we can tackle that in a separate PR and for this one just say we're using catcher/retrier from Map and Task? Maybe I even drop it from Map completely and we add it later.

Fryguy · 2024-10-08T16:37:18Z

lib/floe/workflow/states/map.rb

+          return true if tolerated_failure_count      && num_failed < tolerated_failure_count
+          return true if tolerated_failure_percentage && (100 * num_failed / total.to_f) < tolerated_failure_percentage


What is supposed to happen if both count and percentage are given (or is that not allowed)? I can think of percentages where you want, say, the minimum of 50% or 3. In that case, I think you need to check both clauses, and then return, as opposed to bailing out on the first one.

(This could have fallen through with the flip from failed? to success?)

specs tend to say one vs the other.
In the branch that I had with all the error checking, I only allow one or the other.

Also, you should not be able to state Next and End at the same time.
But in the short term, we've been letting these cases slide

From the states language spec:

If a "ToleratedFailurePercentage" field and a "ToleratedFailureCount" field are both specified, the Map State will fail if either threshold is breached.

There's an additional nuance here that I'm wondering if we need to code explicitly (not 100% sure). The spec says

A Map State MAY have a "ToleratedFailurePercentage" field whose value MUST be a number between zero and 100. Its default value is zero, which means the Map State will fail if any (i.e. more than 0%) of its items fail. A "ToleratedFailurePercentage" value of 100 means the interpreter will continue starting iterations even if all items fail.

So if 0 or 100 are specified, then those have defined meanings, and I'm concerned that silly floating point math might let those fall through the cracks? I wonder if we should have at least some tests for those specific values.

Another very strange question but is it possible for total to be 0 here or does some earlier check avoid that? Asking because this is a potential divide by zero here.

From the states language spec:

If a "ToleratedFailurePercentage" field and a "ToleratedFailureCount" field are both specified, the Map State will fail if either threshold is breached.

Right this will report a failure if either threshold is hit

Another very strange question but is it possible for total to be 0 here or does some earlier check avoid that? Asking because this is a potential divide by zero here.

Technically this will return early if total is zero because num_failed will also be zero, but I can add an additional / explicit total.zero? above

Right this will report a failure if either threshold is hit

The way it's coded though, I'm not sure it does? Taking a (very) contrived example, if we had 4 items, 2 failures, threshold count of 2, and the threshold % of 25%, then the current code will return success true, but should return false.

Okay I added an explicit check for ToleratedFailurePercentage==100 (interesting the spec says it is an integer not a float so that made == 100 easier, I did &.to_i in the initialize)
I think all concerns here are covered.

The way it's coded though, I'm not sure it does? Taking a (very) contrived example, if we had 4 items, 2 failures, threshold count of 2, and the threshold % of 25%, then the current code will return success true, but should return false.

Oh yeah, this did flip when going from failed? to success?
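To make the "fail if either threshold is breached" reading concrete, here is a hedged sketch of a success check that evaluates both thresholds instead of returning on the first one that passes. The method name map_success? and its keyword arguments are hypothetical, and this is not the code merged in #284. It keeps the strict < comparison from the diff above, treats an empty item list as success, and special-cases a percentage of 100 as discussed in this thread.

```ruby
# Hypothetical standalone version of the check; not floe's actual method.
def map_success?(num_failed:, total:, tolerated_failure_count: nil, tolerated_failure_percentage: nil)
  # No items or no failures: nothing to tolerate (also avoids dividing by zero).
  return true if total.zero? || num_failed.zero?

  # Neither threshold given: any failure fails the Map state.
  return false if tolerated_failure_count.nil? && tolerated_failure_percentage.nil?

  # A threshold that was not specified cannot be breached.
  count_ok = tolerated_failure_count.nil? || num_failed < tolerated_failure_count

  # 100 is handled explicitly so that "all items failed" is still tolerated.
  percentage_ok = tolerated_failure_percentage.nil? ||
                  tolerated_failure_percentage == 100 ||
                  (100 * num_failed / total.to_f) < tolerated_failure_percentage

  # Both specified thresholds must hold; breaching either one fails the state.
  count_ok && percentage_ok
end

# 2 of 4 failed: under a count of 3, but 50% is over a 25% limit.
map_success?(num_failed: 2, total: 4, tolerated_failure_count: 3, tolerated_failure_percentage: 25)
```

With early returns on each line, that last call would succeed on the count check alone; combining the two results makes it fail on the percentage, which matches the spec text quoted above. Whether a failure count exactly equal to the threshold should be tolerated is left as in the diff and is an assumption here.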

Fryguy · 2024-10-08T17:11:26Z

I noticed the spec also has States.ExceedToleratedFailureThreshold as an error code. I'm not sure if/how we support these error codes, but just wanted to mention.

agrare · 2024-10-08T18:01:30Z

I noticed the spec also has States.ExceedToleratedFailureThreshold as an error code. I'm not sure if/how we support these error codes, but just wanted to mention.

Yeah I'm curious would this only be the error if ToleratedFailureCount or ToleratedFailurePercentage is present otherwise what would the error be? I'm taking the error from the failed sub-workflows in this case.

Added
- Add WorkflowBase base class for Workflow (#279)
- Add tool for using the aws stepfunctions simulator (#244)
- Implement Map state (#184)
- Add Map State Tolerated Failure (#282)
- Run Map iterations in parallel up to MaxConcurrency (#283)
- Implement Parallel State (#291)

Changed
- More granular compare_key and determine path at initialization time (#274)
- For Choice validation, use instance variables and not payload (#277)
- Return ExceedToleratedFailureThreshold if ToleratedFailureCount/Percentage is present (#285)

Fixed
- Fix case on log messages (#280)
- Handle either ToleratedFailureCount or ToleratedFailurePercentage (#284)

agrare added 2 commits October 7, 2024 14:30

Implement ToleratedFailure

b5e79c6

Add each_item_processor

6a4041c

agrare requested a review from Fryguy as a code owner October 7, 2024 19:10

agrare requested a review from kbrock October 7, 2024 19:10

Move retry/catch to a mixin

6ffb6e9

agrare force-pushed the map_state_add_tolerated_failure branch from 13d5049 to 6ffb6e9 Compare October 8, 2024 15:42

kbrock reviewed Oct 8, 2024

kbrock reviewed Oct 8, 2024

Fryguy added the enhancement New feature or request label Oct 8, 2024

kbrock self-assigned this Oct 8, 2024

Map#success? instead of failure?

6c32fc9

Fryguy reviewed Oct 8, 2024

Explicitly handle case with no iterations

8115aa9

Check ToleratedFailurePercentage==100

f516eaf

kbrock merged commit 8299cc4 into ManageIQ:master Oct 8, 2024
4 of 5 checks passed

agrare mentioned this pull request Oct 8, 2024

Handle either tolerated_failure_count or tolerated_failure_percentage #284

Merged

agrare mentioned this pull request Oct 8, 2024

Return ExceedToleratedFailureThreshold if ToleratedFailureCount/Percentage is present #285

Merged
