Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Map state add tolerated failure #282

Merged
merged 6 commits into from
Oct 8, 2024

Conversation

agrare
Copy link
Member

@agrare agrare commented Oct 7, 2024

Adds checking of child iterations of a Map state for failure, and based on ToleratedFailureCount/Percentage marks the Map state as failed or successful

@agrare agrare requested a review from Fryguy as a code owner October 7, 2024 19:10
@agrare agrare requested a review from kbrock October 7, 2024 19:10
@agrare agrare force-pushed the map_state_add_tolerated_failure branch from 13d5049 to 6ffb6e9 Compare October 8, 2024 15:42
Comment on lines +3 to +5
require_relative "input_output_mixin"
require_relative "non_terminal_mixin"
require_relative "retry_catch_mixin"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels like these belong with the other requires in the global floe.rb

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is there as well but keeping those alphabetic means this fails to resolve the constant.
Alternatively, I can move all of the _mixin requires above the others in floe.rb

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like making the require not alphabetical.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the require being in the place that needs it and not trying to solve a dependency graph in floe.rb 😆

Copy link
Member

@kbrock kbrock Oct 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh.

We have been putting all requires up front in floe.rb.
(I did not just say: we have been doing it this way and we have to continue doing that)

Do we want to get away from that?

wondering if InputOutputMixin and the others do not belong in /states/.
That way we can just require these and then all states and not have to deal with orders.

Does look like these 3 mixins are used by only states, so alternatively we can just move the mixins up front.

require_relative "floe/workflow/state"
require_relative "floe/workflow/states/input_output_mixin"
require_relative "floe/workflow/states/non_terminal_mixin"
# require_relative "floe/workflow/states/*_mixin"
require_relative "floe/workflow/states/choice"
require_relative "floe/workflow/states/fail"
require_relative "floe/workflow/states/map"
require_relative "floe/workflow/states/parallel"
require_relative "floe/workflow/states/pass"
require_relative "floe/workflow/states/succeed"
require_relative "floe/workflow/states/task"
require_relative "floe/workflow/states/wait"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have been putting all requires up front in floe.rb.
(I did not just say: we have been doing it this way and we have to continue doing that)

No I don't think we need to stop doing that, I think of require 'floe' as "just require everything" but if a particular class needs something specific it should require it that way a caller could require 'floe/workflow/context' and not having to bring in the memory of the entire gem.

Comment on lines -94 to -106
retrier = find_retrier(error["Error"]) if error
return if retrier.nil?

# If a different retrier is hit reset the context
if !context["State"].key?("RetryCount") || context["State"]["Retrier"] != retrier.error_equals
context["State"]["RetryCount"] = 0
context["State"]["Retrier"] = retrier.error_equals
end

context["State"]["RetryCount"] += 1

return if context["State"]["RetryCount"] > retrier.max_attempts

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always wanted the retrier / catcher to act just like a State.

You ask - do you have a retrier/catcher for me?
And if you do, then you run_nonblock it just like you do for the current state.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ignore ^ - I can play with this later

Comment on lines +37 to +42
catcher = find_catcher(error["Error"]) if error
return if catcher.nil?

context.next_state = catcher.next
context.output = catcher.result_path.set(context.input, error)
logger.info("Running state: [#{long_name}] with input [#{context.json_input}]...CatchError - next state: [#{context.next_state}] output: [#{context.json_output}]")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

catcher and retrier always seemed like the same thing.

you try and match it, and if it matches, then you set the next_state / output

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are similar for sure, not sure the same thing though

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ignore ^

I can play with this refactor later

@agrare
Copy link
Member Author

agrare commented Oct 8, 2024

@kbrock I don't want to get too bogged down by retrier / catcher changes, is there a way we can tackle that in a separate PR and for this one just say we're using catcher/retrier from Map and Task? Maybe I even drop it from Map completely and we add it later.

@Fryguy Fryguy added the enhancement New feature or request label Oct 8, 2024
@kbrock kbrock self-assigned this Oct 8, 2024
Comment on lines +116 to +117
return true if tolerated_failure_count && num_failed < tolerated_failure_count
return true if tolerated_failure_percentage && (100 * num_failed / total.to_f) < tolerated_failure_percentage
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is supposed to happen if both count and percentage are given (or is that not allowed)? I can think of percentages where you want, say, the minimum of 50% or 3. In that case, I think you need to check both clauses, and then return, as opposed to bailing out on the first one.

(This could have fallen through with the flip from failed? to success?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

specs tend to say one vs the other.
In the branch that I had with all error checkings, I only allow or the other.

Also, you should not be able to state Next and End at the same time.
But in the short term, we've been letting these cases slide

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the states language spec:

If a "ToleratedFailurePercentage" field and a "ToleratedFailureCount" field are both specified, the Map State will fail if either threshold is breached.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an additional nuance here that I'm wondering if we need to code explicitly (not 100% sure). The spec says

A Map State MAY have a "ToleratedFailurePercentage" field whose value MUST be a number between zero and 100. Its default value is zero, which means the Map State will fail if any (i.e. more than 0%) of its items fail. A "ToleratedFailurePercentage" value of 100 means the interpreter will continue starting iterations even if all items fail.

So if 0 or 100 are specified, then those have defined meanings, and I'm concerned that silly floating point math might let those fall through the cracks? I wonder if we should have at least some tests for those specific values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another very strange question but is it possible for total to be 0 here or does some earlier check avoid that? Asking because this is a potential divide by zero here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the states language spec:

If a "ToleratedFailurePercentage" field and a "ToleratedFailureCount" field are both specified, the Map State will fail if either threshold is breached.

Right this will report a failure if either threshold is hit

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another very strange question but is it possible for total to be 0 here or does some earlier check avoid that? Asking because this is a potential divide by zero here.

Technically this will return early if total is zero because num_failed will also be zero, but I can add an additional / explicit total.zero? above

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right this will report a failure if either threshold is hit

The way it's coded though, I'm not sure it does? Taking a (very) contrived example, if we had 4 items, 2 failures, threshold count of 2, and the threshold % of 25%, then the current code will return success true, but should return false.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I added an explicit check for ToleratedFailurePercentage==100 (interesting the spec says it is an integer not a float so that made == 100 easier, I did &.to_i in the initialize)
I think all concerns here are convered.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way it's coded though, I'm not sure it does? Taking a (very) contrived example, if we had 4 items, 2 failures, threshold count of 2, and the threshold % of 25%, then the current code will return success true, but should return false.

Oh yeah, this did flip when going from failed? to success?

@Fryguy
Copy link
Member

Fryguy commented Oct 8, 2024

I noticed the spec also has States.ExceedToleratedFailureThreshold as an error code. I'm not sure if/how we support these error codes, but just wanted to mention.

@kbrock kbrock merged commit 8299cc4 into ManageIQ:master Oct 8, 2024
4 of 5 checks passed
@agrare
Copy link
Member Author

agrare commented Oct 8, 2024

I noticed the spec also has States.ExceedToleratedFailureThreshold as an error code. I'm not sure if/how we support these error codes, but just wanted to mention.

Yeah I'm curious would this only be the error if ToleratedFailureCount or ToleratedFailurePercentage is present otherwise what would the error be? I'm taking the error from the failed sub-workflows in this case.

agrare added a commit that referenced this pull request Oct 28, 2024
Added
- Add WorkflowBase base class for Workflow (#279)
- Add tool for using the aws stepfunctions simulator (#244)
- Implement Map state (#184)
- Add Map State Tolerated Failure (#282)
- Run Map iterations in parallel up to MaxConcurrency (#283)
- Implement Parallel State (#291)

Changed
- More granular compare_key and determine path at initialization time (#274)
- For Choice validation, use instance variables and not payload (#277)
- Return ExceedToleratedFailureThreshold if ToleratedFailureCount/Percentage is present (#285)

Fixed
- Fix case on log messages (#280)
- Handle either ToleratedFailureCount or ToleratedFailurePercentage (#284)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants