-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Map state add tolerated failure #282
Map state add tolerated failure #282
Conversation
13d5049
to
6ffb6e9
Compare
require_relative "input_output_mixin" | ||
require_relative "non_terminal_mixin" | ||
require_relative "retry_catch_mixin" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feels like these belong with the other requires
in the global floe.rb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is there as well but keeping those alphabetic means this fails to resolve the constant.
Alternatively, I can move all of the _mixin requires above the others in floe.rb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like making the require not alphabetical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the require being in the place that needs it and not trying to solve a dependency graph in floe.rb 😆
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
huh.
We have been putting all requires up front in floe.rb
.
(I did not just say: we have been doing it this way and we have to continue doing that)
Do we want to get away from that?
wondering if InputOutputMixin and the others do not belong in /states/
.
That way we can just require these and then all states and not have to deal with orders.
Does look like these 3 mixins are used by only states, so alternatively we can just move the mixins up front.
require_relative "floe/workflow/state"
require_relative "floe/workflow/states/input_output_mixin"
require_relative "floe/workflow/states/non_terminal_mixin"
# require_relative "floe/workflow/states/*_mixin"
require_relative "floe/workflow/states/choice"
require_relative "floe/workflow/states/fail"
require_relative "floe/workflow/states/map"
require_relative "floe/workflow/states/parallel"
require_relative "floe/workflow/states/pass"
require_relative "floe/workflow/states/succeed"
require_relative "floe/workflow/states/task"
require_relative "floe/workflow/states/wait"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have been putting all requires up front in floe.rb.
(I did not just say: we have been doing it this way and we have to continue doing that)
No I don't think we need to stop doing that, I think of require 'floe'
as "just require everything" but if a particular class needs something specific it should require it that way a caller could require 'floe/workflow/context'
and not having to bring in the memory of the entire gem.
retrier = find_retrier(error["Error"]) if error | ||
return if retrier.nil? | ||
|
||
# If a different retrier is hit reset the context | ||
if !context["State"].key?("RetryCount") || context["State"]["Retrier"] != retrier.error_equals | ||
context["State"]["RetryCount"] = 0 | ||
context["State"]["Retrier"] = retrier.error_equals | ||
end | ||
|
||
context["State"]["RetryCount"] += 1 | ||
|
||
return if context["State"]["RetryCount"] > retrier.max_attempts | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Always wanted the retrier / catcher to act just like a State
.
You ask - do you have a retrier/catcher for me?
And if you do, then you run_nonblock
it just like you do for the current state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ignore ^ - I can play with this later
catcher = find_catcher(error["Error"]) if error | ||
return if catcher.nil? | ||
|
||
context.next_state = catcher.next | ||
context.output = catcher.result_path.set(context.input, error) | ||
logger.info("Running state: [#{long_name}] with input [#{context.json_input}]...CatchError - next state: [#{context.next_state}] output: [#{context.json_output}]") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
catcher and retrier always seemed like the same thing.
you try and match it, and if it matches, then you set the next_state / output
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are similar for sure, not sure the same thing though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ignore ^
I can play with this refactor later
@kbrock I don't want to get too bogged down by retrier / catcher changes, is there a way we can tackle that in a separate PR and for this one just say we're using catcher/retrier from Map and Task? Maybe I even drop it from Map completely and we add it later. |
return true if tolerated_failure_count && num_failed < tolerated_failure_count | ||
return true if tolerated_failure_percentage && (100 * num_failed / total.to_f) < tolerated_failure_percentage |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is supposed to happen if both count and percentage are given (or is that not allowed)? I can think of percentages where you want, say, the minimum of 50% or 3. In that case, I think you need to check both clauses, and then return, as opposed to bailing out on the first one.
(This could have fallen through with the flip from failed? to success?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
specs tend to say one vs the other.
In the branch that I had with all error checkings, I only allow or the other.
Also, you should not be able to state Next and End at the same time.
But in the short term, we've been letting these cases slide
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the states language spec:
If a "ToleratedFailurePercentage" field and a "ToleratedFailureCount" field are both specified, the Map State will fail if either threshold is breached.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's an additional nuance here that I'm wondering if we need to code explicitly (not 100% sure). The spec says
A Map State MAY have a "ToleratedFailurePercentage" field whose value MUST be a number between zero and 100. Its default value is zero, which means the Map State will fail if any (i.e. more than 0%) of its items fail. A "ToleratedFailurePercentage" value of 100 means the interpreter will continue starting iterations even if all items fail.
So if 0 or 100 are specified, then those have defined meanings, and I'm concerned that silly floating point math might let those fall through the cracks? I wonder if we should have at least some tests for those specific values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another very strange question but is it possible for total
to be 0 here or does some earlier check avoid that? Asking because this is a potential divide by zero here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the states language spec:
If a "ToleratedFailurePercentage" field and a "ToleratedFailureCount" field are both specified, the Map State will fail if either threshold is breached.
Right this will report a failure if either threshold is hit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another very strange question but is it possible for total to be 0 here or does some earlier check avoid that? Asking because this is a potential divide by zero here.
Technically this will return early if total is zero because num_failed will also be zero, but I can add an additional / explicit total.zero? above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right this will report a failure if either threshold is hit
The way it's coded though, I'm not sure it does? Taking a (very) contrived example, if we had 4 items, 2 failures, threshold count of 2, and the threshold % of 25%, then the current code will return success true, but should return false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay I added an explicit check for ToleratedFailurePercentage==100 (interesting the spec says it is an integer not a float so that made == 100
easier, I did &.to_i
in the initialize)
I think all concerns here are convered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way it's coded though, I'm not sure it does? Taking a (very) contrived example, if we had 4 items, 2 failures, threshold count of 2, and the threshold % of 25%, then the current code will return success true, but should return false.
Oh yeah, this did flip when going from failed?
to success?
I noticed the spec also has |
Yeah I'm curious would this only be the error if |
Added - Add WorkflowBase base class for Workflow (#279) - Add tool for using the aws stepfunctions simulator (#244) - Implement Map state (#184) - Add Map State Tolerated Failure (#282) - Run Map iterations in parallel up to MaxConcurrency (#283) - Implement Parallel State (#291) Changed - More granular compare_key and determine path at initialization time (#274) - For Choice validation, use instance variables and not payload (#277) - Return ExceedToleratedFailureThreshold if ToleratedFailureCount/Percentage is present (#285) Fixed - Fix case on log messages (#280) - Handle either ToleratedFailureCount or ToleratedFailurePercentage (#284)
Adds checking of child iterations of a Map state for failure, and based on ToleratedFailureCount/Percentage marks the Map state as failed or successful