Add more thread pools #465
Conversation
Codecov Report

@@             Coverage Diff              @@
##               main     #465      +/-   ##
============================================
+ Coverage     71.84%   71.91%    +0.06%
+ Complexity      622      621        -1
============================================
  Files            78       78
  Lines          3136     3133        -3
  Branches        236      234        -2
============================================
  Hits           2253     2253
  Misses          776      776
+ Partials        107      104        -3

☔ View full report in Codecov by Sentry.
One flaky failure on this PR's first run: it was waiting for the ML config index, which took over a minute. Bumped the max failures up to 4, since there are three such calls/test failures for this.
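For illustration only, a minimal sketch of the kind of retry wrapper this implies; `runWithMaxFailures` and the one-second backoff are hypothetical, not the project's actual test utility:

```java
import java.util.concurrent.Callable;

// Hypothetical helper (not the project's actual utility): run a flaky check,
// tolerating up to maxFailures attempts before rethrowing the last error.
static <T> T runWithMaxFailures(int maxFailures, Callable<T> check) throws Exception {
    int failures = 0;
    while (true) {
        try {
            return check.call();
        } catch (Exception | AssertionError e) {
            if (++failures >= maxFailures) {
                throw e;
            }
            Thread.sleep(1000L); // brief backoff before retrying
        }
    }
}
```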
Force-pushed from 9cbc7a4 to f36f0b7
Thanks for making these changes!
When you say "testing on a real cluster with ml commons set up", do you mean a local multi-node cluster with flow framework and ml commons, using the flow-framework APIs?
No, a full EC2 cluster using OpenSearch CDK on nodes with more than 4 processors.
Summary of specific things that happen on startup that this change works around:
Thanks for digging into this @dbwiddis! LGTM overall with a few questions.
Resolved review comments on src/test/java/org/opensearch/flowframework/rest/FlowFrameworkRestApiIT.java
Signed-off-by: Daniel Widdis <widdis@gmail.com>
Force-pushed from f36f0b7 to 1fb3140
Still getting "ml config index exists" failures even with this delay. I'm not sure what else we can do.
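As a hedged sketch of one possible mitigation (not necessarily what this PR does): poll for the ML Commons config index before the tests start. The index name ".plugins-ml-config", the two-minute budget, and the `waitForMlConfigIndex` helper are assumptions here; it presumes an OpenSearchRestTestCase subclass so `adminClient()` and `assertBusy()` are available.

```java
import java.util.concurrent.TimeUnit;

import org.opensearch.client.Request;
import org.opensearch.client.Response;
import org.opensearch.client.ResponseException;

// Sketch: block test startup until the assumed ML Commons config index exists.
protected void waitForMlConfigIndex() throws Exception {
    assertBusy(() -> {
        try {
            // HEAD returns 200 once the index exists.
            Response response = adminClient().performRequest(new Request("HEAD", "/.plugins-ml-config"));
            assertEquals(200, response.getStatusLine().getStatusCode());
        } catch (ResponseException e) {
            // 404 until ML Commons creates the index; fail the assertion so assertBusy retries.
            fail("ML config index not yet created");
        }
    }, 2, TimeUnit.MINUTES);
}
```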
Seems the process node tests are leaking threads:
Signed-off-by: Daniel Widdis <widdis@gmail.com>
Yeah, saw that. The "scheduler" thread would have interrupted that... Not sure whether that's really a problem or just where the code happened to be when the whole test suite timed out for some other reason. In any case, I saw a lot of simplification we could do in ProcessNode after switching to ActionFuture, and put that in.
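A rough sketch of the ActionFuture pattern being described; `threadPool`, the pool name, `doWorkflowStep()`, and the timeout value are illustrative placeholders, not the actual ProcessNode code:

```java
import org.opensearch.action.support.PlainActionFuture;
import org.opensearch.common.unit.TimeValue;

// Complete the future from the worker thread and let actionGet() handle
// blocking, timeout, and exception unwrapping for the caller.
PlainActionFuture<String> future = PlainActionFuture.newFuture();
threadPool.executor("provision_workflow").execute(() -> {
    try {
        future.onResponse(doWorkflowStep()); // hypothetical step execution
    } catch (Exception e) {
        future.onFailure(e);
    }
});
// Replaces hand-rolled wait/interrupt logic: a timeout surfaces here as an
// exception instead of leaving a blocked thread behind.
String result = future.actionGet(TimeValue.timeValueSeconds(15));
```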
Signed-off-by: Daniel Widdis <widdis@gmail.com>
Tested locally with a thread pool max of 1 and it failed on provisioning taking too long, so I also increased the minimum "max" size for the pools even with a small number of processors. That seemed to make the tests go a lot faster, too.
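A sketch of the sizing idea under discussion; the floor of 4 and the processors-minus-one scaling are illustrative assumptions, not the exact values merged:

```java
import org.opensearch.common.settings.Settings;
import org.opensearch.common.util.concurrent.OpenSearchExecutors;

// Scale the pool with the node's allocated processors, but never let "max"
// drop below a floor so a 1- or 2-CPU CI runner still has enough threads
// to finish provisioning. Assumes a Settings instance is in scope.
int allocatedProcessors = OpenSearchExecutors.allocatedProcessors(settings);
int provisionMax = Math.max(4, allocatedProcessors - 1);
```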
* Add more thread pools
  Signed-off-by: Daniel Widdis <widdis@gmail.com>
* Increase maxFailures to 4
  Signed-off-by: Daniel Widdis <widdis@gmail.com>
* Wait before starting tests
  Signed-off-by: Daniel Widdis <widdis@gmail.com>
* Improve ProcessNode timeout
  Signed-off-by: Daniel Widdis <widdis@gmail.com>
* Increase minimum thread pool requirement
  Signed-off-by: Daniel Widdis <widdis@gmail.com>

---------

Signed-off-by: Daniel Widdis <widdis@gmail.com>
(cherry picked from commit a812e51)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Add more thread pools (#465)

* Add more thread pools
* Increase maxFailures to 4
* Wait before starting tests
* Improve ProcessNode timeout
* Increase minimum thread pool requirement

---------

(cherry picked from commit a812e51)
Signed-off-by: Daniel Widdis <widdis@gmail.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Description
Spent several more hours today digging into the multi-node integ test failures. In summary:
Playing around with the thread pool showed it was very limiting, so I split provisioning, deprovisioning, and retry onto separate thread pools (see the registration sketch after this summary).
I'm confident that this flakiness is due to a lot of setup work that is required in ML Commons and takes longer on the GitHub runners due to smaller numbers of threads available. Testing on a real cluster with ML Commons set up hasn't shown any issues.
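For context, a sketch of how a plugin can register separate scaling pools via getExecutorBuilders; the pool names, sizes, and keep-alive below are illustrative assumptions, not the exact values in this PR:

```java
import java.util.List;

import org.opensearch.common.settings.Settings;
import org.opensearch.common.unit.TimeValue;
import org.opensearch.common.util.concurrent.OpenSearchExecutors;
import org.opensearch.threadpool.ExecutorBuilder;
import org.opensearch.threadpool.ScalingExecutorBuilder;

// Lives in the plugin class: register one scaling executor per workload so a
// slow provision step can no longer starve deprovision or retry work.
@Override
public List<ExecutorBuilder<?>> getExecutorBuilders(Settings settings) {
    int allocated = OpenSearchExecutors.allocatedProcessors(settings);
    int max = Math.max(4, allocated - 1); // keep a usable max even on small runners
    TimeValue keepAlive = TimeValue.timeValueMinutes(5);
    return List.of(
        new ScalingExecutorBuilder("flow_framework_provision", 1, max, keepAlive),
        new ScalingExecutorBuilder("flow_framework_deprovision", 1, max, keepAlive),
        new ScalingExecutorBuilder("flow_framework_retry", 1, max, keepAlive)
    );
}
```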
Issues Resolved
More tweaks to the fixes for #461
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.