Add more thread pools #465

Merged
merged 5 commits into opensearch-project:main from more-thread-pools on Jan 30, 2024

Conversation

dbwiddis
Member

Description

Spent several more hours today digging into the Multi-node integ test failures. In summary:

  • The first REST call to other nodes sometimes times out; subsequent calls always seem to succeed. Added a single retry in the TestHelpers class, which has caught all of these instances (a rough sketch of this kind of retry follows this list).
  • Workflow provisioning is flakily delayed by the ML Commons create connector step. With a shorter timeout, we just get a failed provisioning. With a longer timeout, we continue on to deploy the model but then fail with "Deploy model failed with error : {"smSrrtMdQgKo8ZLMpnTxtA":"The ML encryption master key has not been initialized yet. Please retry after waiting for 10 seconds."}". This passes on the second attempt with the Gradle retry.
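
For illustration only, here is a minimal sketch of the kind of single retry described above, assuming a static helper wrapping the low-level REST client; the method name, signature, and behavior are illustrative, not the actual TestHelpers code.

    // Hypothetical retry wrapper (uses org.opensearch.client.Request/Response/RestClient).
    public static Response makeRequest(RestClient client, String method, String endpoint, String jsonEntity) throws IOException {
        Request request = new Request(method, endpoint);
        if (jsonEntity != null) {
            request.setJsonEntity(jsonEntity);
        }
        try {
            return client.performRequest(request);
        } catch (IOException e) {
            // The first call to another node occasionally times out on multi-node clusters; retry once.
            return client.performRequest(request);
        }
    }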

Playing around with the thread pool showed it was very limiting, so I split provisioning, deprovisioning, and retry onto separate thread pools (a rough sketch of the idea is below).
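
As a rough sketch of what splitting these onto separate pools can look like in an OpenSearch plugin (the provision pool name appears in the thread dumps later in this conversation; the other names, sizes, and keep-alive values here are assumptions, not necessarily what this PR merged):

    // Sketch only: register separate scaling pools so provision, deprovision, and retry work
    // don't starve each other. Assumes org.opensearch.threadpool.ScalingExecutorBuilder and
    // org.opensearch.common.util.concurrent.OpenSearchExecutors.
    @Override
    public List<ExecutorBuilder<?>> getExecutorBuilders(Settings settings) {
        int processors = OpenSearchExecutors.allocatedProcessors(settings);
        return List.of(
            new ScalingExecutorBuilder("opensearch_provision_workflow", 1, Math.max(4, processors), TimeValue.timeValueMinutes(5)),
            new ScalingExecutorBuilder("opensearch_deprovision_workflow", 1, Math.max(2, processors), TimeValue.timeValueMinutes(5)),
            new ScalingExecutorBuilder("opensearch_workflow_retry", 1, Math.max(2, processors), TimeValue.timeValueMinutes(1))
        );
    }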

I'm confident that this flakiness is due to a lot of setup work that is required in ML Commons and takes longer on the GitHub runners due to smaller numbers of threads available. Testing on a real cluster with ML Commons set up hasn't shown any issues.

Issues Resolved

More tweaks to fixes of #461

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.


codecov bot commented Jan 28, 2024

Codecov Report

Attention: 1 line in your changes is missing coverage. Please review.

Comparison is base (5d452f3) 71.84% compared to head (4c43204) 71.91%.

Files Patch % Lines
...rk/transport/ProvisionWorkflowTransportAction.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #465      +/-   ##
============================================
+ Coverage     71.84%   71.91%   +0.06%     
+ Complexity      622      621       -1     
============================================
  Files            78       78              
  Lines          3136     3133       -3     
  Branches        236      234       -2     
============================================
  Hits           2253     2253              
  Misses          776      776              
+ Partials        107      104       -3     

☔ View full report in Codecov by Sentry.

@dbwiddis
Member Author

One flaky failure on this PR's first run: awaiting the ML config index took over a minute. Bumped the max failures up to 4 since there are 3 such calls/test failures for this (a rough sketch of the setting is below).
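
For context, a build.gradle sketch of what that bump could look like, assuming the Gradle test-retry plugin is what drives the retries mentioned above; the surrounding values are illustrative, not necessarily the project's actual configuration.

    test {
        retry {
            failOnPassedAfterRetry = false
            maxRetries = 5
            maxFailures = 4   // tolerate the ~3 flaky ML-config-index failures plus one extra
        }
    }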

dbwiddis force-pushed the more-thread-pools branch 3 times, most recently from 9cbc7a4 to f36f0b7 on January 29, 2024 at 17:03
@amitgalitz
Member

Thanks for making these changes! A quick clarifying question:

I'm confident that this flakiness is due to a lot of setup work that is required in ML Commons and takes longer on the GitHub runners due to smaller numbers of threads available. Testing on a real cluster with ML Commons set up hasn't shown any issues.

When you say "testing on a real cluster with ml commons set up", do you mean a local multi node cluster with flow framework and ml commons and using the flow-framework APIs?

@dbwiddis
Member Author

"testing on a real cluster with ml commons set up", do you mean a local multi node cluster with flow framework and ml commons and using the flow-framework APIs?

No, a full EC2 cluster using OpenSearch CDK on nodes with more than 4 processors.

@dbwiddis
Member Author

Summary of specific things that happen on startup that this change works around:

  1. REST client initialization is done by a superclass of the RestApiIT in a @Before method. Starting tests immediately doesn't allow this to complete on the other nodes, resulting in one category of flaky tests. Waiting more than about 5 seconds before starting the tests seems to avoid this. It's fixed with both a single retry on makeRequest and a longer startup delay. I wouldn't normally include that retry in main code, but I think it's fine in a test helper class.
  2. The create connector call sometimes takes a long time due to external calls. The existing default 10s delay was insufficient, so I increased the default delay.
  3. The .ml-commons-config index isn't created until (1) the ML Commons plugin init starts a cron job and (2) the second cycle of that job runs, at 10s increments, meaning about 20 seconds after node startup. With very few processors available to run threads, executing the full RestApiIT consumes most of the processing and this init can take over 60 seconds. Added a one-time 25s delay before executing the tests to give ML Commons time to init (a rough sketch follows this list).
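
A rough sketch of the one-time delay described in (3), assuming a JUnit @Before hook somewhere in the IT setup; the field and method names are hypothetical, not the actual FlowFrameworkRestApiIT code.

    // Hypothetical one-time startup delay before the first test runs.
    private static boolean waitedForMlCommonsInit = false;

    @Before
    public void waitForMlCommonsInit() throws InterruptedException {
        if (!waitedForMlCommonsInit) {
            // ML Commons creates .ml-commons-config on the second 10s cron cycle (~20s after
            // node startup), so wait 25s once to give it headroom on slow runners.
            Thread.sleep(25_000);
            waitedForMlCommonsInit = true;
        }
    }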

owaiskazi19 (Member) left a comment


Thanks for digging into this @dbwiddis! LGTM overall, with a few questions.

Signed-off-by: Daniel Widdis <widdis@gmail.com>
Signed-off-by: Daniel Widdis <widdis@gmail.com>
Signed-off-by: Daniel Widdis <widdis@gmail.com>
@dbwiddis
Member Author

Still getting failed "ml config index exists" errors even with this delay. I'm not sure what else we can do.

Tests with failures:
 - org.opensearch.flowframework.rest.FlowFrameworkRestApiIT.testFailedUpdateWorkflow
 - org.opensearch.flowframework.rest.FlowFrameworkRestApiIT.testCreateAndProvisionRemoteModelWorkflow
 - org.opensearch.flowframework.rest.FlowFrameworkRestApiIT.testCreateAndProvisionAgentFrameworkWorkflow

@joshpalis
Member

Seems the process node tests are leaking threads:

  2>    1) Thread[id=113, name=opensearch[org.opensearch.flowframework.workflow.ProcessNodeTests][[timer]], state=TIMED_WAITING, group=TGRP-ProcessNodeTests]
  2>    2) Thread[id=114, name=opensearch[org.opensearch.flowframework.workflow.ProcessNodeTests][opensearch_provision_workflow][T#1], state=WAITING, group=TGRP-ProcessNodeTests]
  2>    3) Thread[id=116, name=opensearch[org.opensearch.flowframework.workflow.ProcessNodeTests][opensearch_provision_workflow][T#2], state=WAITING, group=TGRP-ProcessNodeTests]
  2>    4) Thread[id=115, name=opensearch[org.opensearch.flowframework.workflow.ProcessNodeTests][scheduler][T#1], state=WAITING, group=TGRP-ProcessNodeTests]
  2>    5) Thread[id=112, name=TEST-ProcessNodeTests.testNode-seed#[2316B57D3BD543DC], state=WAITING, group=TGRP-ProcessNodeTests]

Signed-off-by: Daniel Widdis <widdis@gmail.com>
@dbwiddis
Member Author

Seems the process node tests are leaking threads:

  2>    1) Thread[id=113, name=opensearch[org.opensearch.flowframework.workflow.ProcessNodeTests][[timer]], state=TIMED_WAITING, group=TGRP-ProcessNodeTests]
  2>    2) Thread[id=114, name=opensearch[org.opensearch.flowframework.workflow.ProcessNodeTests][opensearch_provision_workflow][T#1], state=WAITING, group=TGRP-ProcessNodeTests]
  2>    3) Thread[id=116, name=opensearch[org.opensearch.flowframework.workflow.ProcessNodeTests][opensearch_provision_workflow][T#2], state=WAITING, group=TGRP-ProcessNodeTests]
  2>    4) Thread[id=115, name=opensearch[org.opensearch.flowframework.workflow.ProcessNodeTests][scheduler][T#1], state=WAITING, group=TGRP-ProcessNodeTests]
  2>    5) Thread[id=112, name=TEST-ProcessNodeTests.testNode-seed#[2316B57D3BD543DC], state=WAITING, group=TGRP-ProcessNodeTests]

Yeah, saw that. The "scheduler" thread would have interrupted that...

Not sure whether that's really a problem or just where the code happened to be when the whole test suite timed out for some other reason.

In any case, I saw a lot of simplification we could do in ProcessNode after switching to an action future, and put that in (a rough sketch of the idea is below).
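
A rough sketch of the action-future idea, assuming OpenSearch's PlainActionFuture and hypothetical step, pool, and timeout names; this is not the actual ProcessNode code.

    // The worker completes the future; the caller blocks with a timeout, so no separate
    // timer thread is needed to cancel a stuck node.
    PlainActionFuture<WorkflowData> future = PlainActionFuture.newFuture();
    threadPool.executor(PROVISION_WORKFLOW_THREAD_POOL).execute(() -> {
        try {
            future.onResponse(step.execute(input));   // hypothetical step API
        } catch (Exception e) {
            future.onFailure(e);
        }
    });
    // Throws if the step fails or the node timeout elapses.
    WorkflowData result = future.actionGet(nodeTimeout);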

Signed-off-by: Daniel Widdis <widdis@gmail.com>
@dbwiddis
Member Author

Tested locally with a thread pool max of 1 and the test failed because provisioning took too long, so I also increased the minimum "max" scale for the pools even with a small number of processors. That seemed to make the tests go a lot faster, too (sketch below).
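
Sketched, the sizing idea reads roughly like this (the floor of 4 is an assumption, not necessarily the merged value):

    // Even on 1-2 processor CI runners, let each pool scale beyond a single thread.
    int maxThreads = Math.max(4, OpenSearchExecutors.allocatedProcessors(settings));
    new ScalingExecutorBuilder("opensearch_provision_workflow", 1, maxThreads, TimeValue.timeValueMinutes(5));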

dbwiddis merged commit a812e51 into opensearch-project:main on Jan 30, 2024
29 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jan 30, 2024
* Add more thread pools

Signed-off-by: Daniel Widdis <widdis@gmail.com>

* Increase maxFailures to 4

Signed-off-by: Daniel Widdis <widdis@gmail.com>

* Wait before starting tests

Signed-off-by: Daniel Widdis <widdis@gmail.com>

* Improve ProcessNode timeout

Signed-off-by: Daniel Widdis <widdis@gmail.com>

* Increase minimum thread pool requirement

Signed-off-by: Daniel Widdis <widdis@gmail.com>

---------

Signed-off-by: Daniel Widdis <widdis@gmail.com>
(cherry picked from commit a812e51)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
dbwiddis deleted the more-thread-pools branch on January 30, 2024 at 03:18
dbwiddis pushed a commit that referenced this pull request Jan 30, 2024
Add more thread pools (#465)

* Add more thread pools
* Increase maxFailures to 4
* Wait before starting tests
* Improve ProcessNode timeout
* Increase minimum thread pool requirement

---------

(cherry picked from commit a812e51)

Signed-off-by: Daniel Widdis <widdis@gmail.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Labels
backport 2.x (backport PRs to 2.x branch), skip-changelog
4 participants