Add more thread pools #465

Merged
merged 5 commits into opensearch-project:main from more-thread-pools on Jan 30, 2024

Conversation

dbwiddis
Member

Description

Spent several more hours today digging into the Multi-node integ test failures. In summary:

  • The first REST call to other nodes sometimes times out; subsequent calls always seem to succeed. Added a single retry in the TestHelpers class, which has caught all of these instances (a rough sketch of this kind of retry follows this list).
  • Workflow provisioning is flakily delayed by the ML Commons create connector step. With a shorter timeout, we just get a failed provisioning. With a longer timeout, we continue on to deploy the model but then fail with "Deploy model failed with error : {"smSrrtMdQgKo8ZLMpnTxtA":"The ML encryption master key has not been initialized yet. Please retry after waiting for 10 seconds."}". This passes on the second attempt with the Gradle retry.
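
For illustration only, here is a minimal sketch of the kind of single retry described above, assuming a static helper wrapping the low-level REST client; the method name, signature, and behavior are illustrative, not the actual TestHelpers code.

    // Hypothetical retry wrapper (uses org.opensearch.client.Request/Response/RestClient).
    public static Response makeRequest(RestClient client, String method, String endpoint, String jsonEntity) throws IOException {
        Request request = new Request(method, endpoint);
        if (jsonEntity != null) {
            request.setJsonEntity(jsonEntity);
        }
        try {
            return client.performRequest(request);
        } catch (IOException e) {
            // The first call to another node occasionally times out on multi-node clusters; retry once.
            return client.performRequest(request);
        }
    }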

Playing around with the thread pool showed it was very limiting, so I split provisioning, deprovisioning, and retry onto separate thread pools (a rough sketch of the idea is below).
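
As a rough sketch of what splitting these onto separate pools can look like in an OpenSearch plugin (the provision pool name appears in the thread dumps later in this conversation; the other names, sizes, and keep-alive values here are assumptions, not necessarily what this PR merged):

    // Sketch only: register separate scaling pools so provision, deprovision, and retry work
    // don't starve each other. Assumes org.opensearch.threadpool.ScalingExecutorBuilder and
    // org.opensearch.common.util.concurrent.OpenSearchExecutors.
    @Override
    public List<ExecutorBuilder<?>> getExecutorBuilders(Settings settings) {
        int processors = OpenSearchExecutors.allocatedProcessors(settings);
        return List.of(
            new ScalingExecutorBuilder("opensearch_provision_workflow", 1, Math.max(4, processors), TimeValue.timeValueMinutes(5)),
            new ScalingExecutorBuilder("opensearch_deprovision_workflow", 1, Math.max(2, processors), TimeValue.timeValueMinutes(5)),
            new ScalingExecutorBuilder("opensearch_workflow_retry", 1, Math.max(2, processors), TimeValue.timeValueMinutes(1))
        );
    }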

I'm confident that this flakiness is due to a lot of setup work that is required in ML Commons and takes longer on the GitHub runners due to smaller numbers of threads available. Testing on a real cluster with ML Commons set up hasn't shown any issues.

Issues Resolved

More tweaks to fixes of #461

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.


codecov bot commented Jan 28, 2024

Codecov Report

Attention: 1 line in your changes is missing coverage. Please review.

Comparison is base (5d452f3) 71.84% compared to head (4c43204) 71.91%.

Files Patch % Lines
...rk/transport/ProvisionWorkflowTransportAction.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #465      +/-   ##
============================================
+ Coverage     71.84%   71.91%   +0.06%     
+ Complexity      622      621       -1     
============================================
  Files            78       78              
  Lines          3136     3133       -3     
  Branches        236      234       -2     
============================================
  Hits           2253     2253              
  Misses          776      776              
+ Partials        107      104       -3     

☔ View full report in Codecov by Sentry.

@dbwiddis
Member Author

One flaky failure on this PR's first run: awaiting the ML config index took over a minute. Bumped the max failures up to 4 since there are 3 such calls/test failures for this (a rough sketch of the setting is below).
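
For context, a build.gradle sketch of what that bump could look like, assuming the Gradle test-retry plugin is what drives the retries mentioned above; the surrounding values are illustrative, not necessarily the project's actual configuration.

    test {
        retry {
            failOnPassedAfterRetry = false
            maxRetries = 5
            maxFailures = 4   // tolerate the ~3 flaky ML-config-index failures plus one extra
        }
    }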

dbwiddis force-pushed the more-thread-pools branch 3 times, most recently from 9cbc7a4 to f36f0b7 on January 29, 2024 at 17:03
@amitgalitz
Member

Thanks for making these changes! A quick clarifying question:

I'm confident that this flakiness is due to a lot of setup work that is required in ML Commons and takes longer on the GitHub runners due to smaller numbers of threads available. Testing on a real cluster with ML Commons set up hasn't shown any issues.

When you say "testing on a real cluster with ml commons set up", do you mean a local multi node cluster with flow framework and ml commons and using the flow-framework APIs?

@dbwiddis
Member Author

"testing on a real cluster with ml commons set up", do you mean a local multi node cluster with flow framework and ml commons and using the flow-framework APIs?

No, a full EC2 cluster using OpenSearch CDK on nodes with more than 4 processors.

@dbwiddis
Member Author

Summary of specific things that happen on startup that this change works around:

  1. REST client initialization is done by a superclass of the RestApiIT in a @Before method. Starting tests immediately doesn't allow this to complete on the other nodes, resulting in one category of flaky tests. Waiting more than about 5 seconds before starting the tests seems to avoid this. It's fixed with both a single retry on makeRequest and a longer startup delay. I wouldn't normally include that retry in main code, but I think it's fine in a test helper class.
  2. The create connector call sometimes takes a long time due to external calls. The existing default 10s delay was insufficient, so I increased the default delay.
  3. The .ml-commons-config index isn't created until (1) the ML Commons plugin init starts a cron job and (2) the second cycle of that job runs, at 10s increments, meaning about 20 seconds after node startup. With very few processors available to run threads, executing the full RestApiIT consumes most of the processing and this init can take over 60 seconds. Added a one-time 25s delay before executing the tests to give ML Commons time to init (a rough sketch follows this list).
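
A rough sketch of the one-time delay described in (3), assuming a JUnit @Before hook somewhere in the IT setup; the field and method names are hypothetical, not the actual FlowFrameworkRestApiIT code.

    // Hypothetical one-time startup delay before the first test runs.
    private static boolean waitedForMlCommonsInit = false;

    @Before
    public void waitForMlCommonsInit() throws InterruptedException {
        if (!waitedForMlCommonsInit) {
            // ML Commons creates .ml-commons-config on the second 10s cron cycle (~20s after
            // node startup), so wait 25s once to give it headroom on slow runners.
            Thread.sleep(25_000);
            waitedForMlCommonsInit = true;
        }
    }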

owaiskazi19 (Member) left a comment


Thanks for digging into this @dbwiddis! LGTM overall, with a few questions.

Signed-off-by: Daniel Widdis <widdis@gmail.com>
Signed-off-by: Daniel Widdis <widdis@gmail.com>
Signed-off-by: Daniel Widdis <widdis@gmail.com>
@dbwiddis
Member Author

Still getting failed "ml config index exists" errors even with this delay. I'm not sure what else we can do.

Tests with failures:
 - org.opensearch.flowframework.rest.FlowFrameworkRestApiIT.testFailedUpdateWorkflow
 - org.opensearch.flowframework.rest.FlowFrameworkRestApiIT.testCreateAndProvisionRemoteModelWorkflow
 - org.opensearch.flowframework.rest.FlowFrameworkRestApiIT.testCreateAndProvisionAgentFrameworkWorkflow

@joshpalis
Member

Seems the process node tests are leaking threads:

  2>    1) Thread[id=113, name=opensearch[org.opensearch.flowframework.workflow.ProcessNodeTests][[timer]], state=TIMED_WAITING, group=TGRP-ProcessNodeTests]
  2>    2) Thread[id=114, name=opensearch[org.opensearch.flowframework.workflow.ProcessNodeTests][opensearch_provision_workflow][T#1], state=WAITING, group=TGRP-ProcessNodeTests]
  2>    3) Thread[id=116, name=opensearch[org.opensearch.flowframework.workflow.ProcessNodeTests][opensearch_provision_workflow][T#2], state=WAITING, group=TGRP-ProcessNodeTests]
  2>    4) Thread[id=115, name=opensearch[org.opensearch.flowframework.workflow.ProcessNodeTests][scheduler][T#1], state=WAITING, group=TGRP-ProcessNodeTests]
  2>    5) Thread[id=112, name=TEST-ProcessNodeTests.testNode-seed#[2316B57D3BD543DC], state=WAITING, group=TGRP-ProcessNodeTests]

Signed-off-by: Daniel Widdis <widdis@gmail.com>
@dbwiddis
Member Author

Seems the process node tests are leaking threads:

  2>    1) Thread[id=113, name=opensearch[org.opensearch.flowframework.workflow.ProcessNodeTests][[timer]], state=TIMED_WAITING, group=TGRP-ProcessNodeTests]
  2>    2) Thread[id=114, name=opensearch[org.opensearch.flowframework.workflow.ProcessNodeTests][opensearch_provision_workflow][T#1], state=WAITING, group=TGRP-ProcessNodeTests]
  2>    3) Thread[id=116, name=opensearch[org.opensearch.flowframework.workflow.ProcessNodeTests][opensearch_provision_workflow][T#2], state=WAITING, group=TGRP-ProcessNodeTests]
  2>    4) Thread[id=115, name=opensearch[org.opensearch.flowframework.workflow.ProcessNodeTests][scheduler][T#1], state=WAITING, group=TGRP-ProcessNodeTests]
  2>    5) Thread[id=112, name=TEST-ProcessNodeTests.testNode-seed#[2316B57D3BD543DC], state=WAITING, group=TGRP-ProcessNodeTests]

Yeah, saw that. The "scheduler" thread would have interrupted that...

Not sure whether that's really a problem or just where the code happened to be when the whole test suite timed out for some other reason.

In any case, I saw a lot of simplification we could do in ProcessNode after switching to an action future, and put that in (a rough sketch of the idea is below).
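
A rough sketch of the action-future idea, assuming OpenSearch's PlainActionFuture and hypothetical step, pool, and timeout names; this is not the actual ProcessNode code.

    // The worker completes the future; the caller blocks with a timeout, so no separate
    // timer thread is needed to cancel a stuck node.
    PlainActionFuture<WorkflowData> future = PlainActionFuture.newFuture();
    threadPool.executor(PROVISION_WORKFLOW_THREAD_POOL).execute(() -> {
        try {
            future.onResponse(step.execute(input));   // hypothetical step API
        } catch (Exception e) {
            future.onFailure(e);
        }
    });
    // Throws if the step fails or the node timeout elapses.
    WorkflowData result = future.actionGet(nodeTimeout);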

Signed-off-by: Daniel Widdis <widdis@gmail.com>
@dbwiddis
Member Author

Tested locally with a thread pool max of 1 and the test failed because provisioning took too long, so I also increased the minimum "max" scale for the pools even with a small number of processors. That seemed to make the tests go a lot faster, too (sketch below).
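
Sketched, the sizing idea reads roughly like this (the floor of 4 is an assumption, not necessarily the merged value):

    // Even on 1-2 processor CI runners, let each pool scale beyond a single thread.
    int maxThreads = Math.max(4, OpenSearchExecutors.allocatedProcessors(settings));
    new ScalingExecutorBuilder("opensearch_provision_workflow", 1, maxThreads, TimeValue.timeValueMinutes(5));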

dbwiddis merged commit a812e51 into opensearch-project:main on Jan 30, 2024
29 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jan 30, 2024
* Add more thread pools

Signed-off-by: Daniel Widdis <widdis@gmail.com>

* Increase maxFailures to 4

Signed-off-by: Daniel Widdis <widdis@gmail.com>

* Wait before starting tests

Signed-off-by: Daniel Widdis <widdis@gmail.com>

* Improve ProcessNode timeout

Signed-off-by: Daniel Widdis <widdis@gmail.com>

* Increase minimum thread pool requirement

Signed-off-by: Daniel Widdis <widdis@gmail.com>

---------

Signed-off-by: Daniel Widdis <widdis@gmail.com>
(cherry picked from commit a812e51)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
dbwiddis deleted the more-thread-pools branch on January 30, 2024 at 03:18
dbwiddis pushed a commit that referenced this pull request Jan 30, 2024
Add more thread pools (#465)

* Add more thread pools
* Increase maxFailures to 4
* Wait before starting tests
* Improve ProcessNode timeout
* Increase minimum thread pool requirement

---------

(cherry picked from commit a812e51)

Signed-off-by: Daniel Widdis <widdis@gmail.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Labels
backport 2.x (backport PRs to 2.x branch), skip-changelog
4 participants