[CI] ReactiveStorageIT testScaleWhileShrinking failing #122119

Closed
elasticsearchmachine opened this issue Feb 8, 2025 · 13 comments · Fixed by #123569
Labels
:Distributed Indexing/Distributed A catch all label for anything in the Distributed Indexing Area. Please avoid if you can. low-risk An open issue or test failure that is a low risk to future releases Team:Distributed Indexing Meta label for Distributed Indexing team >test-failure Triaged test failures from CI

Comments

@elasticsearchmachine
Collaborator

elasticsearchmachine commented Feb 8, 2025

Build Scans:

Reproduction Line:

gradlew ":x-pack:plugin:autoscaling:internalClusterTest" --tests "org.elasticsearch.xpack.autoscaling.storage.ReactiveStorageIT.testScaleWhileShrinking" -Dtests.seed=A8072C4149FB3248 -Dtests.locale=lg-UG -Dtests.timezone=America/Boa_Vista -Druntime.java=23

Applicable branches:
main

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

java.lang.IllegalStateException: Some shards are still open after the threadpool terminated. Something is leaking index readers or store references.

Issue Reasons:

  • [main] 3 failures in test testScaleWhileShrinking (0.4% fail rate in 761 executions)

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

@elasticsearchmachine elasticsearchmachine added :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >test-failure Triaged test failures from CI needs:risk Requires assignment of a risk label (low, medium, blocker) Team:Distributed Coordination Meta label for Distributed Coordination team labels Feb 8, 2025
@elasticsearchmachine
Collaborator Author

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@nicktindall nicktindall added medium-risk An open issue or test failure that is a medium risk to future releases and removed needs:risk Requires assignment of a risk label (low, medium, blocker) labels Feb 11, 2025
@nicktindall
Contributor

Marking this as medium because it's suggesting we have a resource leak.

@nicktindall nicktindall added low-risk An open issue or test failure that is a low risk to future releases and removed medium-risk An open issue or test failure that is a medium risk to future releases labels Feb 11, 2025
@nicktindall
Contributor

Updated to low; it appears to be a Windows thing.

@nicktindall nicktindall added :Core/Infra/Core Core issues without another label medium-risk An open issue or test failure that is a medium risk to future releases low-risk An open issue or test failure that is a low risk to future releases and removed :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) low-risk An open issue or test failure that is a low risk to future releases medium-risk An open issue or test failure that is a medium risk to future releases labels Feb 11, 2025
@elasticsearchmachine
Collaborator Author

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticsearchmachine elasticsearchmachine added Team:Core/Infra Meta label for core/infra team and removed Team:Distributed Coordination Meta label for Distributed Coordination team labels Feb 11, 2025
@nicktindall
Contributor

nicktindall commented Feb 11, 2025

Assigned to core-infra off the back of conversation on #121716 (comment)

Please feel free to send back if you think this is not the same root cause

@nicktindall
Contributor

Also includes

1> java.io.IOException: could not remove the following files (in the order of attempts):
  1>    C:\bk\x-pack\plugin\autoscaling\build\testrun\internalClusterTest\temp\org.elasticsearch.xpack.autoscaling.storage.ReactiveStorageIT_73BD10C01CD081CC-001\tempDir-003\node-1\indices\9XlEdfb6TKGpWBFATp0Knw\0\index\_2.cfs: java.nio.file.AccessDeniedException: C:\bk\x-pack\plugin\autoscaling\build\testrun\internalClusterTest\temp\org.elasticsearch.xpack.autoscaling.storage.ReactiveStorageIT_73BD10C01CD081CC-001\tempDir-003\node-1\indices\9XlEdfb6TKGpWBFATp0Knw\0\index\_2.cfs

in the logs

@ldematte
Contributor

Actually @nicktindall I'm sending this back, as we think the cleanup failure is a red herring. The root cause is that the node can't close because of reference leaks, which in turn causes test cleanup to fail because the node is still running.
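
Editor's sketch of the mechanism described above, assuming a reference-counted store: the store only really closes when the last reference is released, so a single leaked reference keeps it open after the owner has let go. The class and method names here are hypothetical and are not taken from the Elasticsearch code base.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a reference-counted shard store.
class RefCountedStoreSketch {
    // Starts at 1: the owner's own reference.
    private final AtomicInteger refs = new AtomicInteger(1);

    // A reader takes an extra reference while it uses the store.
    void incRef() {
        refs.incrementAndGet();
    }

    // Releases one reference; returns true only when this was the last one,
    // which is the point where the underlying files could actually be closed.
    boolean decRef() {
        return refs.decrementAndGet() == 0;
    }

    boolean isOpen() {
        return refs.get() > 0;
    }

    public static void main(String[] args) {
        RefCountedStoreSketch store = new RefCountedStoreSketch();
        store.incRef();                 // a reader acquires the store and never releases it
        boolean closed = store.decRef(); // the owner releases its reference at shutdown
        System.out.println("closed by owner: " + closed + ", still open: " + store.isOpen());
    }
}
```

In this sketch the owner's release does not close the store because the reader's reference was never given back, which matches the shape of the "still open" failure message, though not necessarily its actual cause.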

@ldematte ldematte added :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) and removed :Core/Infra/Core Core issues without another label Team:Core/Infra Meta label for core/infra team labels Feb 11, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Feb 11, 2025
@nicktindall nicktindall added medium-risk An open issue or test failure that is a medium risk to future releases and removed low-risk An open issue or test failure that is a low risk to future releases labels Feb 13, 2025
@nicktindall
Contributor

Bumped this one back up to medium risk as it might indicate a resource leak

@ywangd ywangd self-assigned this Feb 14, 2025
@ywangd ywangd added :Core/Infra/Core Core issues without another label and removed :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) labels Feb 14, 2025
@elasticsearchmachine elasticsearchmachine added Team:Core/Infra Meta label for core/infra team and removed Team:Distributed Coordination Meta label for Distributed Coordination team labels Feb 14, 2025
@ywangd
Member

ywangd commented Feb 14, 2025

This is the same issue as #121717. See here for the analysis for the core-infra label.

@ywangd ywangd added low-risk An open issue or test failure that is a low risk to future releases and removed medium-risk An open issue or test failure that is a medium risk to future releases labels Feb 14, 2025
@elasticsearchmachine
Collaborator Author

This has been muted on branch main

Mute Reasons:

  • [main] 3 failures in test testScaleWhileShrinking (0.4% fail rate in 761 executions)

Build Scans:

@ywangd ywangd removed their assignment Feb 16, 2025
@ldematte
Contributor

See #121717 (comment) for the reason behind the reassignment

@ldematte ldematte added :Distributed Indexing/Distributed A catch all label for anything in the Distributed Indexing Area. Please avoid if you can. and removed :Core/Infra/Core Core issues without another label labels Feb 18, 2025
@elasticsearchmachine elasticsearchmachine added Team:Distributed Indexing Meta label for Distributed Indexing team and removed Team:Core/Infra Meta label for core/infra team labels Feb 18, 2025
@elasticsearchmachine
Collaborator Author

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

@arteam
Contributor

arteam commented Feb 20, 2025

Following the discussion on #121717 about properly shutting down nodes

ywangd added a commit to ywangd/elasticsearch that referenced this issue Feb 27, 2025
When IndicesService is closed, a pending deletion may still be in
progress because of indices removed before IndicesService was closed. If
the deletion gets stuck for some reason, it can stall the node shutdown.
This PR aborts the pending deletion more promptly by not retrying after
IndicesService is closed.

Resolves: elastic#121717, elastic#121716, elastic#122119
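
Editor's sketch of the idea in the commit message above: a retry loop for a pending deletion that checks a "closed" flag before each attempt and stops retrying once the flag is set. This is a minimal illustration under stated assumptions, not the code from the PR; the class name, method names, and parameters are all hypothetical.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

// Hypothetical illustration; not the actual IndicesService implementation.
class PendingDeleteSketch {
    private final AtomicBoolean closed = new AtomicBoolean(false);

    // Marks the service as closed; in-flight retry loops stop at their next check.
    void close() {
        closed.set(true);
    }

    // Attempts the deletion up to maxAttempts times and returns the number of
    // attempts made. Stops early on success, on close(), or on interruption.
    int processPendingDeletes(BooleanSupplier tryDelete, int maxAttempts, long sleepMillis) {
        int attempts = 0;
        while (attempts < maxAttempts) {
            if (closed.get()) {
                break; // service closed: abort instead of retrying
            }
            attempts++;
            if (tryDelete.getAsBoolean()) {
                break; // deletion succeeded
            }
            try {
                Thread.sleep(sleepMillis); // back off before the next attempt
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt status
                break;
            }
        }
        return attempts;
    }

    public static void main(String[] args) {
        PendingDeleteSketch service = new PendingDeleteSketch();
        int[] calls = {0};
        // Simulates a deletion that never succeeds; the service is closed during the second attempt.
        int attempts = service.processPendingDeletes(() -> {
            if (++calls[0] == 2) {
                service.close();
            }
            return false;
        }, 10, 1L);
        System.out.println("attempts made: " + attempts);
    }
}
```

Without the flag check, a deletion that keeps failing would use every remaining attempt before returning, which is the kind of stall during shutdown that the commit message describes.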
ywangd added a commit to ywangd/elasticsearch that referenced this issue Feb 28, 2025
When IndicesService is closed, a pending deletion may still be in
progress because of indices removed before IndicesService was closed. If
the deletion gets stuck for some reason, it can stall the node shutdown.
This PR aborts the pending deletion more promptly by not retrying after
IndicesService is stopped.

Resolves: elastic#121717 Resolves: elastic#121716  Resolves: elastic#122119
(cherry picked from commit c7e7dbe)

# Conflicts:
#	muted-tests.yml
elasticsearchmachine pushed a commit that referenced this issue Feb 28, 2025
When IndicesService is closed, a pending deletion may still be in
progress because of indices removed before IndicesService was closed. If
the deletion gets stuck for some reason, it can stall the node shutdown.
This PR aborts the pending deletion more promptly by not retrying after
IndicesService is stopped.

Resolves: #121717 Resolves: #121716  Resolves: #122119
(cherry picked from commit c7e7dbe)

# Conflicts:
#	muted-tests.yml
5 participants