Conversation

@hehe7318 (Contributor)

Description of changes

Add integration test to verify the following fixes work correctly:

  • [F1] clustermgtd remains running after both update and rollback fail
  • [F2] cfn-hup does not enter endless loop after rollback to state older than 24h
  • [F3] dna.json files are cleaned up after update failure

Test scenario:

  1. Create cluster with 3 static compute nodes
  2. Inject cfn-signal failure on head node (simulating expired wait condition)
  3. Disable cfn-hup on CN1 before update (causes update to fail)
  4. Trigger cluster update (add new queue)
  5. Wait for CN2 to apply update, then disable its cfn-hup
  6. Update fails (CN1 didn't update), rollback fails (CN2 won't rollback)
  7. Verify: clustermgtd running, dna.json cleaned up, CN3 has correct config version, metadata_db.json updated, no cfn-hup endless loop
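The final verification step (step 7) could be sketched as a pure check over the observed post-rollback state. This is an illustrative sketch only; the dataclass, field names, and the restart-count threshold are assumptions, not the PR's actual code:

```python
# Hypothetical sketch of step 7: given the observed post-rollback state of the
# cluster, check every fix at once and report all failures together.
from dataclasses import dataclass


@dataclass
class PostRollbackState:
    clustermgtd_running: bool     # [F1]
    dna_json_paths: list          # [F3] leftover dna.json files found
    cn3_config_version: str       # healthy node's deployed config version
    expected_config_version: str  # version recorded before the update
    cfn_hup_restart_count: int    # restarts observed in a fixed window [F2]


def verify_fixes(state: PostRollbackState) -> list:
    """Return a list of human-readable failures (empty list means all fixes hold)."""
    failures = []
    if not state.clustermgtd_running:
        failures.append("[F1] clustermgtd not running after update+rollback failure")
    if state.dna_json_paths:
        failures.append(f"[F3] dna.json not cleaned up: {state.dna_json_paths}")
    if state.cn3_config_version != state.expected_config_version:
        failures.append("CN3 does not have the pre-update config version")
    if state.cfn_hup_restart_count > 2:  # a couple of restarts is normal; a loop is not
        failures.append("[F2] cfn-hup appears to be in an endless restart loop")
    return failures
```

Collecting all failures rather than asserting one by one makes a failed run report every broken fix at once.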

Tests

  • Running and debugging

References

  • Link to impacted open issues.
  • Link to related PRs in other packages (i.e. cookbook, node).
  • Link to documentation useful to understand the changes.

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

… and rollback failure

Add integration test to verify the following fixes work correctly:
- [F1] clustermgtd remains running after both update and rollback fail
- [F2] cfn-hup does not enter endless loop after rollback to state older than 24h
- [F3] dna.json files are cleaned up after update failure

Test scenario:
1. Create cluster with 3 static compute nodes
2. Inject cfn-signal failure on head node (simulating expired wait condition)
3. Disable cfn-hup on CN1 before update (causes update to fail)
4. Trigger cluster update (add new queue)
5. Wait for CN2 to apply update, then disable its cfn-hup
6. Update fails (CN1 didn't update), rollback fails (CN2 won't rollback)
7. Verify: clustermgtd running, dna.json cleaned up, CN3 has correct
   config version, metadata_db.json updated, no cfn-hup endless loop
@hehe7318 hehe7318 requested review from a team as code owners December 15, 2025 03:07
@hehe7318 hehe7318 added the 3.x label Dec 15, 2025
- Fix an error: use supervisorctl instead of systemctl to stop cfn-hup on compute nodes
- Use srun to execute commands on compute nodes instead of SSM
- Add retry logic for clustermgtd running verification (10 min timeout)
- Only wait for UPDATE_ROLLBACK_COMPLETE state instead of multiple final states
@hehe7318 hehe7318 added the skip-changelog-update Disables the check that enforces changelog updates in PRs label Dec 15, 2025
@hehe7318 hehe7318 changed the title Add integration test for the fixes of issues caused by cluster update and rollback failure [Release-3.14.1][]Add integration test for the fixes of issues caused by cluster update and rollback failure Dec 15, 2025
@hehe7318 hehe7318 changed the title [Release-3.14.1][]Add integration test for the fixes of issues caused by cluster update and rollback failure [Release-3.14.1][Test] Add integration test for the fixes of issues caused by cluster update and rollback failure Dec 15, 2025
supervisorctl status returns exit code 3 when a process is in STOPPED
state. Use raise_on_error=False when verifying cfn-hup is stopped.
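The exit-code handling this commit describes could look roughly like the following sketch, with a `runner` callable standing in for the test framework's remote command executor (this is an assumption about the helper's shape, not the PR's actual code):

```python
# Illustrative helper for interpreting `supervisorctl status <prog>` when the
# expected state is STOPPED. Exit code 3 is not an error here: supervisorctl
# uses it to signal that a process is not running.
def is_process_stopped(runner, program: str) -> bool:
    exit_code, stdout = runner(f"supervisorctl status {program}")
    if exit_code not in (0, 3):  # anything else is a real failure
        raise RuntimeError(f"supervisorctl failed with exit code {exit_code}: {stdout}")
    return "STOPPED" in stdout


# Example with a fake runner simulating a stopped cfn-hup:
fake = lambda cmd: (3, "cfn-hup    STOPPED   Dec 15 03:07 AM")
assert is_process_stopped(fake, "cfn-hup")
```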
instances: {{ common.INSTANCES_DEFAULT_X86 }}
oss: [{{ OS_X86_3 }}]
schedulers: ["slurm"]
test_update_rollback_fixes.py::test_update_rollback_fixes:
Contributor:

What about naming the test test_update_rollback_failure?

Contributor Author:

Makes sense!

Scheduling:
Scheduler: slurm
SlurmSettings:
ScaledownIdletime: 60
gmarciani (Contributor), Dec 15, 2025:

Here and in other parts of the config.
Is this meaningful for the test? If it is, why? If not, let's remove it and use the default value.

Contributor Author:

Thanks for catching the test configuration file issues. I copy-pasted these configs from some test_slurm test configs; I wanted to check them before pushing the changes to GitHub but forgot. :blush:

Removing

Contributor Author:

Done

ComputeSettings:
LocalStorage:
RootVolume:
Size: 45
gmarciani (Contributor), Dec 15, 2025:

Here and in other parts of the config.
Is this meaningful for the test? If it is, why? If not, let's remove it and use the default value.

Contributor Author:

Removed!

KeyName: {{ key_name }}
Iam:
AdditionalIamPolicies:
- Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
gmarciani (Contributor), Dec 15, 2025:

Here and in other parts of the config.
Adding this policy is not required as it is automatically injected by our test framework.
See https://github.com/aws/aws-parallelcluster/blob/release-3.14/tests/integration-tests/conftest.py#L858-L866

Contributor Author:

Removed!

Networking:
SubnetIds:
- {{ private_subnet_id }}
Monitoring:
Contributor:

Here and in the other config files, why do we need to explicitly set this to true?
If there is no valid reason, let's remove this param and rely on the default value, which is true.

Contributor Author:

Removed!

This test validates the following fixes:
- [F1] clustermgtd remains running after both update and rollback fail
- [F2] cfn-hup does not enter an endless loop after rollback to a state older than 24h
- [F3] dna.json files are cleaned up after update failure
Contributor:

[minor] Worth mentioning: update and rollback failure

Contributor Author:

Done

Integration tests for verifying fixes related to cluster update rollback scenarios.
This test validates the following fixes:
- [F1] clustermgtd remains running after both update and rollback fail
Contributor:

[minor] Worth mentioning: we expect it to be running if the update and rollback fail in the section of the update that we consider safe, which is after the slurm reconfiguration

Contributor Author:

added!

9. Verify fixes:
- clustermgtd is running
- dna.json files are deleted
- CN3 (healthy node) has correct config version (source config before update)
Contributor:

Let's also capture what we expect for the unhealthy nodes.

Contributor Author:

Added.

scheduler_commands_factory,
):
"""
Test that cluster update rollback fixes work correctly.
Contributor:

This is a repetition of what we documented in the comment above. Let's consolidate the comments into one comprehensive description to be put only here.

Contributor Author:

done


# Get compute node hostnames
compute_nodes = slurm_commands.get_compute_nodes()
assert_that(len(compute_nodes)).is_greater_than_or_equal_to(3)
Contributor:

Why is this assertion required if we already have _wait_for_static_nodes_ready(slurm_commands, expected_count=3) above?
If not required, let's remove it

Contributor Author:

Removed

logger.info(f"CN3: {cn3} -> {cn3_instance_id}")

# Get initial config version from DynamoDB (dna.json is cleaned up after successful create)
initial_config_version = _get_config_version_from_ddb(region, cluster.name, cn3_instance_id)
Contributor:

Why retrieve it from the DDB record of a specific compute node rather than from the head node?

Contributor Author:

What do you mean? I didn't get it. dna.json is cleaned up on the HeadNode after cluster create; we cannot get the config version from it.

Contributor Author:

Oh, I got you: you mean getting the record from DDB, but with the HeadNode ID. What's the difference? The goal of the test is to verify whether the compute node (CN3) has the correct configuration version after the rollback. So I think that if we directly get the baseline value from the DDB record with the CN3 ID, and then verify it remains unchanged, the test logic is clearer.
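For illustration, parsing the kind of typed DynamoDB item this discussion refers to might look like the following. The key and attribute names here are assumptions about the record shape, not the actual ParallelCluster schema; a boto3 `get_item` call against the cluster table would return an item of this general form:

```python
# Hedged sketch: extract a node's deployed config version from a typed
# DynamoDB item. DynamoDB returns typed attribute values, e.g. {"S": "abc123"}.
def config_version_from_item(item: dict) -> str:
    return item["Data"]["M"]["cluster_config_version"]["S"]


# Illustrative item keyed by a compute node's instance ID (shape is assumed):
sample_item = {
    "Id": {"S": "CLUSTER_CONFIG.i-0123456789abcdef0"},
    "Data": {"M": {"cluster_config_version": {"S": "v1-abc123"}}},
}
assert config_version_from_item(sample_item) == "v1-abc123"
```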


# Get the target config version from the update config file
# We'll use this to verify CN2 has applied the update before disabling its cfn-hup
cluster.update(str(updated_config_file), wait=False, raise_on_error=False, log_error=False)
Contributor:

Why log_error=False? Even if we want to ignore the failure because it is expected, it could be useful to emit the error that triggers the expected failure.

Contributor Author:

Done, removed, thank you for pointing this out

region, cluster.name, cn2_instance_id, initial_config_version, timeout_minutes=15
)

logger.info(f"CN2 has applied the update. Disabling cfn-hup on CN2 ({cn2})...")
Contributor:

Let's specify in this log line that we are doing this to inject a rollback failure. That way, the goal will be clearer when we go through the test logs.

Contributor Author:

Done


# Wait for stack to reach UPDATE_ROLLBACK_COMPLETE state
logger.info("Waiting for stack to reach UPDATE_ROLLBACK_COMPLETE...")
final_status = _wait_for_stack_rollback_complete(cluster, region)
Contributor:

The stack reaches the UPDATE_ROLLBACK_COMPLETE state before the actual rollback completes on the head node. This is a known limitation of our rollback. We should account for this before executing the assertions below; otherwise we risk false positives.

Contributor Author:

Makes a lot of sense. I added new logic that checks the last line of the chef-client log to ensure Chef finished.
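The guard the author describes could be sketched like this; the completion-marker strings are assumptions about what chef-client writes and are not verified against the actual log format:

```python
# Sketch of the "rollback actually finished on the head node" guard: the CFN
# stack can reach UPDATE_ROLLBACK_COMPLETE before chef-client finishes, so
# inspect the tail of the chef-client log for a completion marker before
# running assertions. Marker strings below are illustrative assumptions.
COMPLETION_MARKERS = ("Chef Client finished", "Infra Phase complete")


def chef_run_finished(log_tail: str) -> bool:
    # Look only at the last few non-empty lines of the log.
    last_lines = [ln for ln in log_tail.splitlines() if ln.strip()][-5:]
    return any(marker in ln for ln in last_lines for marker in COMPLETION_MARKERS)
```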

logger.info(f"cfn-hup stopped on {node_name} ✓")


def _wait_for_stack_rollback_complete(cluster, region, timeout_minutes=60):
Contributor:

Why do we need to nest functions here? If the only reason is to inject the timeout_minutes parameter, I suggest simplifying by removing the nested functions and using a static timeout that we think is reasonable.

Contributor Author:

Done

Comment on lines 310 to 316
result = remote_command_executor.run_remote_command(
"find /opt/parallelcluster -name supervisorctl -type f 2>/dev/null | head -1"
)
supervisorctl_path = result.stdout.strip()

if not supervisorctl_path:
supervisorctl_path = "/opt/parallelcluster/pyenv/versions/3.12.11/envs/cookbook_virtualenv/bin/supervisorctl"
Contributor:

This logic to call supervisorctl is duplicated in this test. Let's move it to a utility function.

Contributor Author:

Done
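A possible shape for the extracted utility, with a `runner` callable standing in for `remote_command_executor.run_remote_command` and the fallback path taken from the snippet above (a sketch; the actual utility in the PR may differ):

```python
# Locate supervisorctl once, falling back to the known cookbook virtualenv
# path when `find` returns nothing. The fallback path may differ across AMIs.
FALLBACK = "/opt/parallelcluster/pyenv/versions/3.12.11/envs/cookbook_virtualenv/bin/supervisorctl"


def get_supervisorctl_path(runner) -> str:
    stdout = runner("find /opt/parallelcluster -name supervisorctl -type f 2>/dev/null | head -1")
    return stdout.strip() or FALLBACK


# Fake runners illustrate both branches:
assert get_supervisorctl_path(lambda cmd: "\n") == FALLBACK
assert get_supervisorctl_path(lambda cmd: "/opt/pc/bin/supervisorctl\n") == "/opt/pc/bin/supervisorctl"
```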

"""Verify that metadata_db.json was updated (cfn-hup processed the change)."""
logger.info("Verifying metadata_db.json is updated...")

result = remote_command_executor.run_remote_command(
Contributor:

This is about checking the existence of a file on a node.
This is logic that, if it does not exist yet, would be helpful to other tests as well.
I suggest moving it to a utility function that can be used by other tests.

Contributor Author:

It's just two lines, and we have many ways to check whether a file exists; I don't think people will check util.py before writing their own check. Let's skip this.

"""Verify that dna.json files are cleaned up after update failure."""
logger.info("Verifying dna.json files are cleaned up...")

result = remote_command_executor.run_remote_command(
Contributor:

This is about checking the existence of files on a node.
This is logic that, if it does not exist yet, would be helpful to other tests as well.
I suggest moving it to a utility function that can be used by other tests.

Contributor Author:

It's just two lines, and we have many ways to check whether a file exists; I don't think people will check util.py before writing their own check. Let's skip this.

return _check_version()


def _verify_compute_node_config_version_in_ddb(region, cluster_name, instance_id, expected_version):
Contributor:

This is about checking the config version deployed to a cluster node.
This is logic that, if it does not exist yet, would be helpful to other tests as well.
I suggest moving it to a utility function that can be used by other tests in the future.

Contributor Author:

Makes sense. Done

assert_that(result.stdout.strip()).is_equal_to("exists")

# Also check the modification time is recent (within last hour)
result = remote_command_executor.run_remote_command("stat -c %Y /var/lib/cfn-hup/data/metadata_db.json")
Contributor:

This is about checking the modification time of a file on a node.
This is logic that, if it does not exist yet, would be helpful to other tests as well.
I suggest moving it to a utility function that can be used by other tests.

Contributor Author:

Done.
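Locally, a `get_file_mtime_age_seconds`-style helper could be as simple as the following; the real utility runs `stat -c %Y` on the remote node, while this sketch shows the same age computation against the local filesystem:

```python
import os
import time


def file_mtime_age_seconds(path: str) -> float:
    """Seconds elapsed since the file at `path` was last modified."""
    return time.time() - os.stat(path).st_mtime
```

A caller asserting "recently updated" would then compare the result to a threshold such as 600 seconds (10 minutes).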

…f cluster readiness check takes more than 10 mins
logger.info(f"cfn-hup stopped on {node_name} ✓")


def _wait_for_stack_rollback_complete(cluster, region, timeout_minutes=60):
Contributor:

I am a little bit concerned by the duration of the test. Can we shorten the 60-minute timeout?

Contributor Author:

Shorten to 30 mins

slurm_commands = SlurmCommands(remote_command_executor)

# Wait for all static nodes to be ready
_wait_for_static_nodes_ready(slurm_commands, expected_count=3)
Contributor:

We refer to 3 in multiple places. Can we define a variable, e.g. n_static_nodes = 3 and use it everywhere we need it, including the cluster config?

Contributor Author:

Thank you! Done!

Instances:
- InstanceType: c5.large
MinCount: 3
MaxCount: 5
Contributor:

Why 2 dynamic nodes? It seems that the test only needs the 3 static ones.

Contributor Author:

Done

- Rename test from test_update_rollback_fixes to test_update_rollback_failure
- Consolidate documentation into function docstring
- Add _get_supervisorctl_path utility function to reduce code duplication
- Simplify retry logic by using @Retry decorator directly on functions
- Add _wait_for_head_node_rollback_complete to ensure rollback recipe
  finishes before running assertions (CFN stack completes before head node)
- Remove redundant assertions and improve log messages
- Remove log_error=False
- Move verify_cluster_node_config_version_in_ddb to utils.py for reuse
- Add get_file_mtime_age_seconds utility function to utils.py
- Define N_STATIC_NODES constant and use template variable in configs
- Modify the timeout of rollback to 30 minutes
…f modification time

- Add 5-minute retry loop for metadata_db.json existence check since
  the file may be temporarily removed during cfn-hup update
- Reduce the check of modification time from 1 hour to 10 minutes
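The retry loop this commit describes might be sketched as follows, with a hypothetical `exists_check` callable wrapping the remote file-existence command (names and defaults are illustrative, not the PR's actual code):

```python
import time


def wait_for_file(exists_check, timeout_seconds=300, interval_seconds=10):
    """Poll until exists_check() is true or the timeout elapses.

    metadata_db.json can vanish briefly while cfn-hup rewrites it, so a
    single existence check risks a false negative; polling absorbs the gap.
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if exists_check():
            return True
        time.sleep(interval_seconds)
    return False
```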