Conversation

@hehe7318 (Contributor)

Description of changes

Add integration test to verify the following fixes work correctly:

  • [F1] clustermgtd remains running after both update and rollback fail
  • [F2] cfn-hup does not enter endless loop after rollback to state older than 24h
  • [F3] dna.json files are cleaned up after update failure

Test scenario:

  1. Create cluster with 3 static compute nodes
  2. Inject cfn-signal failure on head node (simulating expired wait condition)
  3. Disable cfn-hup on CN1 before update (causes update to fail)
  4. Trigger cluster update (add new queue)
  5. Wait for CN2 to apply update, then disable its cfn-hup
  6. Update fails (CN1 didn't update), rollback fails (CN2 won't rollback)
  7. Verify: clustermgtd running, dna.json cleaned up, CN3 has correct config version, metadata_db.json updated, no cfn-hup endless loop
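The final verification step (step 7) could be sketched as a pure check over the observed post-rollback state. This is an illustrative sketch only; the dataclass, field names, and the restart-count threshold are assumptions, not the PR's actual code:

```python
# Hypothetical sketch of step 7: given the observed post-rollback state of the
# cluster, check every fix at once and report all failures together.
from dataclasses import dataclass


@dataclass
class PostRollbackState:
    clustermgtd_running: bool     # [F1]
    dna_json_paths: list          # [F3] leftover dna.json files found
    cn3_config_version: str       # healthy node's deployed config version
    expected_config_version: str  # version recorded before the update
    cfn_hup_restart_count: int    # restarts observed in a fixed window [F2]


def verify_fixes(state: PostRollbackState) -> list:
    """Return a list of human-readable failures (empty list means all fixes hold)."""
    failures = []
    if not state.clustermgtd_running:
        failures.append("[F1] clustermgtd not running after update+rollback failure")
    if state.dna_json_paths:
        failures.append(f"[F3] dna.json not cleaned up: {state.dna_json_paths}")
    if state.cn3_config_version != state.expected_config_version:
        failures.append("CN3 does not have the pre-update config version")
    if state.cfn_hup_restart_count > 2:  # a couple of restarts is normal; a loop is not
        failures.append("[F2] cfn-hup appears to be in an endless restart loop")
    return failures
```

Collecting all failures rather than asserting one by one makes a failed run report every broken fix at once.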

Tests

  • Running and debugging

References

  • Link to impacted open issues.
  • Link to related PRs in other packages (i.e. cookbook, node).
  • Link to documentation useful to understand the changes.

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

… and rollback failure

Add integration test to verify the following fixes work correctly:
- [F1] clustermgtd remains running after both update and rollback fail
- [F2] cfn-hup does not enter endless loop after rollback to state older than 24h
- [F3] dna.json files are cleaned up after update failure

Test scenario:
1. Create cluster with 3 static compute nodes
2. Inject cfn-signal failure on head node (simulating expired wait condition)
3. Disable cfn-hup on CN1 before update (causes update to fail)
4. Trigger cluster update (add new queue)
5. Wait for CN2 to apply update, then disable its cfn-hup
6. Update fails (CN1 didn't update), rollback fails (CN2 won't rollback)
7. Verify: clustermgtd running, dna.json cleaned up, CN3 has correct
   config version, metadata_db.json updated, no cfn-hup endless loop
@hehe7318 hehe7318 requested review from a team as code owners December 15, 2025 03:07
@hehe7318 hehe7318 added the 3.x label Dec 15, 2025
- Fix an error: use supervisorctl instead of systemctl to stop cfn-hup on compute nodes
- Use srun to execute commands on compute nodes instead of SSM
- Add retry logic for clustermgtd running verification (10 min timeout)
- Only wait for UPDATE_ROLLBACK_COMPLETE state instead of multiple final states
@hehe7318 hehe7318 added the skip-changelog-update Disables the check that enforces changelog updates in PRs label Dec 15, 2025
@hehe7318 hehe7318 changed the title Add integration test for the fixes of issues caused by cluster update and rollback failure [Release-3.14.1][]Add integration test for the fixes of issues caused by cluster update and rollback failure Dec 15, 2025
@hehe7318 hehe7318 changed the title [Release-3.14.1][]Add integration test for the fixes of issues caused by cluster update and rollback failure [Release-3.14.1][Test] Add integration test for the fixes of issues caused by cluster update and rollback failure Dec 15, 2025
supervisorctl status returns exit code 3 when a process is in STOPPED
state. Use raise_on_error=False when verifying cfn-hup is stopped.
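The exit-code handling this commit describes could look roughly like the following sketch, with a `runner` callable standing in for the test framework's remote command executor (this is an assumption about the helper's shape, not the PR's actual code):

```python
# Illustrative helper for interpreting `supervisorctl status <prog>` when the
# expected state is STOPPED. Exit code 3 is not an error here: supervisorctl
# uses it to signal that a process is not running.
def is_process_stopped(runner, program: str) -> bool:
    exit_code, stdout = runner(f"supervisorctl status {program}")
    if exit_code not in (0, 3):  # anything else is a real failure
        raise RuntimeError(f"supervisorctl failed with exit code {exit_code}: {stdout}")
    return "STOPPED" in stdout


# Example with a fake runner simulating a stopped cfn-hup:
fake = lambda cmd: (3, "cfn-hup    STOPPED   Dec 15 03:07 AM")
assert is_process_stopped(fake, "cfn-hup")
```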
instances: {{ common.INSTANCES_DEFAULT_X86 }}
oss: [{{ OS_X86_3 }}]
schedulers: ["slurm"]
test_update_rollback_fixes.py::test_update_rollback_fixes:
Contributor:

What about naming the test test_update_rollback_failure?

Contributor Author:

Makes sense!

Scheduling:
Scheduler: slurm
SlurmSettings:
ScaledownIdletime: 60
gmarciani (Contributor), Dec 15, 2025:

Here and in other parts of the config.
Is this meaningful for the test? If it is, why? If not, let's remove it and use the default value.

Contributor Author:

Thanks for catching the test configuration file issues. I copy-pasted these configs from some test_slurm test configs; I wanted to check them before pushing the changes to GitHub but forgot. :blush:

Removing

Contributor Author:

Done

ComputeSettings:
LocalStorage:
RootVolume:
Size: 45
gmarciani (Contributor), Dec 15, 2025:

Here and in other parts of the config.
Is this meaningful for the test? If it is, why? If not, let's remove it and use the default value.

Contributor Author:

Removed!

KeyName: {{ key_name }}
Iam:
AdditionalIamPolicies:
- Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
gmarciani (Contributor), Dec 15, 2025:

Here and in other parts of the config.
Adding this policy is not required as it is automatically injected by our test framework.
See https://github.com/aws/aws-parallelcluster/blob/release-3.14/tests/integration-tests/conftest.py#L858-L866

Contributor Author:

Removed!

Networking:
SubnetIds:
- {{ private_subnet_id }}
Monitoring:
Contributor:

Here and in the other config files, why do we need to explicitly set this to true?
If there is no valid reason, let's remove this param and rely on the default value, which is true.

Contributor Author:

Removed!

This test validates the following fixes:
- [F1] clustermgtd remains running after both update and rollback fail
- [F2] cfn-hup does not enter an endless loop after rollback to a state older than 24h
- [F3] dna.json files are cleaned up after update failure
Contributor:

[minor] Worth mentioning: update and rollback failure

Contributor Author:

Done

Integration tests for verifying fixes related to cluster update rollback scenarios.
This test validates the following fixes:
- [F1] clustermgtd remains running after both update and rollback fail
Contributor:

[minor] Worth mentioning: we expect it to be running if the update and rollback fail in the section of the update that we consider safe, which is after the slurm reconfiguration

Contributor Author:

added!

9. Verify fixes:
- clustermgtd is running
- dna.json files are deleted
- CN3 (healthy node) has correct config version (source config before update)
Contributor:

Let's also capture what we expect for the unhealthy nodes.

Contributor Author:

Added.

scheduler_commands_factory,
):
"""
Test that cluster update rollback fixes work correctly.
Contributor:

This is a repetition of what we documented in the comment above. Let's consolidate the comments into one comprehensive description to be put only here.

Contributor Author:

done


# Get compute node hostnames
compute_nodes = slurm_commands.get_compute_nodes()
assert_that(len(compute_nodes)).is_greater_than_or_equal_to(3)
Contributor:

Why is this assertion required if we already have _wait_for_static_nodes_ready(slurm_commands, expected_count=3) above?
If not required, let's remove it

Contributor Author:

Removed

logger.info(f"CN3: {cn3} -> {cn3_instance_id}")

# Get initial config version from DynamoDB (dna.json is cleaned up after successful create)
initial_config_version = _get_config_version_from_ddb(region, cluster.name, cn3_instance_id)
Contributor:

Why retrieve it from the DDB record of a specific compute node rather than from the head node?

Contributor Author:

What do you mean? I didn't get it. dna.json is cleaned up on the HeadNode after cluster create; we cannot get the config version from it.

Contributor Author:

Oh, I got you: you mean getting the record from DDB, but with the HeadNode ID. What's the difference? The goal of the test is to verify whether the compute node (CN3) has the correct configuration version after the rollback. So I think that if we directly get the baseline value from the DDB record with the CN3 ID, and then verify it remains unchanged, the test logic is clearer.
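For illustration, parsing the kind of typed DynamoDB item this discussion refers to might look like the following. The key and attribute names here are assumptions about the record shape, not the actual ParallelCluster schema; a boto3 `get_item` call against the cluster table would return an item of this general form:

```python
# Hedged sketch: extract a node's deployed config version from a typed
# DynamoDB item. DynamoDB returns typed attribute values, e.g. {"S": "abc123"}.
def config_version_from_item(item: dict) -> str:
    return item["Data"]["M"]["cluster_config_version"]["S"]


# Illustrative item keyed by a compute node's instance ID (shape is assumed):
sample_item = {
    "Id": {"S": "CLUSTER_CONFIG.i-0123456789abcdef0"},
    "Data": {"M": {"cluster_config_version": {"S": "v1-abc123"}}},
}
assert config_version_from_item(sample_item) == "v1-abc123"
```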


# Get the target config version from the update config file
# We'll use this to verify CN2 has applied the update before disabling its cfn-hup
cluster.update(str(updated_config_file), wait=False, raise_on_error=False, log_error=False)
Contributor:

Why log_error=False? Even if we want to ignore the failure because it is expected, it could be useful to emit the error that triggers the expected failure.

Contributor Author:

Done, removed, thank you for pointing this out

region, cluster.name, cn2_instance_id, initial_config_version, timeout_minutes=15
)

logger.info(f"CN2 has applied the update. Disabling cfn-hup on CN2 ({cn2})...")
Contributor:

Let's specify in this log line that we are doing this to inject a rollback failure. That way, the goal will be clearer when we go through the test logs.

Contributor Author:

Done


# Wait for stack to reach UPDATE_ROLLBACK_COMPLETE state
logger.info("Waiting for stack to reach UPDATE_ROLLBACK_COMPLETE...")
final_status = _wait_for_stack_rollback_complete(cluster, region)
Contributor:

The stack reaches the UPDATE_ROLLBACK_COMPLETE state before the actual rollback completes on the head node. This is a known limitation of our rollback. We should account for this before executing the assertions below; otherwise we risk false positives.

Contributor Author:

Makes a lot of sense. I added new logic that checks the last line of the chef-client log to ensure Chef finished.
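The guard the author describes could be sketched like this; the completion-marker strings are assumptions about what chef-client writes and are not verified against the actual log format:

```python
# Sketch of the "rollback actually finished on the head node" guard: the CFN
# stack can reach UPDATE_ROLLBACK_COMPLETE before chef-client finishes, so
# inspect the tail of the chef-client log for a completion marker before
# running assertions. Marker strings below are illustrative assumptions.
COMPLETION_MARKERS = ("Chef Client finished", "Infra Phase complete")


def chef_run_finished(log_tail: str) -> bool:
    # Look only at the last few non-empty lines of the log.
    last_lines = [ln for ln in log_tail.splitlines() if ln.strip()][-5:]
    return any(marker in ln for ln in last_lines for marker in COMPLETION_MARKERS)
```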

logger.info(f"cfn-hup stopped on {node_name} ✓")


def _wait_for_stack_rollback_complete(cluster, region, timeout_minutes=60):
Contributor:

Why do we need to nest functions here? If the only reason is to inject the timeout_minutes parameter, I suggest simplifying by removing the nested functions and using a static timeout that we think is reasonable.

Contributor Author:

Done

Comment on lines 310 to 316
result = remote_command_executor.run_remote_command(
"find /opt/parallelcluster -name supervisorctl -type f 2>/dev/null | head -1"
)
supervisorctl_path = result.stdout.strip()

if not supervisorctl_path:
supervisorctl_path = "/opt/parallelcluster/pyenv/versions/3.12.11/envs/cookbook_virtualenv/bin/supervisorctl"
Contributor:

This logic to call supervisorctl is duplicated in this test. Let's move it to a utility function.

Contributor Author:

Done
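A possible shape for the extracted utility, with a `runner` callable standing in for `remote_command_executor.run_remote_command` and the fallback path taken from the snippet above (a sketch; the actual utility in the PR may differ):

```python
# Locate supervisorctl once, falling back to the known cookbook virtualenv
# path when `find` returns nothing. The fallback path may differ across AMIs.
FALLBACK = "/opt/parallelcluster/pyenv/versions/3.12.11/envs/cookbook_virtualenv/bin/supervisorctl"


def get_supervisorctl_path(runner) -> str:
    stdout = runner("find /opt/parallelcluster -name supervisorctl -type f 2>/dev/null | head -1")
    return stdout.strip() or FALLBACK


# Fake runners illustrate both branches:
assert get_supervisorctl_path(lambda cmd: "\n") == FALLBACK
assert get_supervisorctl_path(lambda cmd: "/opt/pc/bin/supervisorctl\n") == "/opt/pc/bin/supervisorctl"
```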

"""Verify that metadata_db.json was updated (cfn-hup processed the change)."""
logger.info("Verifying metadata_db.json is updated...")

result = remote_command_executor.run_remote_command(
Contributor:

This is about checking the existence of a file on a node.
This is logic that, if it does not exist yet, would be helpful to other tests as well.
I suggest moving it to a utility function that can be used by other tests.

Contributor Author:

It's just two lines, and we have many ways to check whether a file exists; I don't think people will check util.py before writing their own check. Let's skip this.

"""Verify that dna.json files are cleaned up after update failure."""
logger.info("Verifying dna.json files are cleaned up...")

result = remote_command_executor.run_remote_command(
Contributor:

This is about checking the existence of files on a node.
This is logic that, if it does not exist yet, would be helpful to other tests as well.
I suggest moving it to a utility function that can be used by other tests.

Contributor Author:

It's just two lines, and we have many ways to check whether a file exists; I don't think people will check util.py before writing their own check. Let's skip this.

return _check_version()


def _verify_compute_node_config_version_in_ddb(region, cluster_name, instance_id, expected_version):
Contributor:

This is about checking the config version deployed to a cluster node.
This is logic that, if it does not exist yet, would be helpful to other tests as well.
I suggest moving it to a utility function that can be used by other tests in the future.

Contributor Author:

Makes sense. Done

assert_that(result.stdout.strip()).is_equal_to("exists")

# Also check the modification time is recent (within last hour)
result = remote_command_executor.run_remote_command("stat -c %Y /var/lib/cfn-hup/data/metadata_db.json")
Contributor:

This is about checking the modification time of a file on a node.
This is logic that, if it does not exist yet, would be helpful to other tests as well.
I suggest moving it to a utility function that can be used by other tests.

Contributor Author:

Done.
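Locally, a `get_file_mtime_age_seconds`-style helper could be as simple as the following; the real utility runs `stat -c %Y` on the remote node, while this sketch shows the same age computation against the local filesystem:

```python
import os
import time


def file_mtime_age_seconds(path: str) -> float:
    """Seconds elapsed since the file at `path` was last modified."""
    return time.time() - os.stat(path).st_mtime
```

A caller asserting "recently updated" would then compare the result to a threshold such as 600 seconds (10 minutes).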

…f cluster readiness check takes more than 10 mins
logger.info(f"cfn-hup stopped on {node_name} ✓")


def _wait_for_stack_rollback_complete(cluster, region, timeout_minutes=60):
Contributor:

I am a little bit concerned by the duration of the test. Can we shorten the 60-minute timeout?

Contributor Author:

Shorten to 30 mins

slurm_commands = SlurmCommands(remote_command_executor)

# Wait for all static nodes to be ready
_wait_for_static_nodes_ready(slurm_commands, expected_count=3)
Contributor:

We refer to 3 in multiple places. Can we define a variable, e.g. n_static_nodes = 3 and use it everywhere we need it, including the cluster config?

Contributor Author:

Thank you! Done!

Instances:
- InstanceType: c5.large
MinCount: 3
MaxCount: 5
Contributor:

Why 2 dynamic nodes? It seems that the test only needs the 3 static ones.

Contributor Author:

Done

- Rename test from test_update_rollback_fixes to test_update_rollback_failure
- Consolidate documentation into function docstring
- Add _get_supervisorctl_path utility function to reduce code duplication
- Simplify retry logic by using @Retry decorator directly on functions
- Add _wait_for_head_node_rollback_complete to ensure rollback recipe
  finishes before running assertions (CFN stack completes before head node)
- Remove redundant assertions and improve log messages
- Remove log_error=False
- Move verify_cluster_node_config_version_in_ddb to utils.py for reuse
- Add get_file_mtime_age_seconds utility function to utils.py
- Define N_STATIC_NODES constant and use template variable in configs
- Modify the timeout of rollback to 30 minutes
…f modification time

- Add 5-minute retry loop for metadata_db.json existence check since
  the file may be temporarily removed during cfn-hup update
- Reduce the check of modification time from 1 hour to 10 minutes
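The retry loop this commit describes might be sketched as follows, with a hypothetical `exists_check` callable wrapping the remote file-existence command (names and defaults are illustrative, not the PR's actual code):

```python
import time


def wait_for_file(exists_check, timeout_seconds=300, interval_seconds=10):
    """Poll until exists_check() is true or the timeout elapses.

    metadata_db.json can vanish briefly while cfn-hup rewrites it, so a
    single existence check risks a false negative; polling absorbs the gap.
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if exists_check():
            return True
        time.sleep(interval_seconds)
    return False
```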