[Release-3.14.1][Test] Add integration test for the fixes of issues caused by cluster update and rollback failure #7150
Conversation
… and rollback failure

Add integration test to verify the following fixes work correctly:
- [F1] clustermgtd remains running after both update and rollback fail
- [F2] cfn-hup does not enter an endless loop after rollback to a state older than 24h
- [F3] dna.json files are cleaned up after update failure

Test scenario:
1. Create cluster with 3 static compute nodes
2. Inject cfn-signal failure on the head node (simulating an expired wait condition)
3. Disable cfn-hup on CN1 before the update (causes the update to fail)
4. Trigger cluster update (add a new queue)
5. Wait for CN2 to apply the update, then disable its cfn-hup
6. Update fails (CN1 didn't update), rollback fails (CN2 won't roll back)
7. Verify: clustermgtd running, dna.json cleaned up, CN3 has correct config version, metadata_db.json updated, no cfn-hup endless loop
- Fix an error: use supervisorctl instead of systemctl to stop cfn-hup on compute nodes
- Use srun to execute commands on compute nodes instead of SSM
- Add retry logic for the clustermgtd running verification (10-minute timeout)
- Only wait for the UPDATE_ROLLBACK_COMPLETE state instead of multiple final states
supervisorctl status returns exit code 3 when a process is in STOPPED state. Use raise_on_error=False when verifying cfn-hup is stopped.
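The exit-code behaviour described above is why the test cannot rely on the command succeeding. A minimal sketch (the helper name `interpret_cfn_hup_status` is hypothetical, not part of the test framework): supervisorctl exits non-zero when a queried process is not RUNNING, so the test passes raise_on_error=False and inspects stdout instead.

```python
# Hypothetical helper illustrating why the test uses raise_on_error=False:
# `supervisorctl status` exits non-zero (commonly 3) when a process is STOPPED,
# so the exit code alone cannot distinguish "stopped as expected" from an error.

def interpret_cfn_hup_status(exit_code: int, stdout: str) -> str:
    """Map a `supervisorctl status cfn-hup` result to a coarse process state."""
    if exit_code == 0 and "RUNNING" in stdout:
        return "running"
    if exit_code == 3 and "STOPPED" in stdout:
        # Expected state after the test disables cfn-hup on a compute node.
        return "stopped"
    return "unknown"


if __name__ == "__main__":
    print(interpret_cfn_hup_status(3, "cfn-hup    STOPPED    Not started"))
```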
instances: {{ common.INSTANCES_DEFAULT_X86 }}
oss: [{{ OS_X86_3 }}]
schedulers: ["slurm"]
test_update_rollback_fixes.py::test_update_rollback_fixes:
What about naming the test test_update_rollback_failure?
Makes sense!
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    ScaledownIdletime: 60
Here and in other parts of the config.
Is this meaningful for the test? If it is, why? If not, let's remove it and use the default value.
Thanks for catching the test configuration file issues. I copy-pasted these configs from some test_slurm test configs; I wanted to check them before pushing the changes to GH but forgot. :blush:
Removing
Done
ComputeSettings:
  LocalStorage:
    RootVolume:
      Size: 45
Here and in other parts of the config.
Is this meaningful for the test? If it is, why? If not, let's remove it and use the default value.
Removed!
KeyName: {{ key_name }}
Iam:
  AdditionalIamPolicies:
    - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
Here and in other parts of the config.
Adding this policy is not required as it is automatically injected by our test framework.
See https://github.com/aws/aws-parallelcluster/blob/release-3.14/tests/integration-tests/conftest.py#L858-L866
Removed!
Networking:
  SubnetIds:
    - {{ private_subnet_id }}
Monitoring:
Here and in the other config files: why do we need to explicitly set this to true?
If there is no valid reason, let's remove these params and rely on the default value, which is true.
Removed!
This test validates the following fixes:
- [F1] clustermgtd remains running after both update and rollback fail
- [F2] cfn-hup does not enter an endless loop after rollback to a state older than 24h
- [F3] dna.json files are cleaned up after update failure
[minor] Worth mentioning: update and rollback failure
Done
Integration tests for verifying fixes related to cluster update rollback scenarios.
This test validates the following fixes:
- [F1] clustermgtd remains running after both update and rollback fail
[minor] Worth mentioning: we expect it to be running if the update and rollback fail in the section of the update that we consider safe, which is after the slurm reconfiguration
added!
9. Verify fixes:
   - clustermgtd is running
   - dna.json files are deleted
   - CN3 (healthy node) has correct config version (source config before update)
Let's also capture what we expect for the unhealthy nodes.
Added.
    scheduler_commands_factory,
):
    """
    Test that cluster update rollback fixes work correctly.
This is a repetition of what we documented in the comment above. Let's consolidate the comments into one comprehensive description to be put only here.
done
# Get compute node hostnames
compute_nodes = slurm_commands.get_compute_nodes()
assert_that(len(compute_nodes)).is_greater_than_or_equal_to(3)
Why is this assertion required if we already have _wait_for_static_nodes_ready(slurm_commands, expected_count=3) above?
If not required, let's remove it
Removed
logger.info(f"CN3: {cn3} -> {cn3_instance_id}")

# Get initial config version from DynamoDB (dna.json is cleaned up after successful create)
initial_config_version = _get_config_version_from_ddb(region, cluster.name, cn3_instance_id)
Why retrieve it from the DDB record of a specific compute node rather than from the head node?
What do you mean? I didn't get it. dna.json is cleaned up on the HeadNode after cluster create; we cannot get the config version from it.
Oh, I got you: you mean get the record from DDB, but with the HeadNode ID. What's the difference? The goal of the test is to verify whether the compute node (CN3) has the correct configuration version after the rollback. So I think directly getting the baseline value from the DDB record with the CN3 ID, and then verifying it remains unchanged, makes the test logic clearer.
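The baseline read being discussed can be sketched as a pure extraction step over a fetched DynamoDB item. This is illustrative only: the item shape below ({"Data": {"cluster_config_version": ...}}) is an assumption, not the confirmed ParallelCluster DDB schema, and the fetch itself would go through boto3.

```python
# Hedged sketch: the item shape is an assumption for illustration; the real
# schema must be taken from the ParallelCluster source.

def extract_config_version(ddb_item: dict) -> str:
    """Pull the deployed config version out of a DynamoDB item for a node.

    The item would be fetched per instance id, e.g. with boto3 (sketch only):
        table.get_item(Key={"Id": instance_id}).get("Item", {})
    """
    return ddb_item.get("Data", {}).get("cluster_config_version", "")


if __name__ == "__main__":
    baseline = extract_config_version({"Data": {"cluster_config_version": "v-001"}})
    print(baseline)
```

The test would capture this value for CN3 before the update and assert it is unchanged after the failed rollback.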
# Get the target config version from the update config file
# We'll use this to verify CN2 has applied the update before disabling its cfn-hup
cluster.update(str(updated_config_file), wait=False, raise_on_error=False, log_error=False)
Why log_error=False? Even if we want to ignore the failure because the failure is expected, it could be useful to emit the error that triggers the expected failure.
Done, removed. Thank you for pointing this out.
    region, cluster.name, cn2_instance_id, initial_config_version, timeout_minutes=15
)

logger.info(f"CN2 has applied the update. Disabling cfn-hup on CN2 ({cn2})...")
Let's specify in this log line that we are doing this to inject the rollback failure. That way, the goal will be clearer when we go through the test logs.
Done
# Wait for stack to reach UPDATE_ROLLBACK_COMPLETE state
logger.info("Waiting for stack to reach UPDATE_ROLLBACK_COMPLETE...")
final_status = _wait_for_stack_rollback_complete(cluster, region)
The stack reaches the UPDATE_ROLLBACK_COMPLETE state before the actual rollback completes on the head node. This is a known limitation of our rollback. We should account for this before executing the assertions below, otherwise we risk false positives.
Makes a lot of sense. I added new logic that checks the last line of the chef-client log to ensure chef finished.
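The "wait for the head node to really finish" check could look like the sketch below. The completion markers chef-client writes at the end of a run are an assumption here and should be confirmed against the actual log on the AMI; the helper name is illustrative.

```python
# Hedged sketch of the head-node rollback completion check. The markers below
# are assumptions about the chef-client log format, not confirmed values.

COMPLETION_MARKERS = ("Chef Infra Client finished", "Chef Client finished")


def head_node_rollback_finished(chef_log_tail: str) -> bool:
    """Return True if the last non-empty chef-client log line marks a finished run."""
    lines = [line for line in chef_log_tail.splitlines() if line.strip()]
    if not lines:
        return False
    return any(marker in lines[-1] for marker in COMPLETION_MARKERS)
```

In the test this would be polled (e.g. via the remote command executor tailing the log) after CloudFormation reports UPDATE_ROLLBACK_COMPLETE, before running the assertions.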
logger.info(f"cfn-hup stopped on {node_name} ✓")

def _wait_for_stack_rollback_complete(cluster, region, timeout_minutes=60):
Why do we need to nest functions here? If the only reason is to inject the timeout_minutes parameter, I suggest simplifying by removing the nested functions and using a static timeout that we think is reasonable.
Done
result = remote_command_executor.run_remote_command(
    "find /opt/parallelcluster -name supervisorctl -type f 2>/dev/null | head -1"
)
supervisorctl_path = result.stdout.strip()

if not supervisorctl_path:
    supervisorctl_path = "/opt/parallelcluster/pyenv/versions/3.12.11/envs/cookbook_virtualenv/bin/supervisorctl"
This logic to call supervisorctl is duplicated in this test. Let's move it to a utility function.
Done
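The extracted utility could reduce to a small pure function over the `find` output, as in this sketch. `_get_supervisorctl_path` is the name mentioned in this PR; the exact body below is illustrative.

```python
# Hedged sketch of the de-duplicated supervisorctl path resolution. The
# fallback path is taken from the code quoted above; pinning a pyenv version
# in it is fragile, which is part of why `find` is tried first.

FALLBACK_SUPERVISORCTL = (
    "/opt/parallelcluster/pyenv/versions/3.12.11/envs/cookbook_virtualenv/bin/supervisorctl"
)


def resolve_supervisorctl_path(find_output: str, fallback: str = FALLBACK_SUPERVISORCTL) -> str:
    """Pick the first path from `find` output, falling back to a known location.

    `find_output` is expected to be the stdout of:
        find /opt/parallelcluster -name supervisorctl -type f 2>/dev/null | head -1
    """
    stripped = find_output.strip()
    first_line = stripped.splitlines()[0].strip() if stripped else ""
    return first_line or fallback
```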
"""Verify that metadata_db.json was updated (cfn-hup processed the change)."""
logger.info("Verifying metadata_db.json is updated...")

result = remote_command_executor.run_remote_command(
This is about checking the existence of a file on a node.
If a helper for this does not exist yet, it would be useful to other tests as well.
I suggest moving it to a utility function that can be used by other tests.
It's just two lines, and there are many ways to check whether a file exists; I don't think people will check util.py first before writing their own check. Let's skip this.
"""Verify that dna.json files are cleaned up after update failure."""
logger.info("Verifying dna.json files are cleaned up...")

result = remote_command_executor.run_remote_command(
This is about checking the existence of files on a node.
If a helper for this does not exist yet, it would be useful to other tests as well.
I suggest moving it to a utility function that can be used by other tests.
It's just two lines, and there are many ways to check whether a file exists; I don't think people will check util.py first before writing their own check. Let's skip this.
return _check_version()

def _verify_compute_node_config_version_in_ddb(region, cluster_name, instance_id, expected_version):
This is about checking the config version deployed to a cluster node.
If a helper for this does not exist yet, it would be useful to other tests as well.
I suggest moving it to a utility function that can be used by other tests in the future.
Makes sense. Done
assert_that(result.stdout.strip()).is_equal_to("exists")

# Also check the modification time is recent (within last hour)
result = remote_command_executor.run_remote_command("stat -c %Y /var/lib/cfn-hup/data/metadata_db.json")
This is about checking the modification time of a file on a node.
If a helper for this does not exist yet, it would be useful to other tests as well.
I suggest moving it to a utility function that can be used by other tests.
Done.
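A local-filesystem sketch of the utility discussed here (the PR later names it get_file_mtime_age_seconds; the body below is illustrative). On a cluster node the same information would come from `stat -c %Y <path>` over the remote command executor rather than a local os.stat.

```python
import os
import time


def get_file_mtime_age_seconds(path: str) -> float:
    """Age of a file in seconds, based on its modification time.

    Local sketch only: in the integration test the mtime is read remotely
    via `stat -c %Y <path>` and compared against the node's current time.
    """
    return time.time() - os.stat(path).st_mtime
```

The metadata_db.json freshness check would then be an assertion like `get_file_mtime_age_seconds(...) < 600` for a 10-minute window.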
…f cluster readiness check takes more than 10 mins
logger.info(f"cfn-hup stopped on {node_name} ✓")

def _wait_for_stack_rollback_complete(cluster, region, timeout_minutes=60):
I am a little bit concerned by the duration of the test. Can we shorten the 60-minute timeout?
Shortened to 30 minutes
slurm_commands = SlurmCommands(remote_command_executor)

# Wait for all static nodes to be ready
_wait_for_static_nodes_ready(slurm_commands, expected_count=3)
We refer to 3 in multiple places. Can we define a variable, e.g. n_static_nodes = 3, and use it everywhere we need it, including in the cluster config?
Thank you! Done!
Instances:
  - InstanceType: c5.large
MinCount: 3
MaxCount: 5
Why 2 dynamic nodes? It seems that the test only needs the 3 static ones.
Done
- Rename test from test_update_rollback_fixes to test_update_rollback_failure
- Consolidate documentation into the function docstring
- Add _get_supervisorctl_path utility function to reduce code duplication
- Simplify retry logic by using the @retry decorator directly on functions
- Add _wait_for_head_node_rollback_complete to ensure the rollback recipe finishes before running assertions (the CFN stack completes before the head node)
- Remove redundant assertions and improve log messages
- Remove log_error=False
- Move verify_cluster_node_config_version_in_ddb to utils.py for reuse
- Add get_file_mtime_age_seconds utility function to utils.py
- Define N_STATIC_NODES constant and use a template variable in configs
- Change the rollback timeout to 30 minutes
…f modification time

- Add a 5-minute retry loop for the metadata_db.json existence check, since the file may be temporarily removed during the cfn-hup update
- Reduce the modification-time check window from 1 hour to 10 minutes
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.