Fix hostname parsing and improve decommission process #773

WalBeh · 2025-09-23T16:00:44Z

Summary of changes

Added

Persistent Logging: Dual logging to both STDOUT and persistent file with automatic rotation
- New --log-file CLI flag (default: /resource/heapdump/dc_util.log)
- Automatic file rotation when approaching 1MB to prevent disk space issues
- Failsafe design - continues STDOUT logging even if file logging fails
- Essential for debugging Kubernetes lifecycle hooks where container logs may not be accessible
- Creates directory structure if it doesn't exist
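As an illustration of the dual-logging idea above, here is a minimal sketch, not the PR's actual implementation. It assumes the standard `log` package with `io.MultiWriter`, rotation by renaming the file to a `.old` suffix at startup, and a threshold of exactly 1 MiB; the helper names `shouldRotate`, `rotate`, and `newLogger` are hypothetical.

```go
package main

import (
	"io"
	"log"
	"os"
	"path/filepath"
)

// maxLogSize approximates the "approaching 1MB" threshold (assumed: exactly 1 MiB).
const maxLogSize int64 = 1 << 20

// shouldRotate reports whether a log file of the given size has reached the limit.
func shouldRotate(size, limit int64) bool {
	return size >= limit
}

// rotate renames an oversized log file to path+".old"; a missing file is not an error.
func rotate(path string, limit int64) error {
	info, err := os.Stat(path)
	if os.IsNotExist(err) {
		return nil
	}
	if err != nil {
		return err
	}
	if !shouldRotate(info.Size(), limit) {
		return nil
	}
	return os.Rename(path, path+".old")
}

// newLogger logs to STDOUT and, if the file can be prepared, to path as well.
// On any file error it falls back to STDOUT only. The returned func closes the file.
func newLogger(path string) (*log.Logger, func()) {
	stdout := log.New(os.Stdout, "", log.LstdFlags)
	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
		stdout.Printf("file logging disabled: %v", err)
		return stdout, func() {}
	}
	if err := rotate(path, maxLogSize); err != nil {
		stdout.Printf("log rotation failed: %v", err)
	}
	f, err := os.OpenFile(path, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0o644)
	if err != nil {
		stdout.Printf("file logging disabled: %v", err)
		return stdout, func() {}
	}
	return log.New(io.MultiWriter(os.Stdout, f), "", log.LstdFlags), func() { f.Close() }
}

func main() {
	// Hypothetical demo path; the PR's default is /resource/heapdump/dc_util.log.
	logger, closeLog := newLogger(filepath.Join(os.TempDir(), "dc_util_example", "dc_util.log"))
	defer closeLog()
	logger.Println("decommission started")
}
```

The fallback branches are what make the design failsafe: a read-only or missing volume degrades to STDOUT-only logging instead of aborting the hook.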
PostStart Hook Detection: Intelligent detection of StatefulSet PostStart hooks
- Automatically scans StatefulSet containers for PostStart hooks with dc_util --reset-routing
- Prevents routing allocation changes when no PostStart hook exists to reset them
- Solves historical issue where NEW_PRIMARIES routing allocation could not be reliably reset
- Supports both single dash (-reset-routing) and double dash (--reset-routing) flag formats
- Precise word boundary matching prevents false positives from similar flag names
- Logs clear messages when PostStart hooks are found or missing
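The flag matching described above could look roughly like the following sketch. It assumes the hook command is available as a plain string slice (the real code presumably reads it from the StatefulSet spec); the helper name `commandResetsRouting` and the exact regular expression are hypothetical, not taken from the PR.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// resetRoutingFlag matches -reset-routing or --reset-routing as a whole token:
// the flag must not be directly preceded or followed by a word character or a dash.
var resetRoutingFlag = regexp.MustCompile(`(^|[^\w-])--?reset-routing($|[^\w-])`)

// commandResetsRouting (hypothetical helper) reports whether a hook command
// mentions dc_util together with the reset-routing flag.
func commandResetsRouting(command []string) bool {
	joined := strings.Join(command, " ")
	return strings.Contains(joined, "dc_util") && resetRoutingFlag.MatchString(joined)
}

func main() {
	hook := []string{"/bin/sh", "-c", "./dc_util --reset-routing"}
	fmt.Println(commandResetsRouting(hook))
}
```

Because a dash is not a word character, a plain `\b` boundary would not separate `--reset-routing` from `--reset-routing-now`; the explicit character classes handle that case.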
Single Node Cluster Detection: Automatic detection and handling of single node clusters
- Detects when StatefulSet has exactly 1 replica and skips decommission
- Prevents unnecessary overhead and potential failures in single node deployments
- Clear logging explains why decommission was skipped
- Maintains existing behavior for multi-node clusters (≥2 replicas)
Configurable Lock File Path: New --lock-file CLI flag
- Default: /resource/heapdump/dc_util.lock
- Allows customization for different deployment scenarios
- All lock file operations now use configurable path
Enhanced Flag Support: Improved command-line flag handling
- Both -reset-routing and --reset-routing formats now supported
- Maintains backward compatibility with existing deployments
- Better error handling and validation
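For context on the two flag formats: Go's standard `flag` package accepts one or two leading dashes for the same flag. The sketch below assumes the tool uses that package (the PR does not say so) and reuses the flag names and defaults listed above; the `options` struct and `parseArgs` function are hypothetical.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// options holds the subset of CLI settings used in this sketch.
type options struct {
	resetRouting bool
	lockFile     string
	logFile      string
}

// parseArgs parses args with the standard flag package, which treats
// -name and --name as the same flag.
func parseArgs(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("dc_util", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage output out of this demo
	fs.BoolVar(&o.resetRouting, "reset-routing", false, "reset routing allocation")
	fs.StringVar(&o.lockFile, "lock-file", "/resource/heapdump/dc_util.lock", "lock file path")
	fs.StringVar(&o.logFile, "log-file", "/resource/heapdump/dc_util.log", "log file path")
	err := fs.Parse(args)
	return o, err
}

func main() {
	for _, args := range [][]string{
		{"-reset-routing"},
		{"--reset-routing", "--lock-file", "/tmp/dc_util.lock"},
	} {
		o, err := parseArgs(args)
		fmt.Println(o.resetRouting, o.lockFile, err)
	}
}
```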
Multi-Architecture Support: Automatic CPU architecture detection in hook configurations
- Hook examples now include automatic detection of x86_64/amd64 and aarch64/arm64 architectures
- Downloads appropriate binary based on detected architecture (dc_util-linux-amd64 or dc_util-linux-arm64)
- Eliminates need for separate configuration files for different node architectures
- Graceful error handling for unsupported architectures
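The architecture-to-binary mapping described above can be expressed as a small lookup. This is a sketch in Go for illustration only; the hook examples in the README are not shown on this page and may well use shell. The function name `binaryForMachine` is hypothetical, while the binary names come from the description above.

```go
package main

import (
	"fmt"
	"runtime"
)

// binaryForMachine (hypothetical helper) maps a machine/architecture string,
// such as the output of `uname -m` or runtime.GOARCH, to a release binary name.
func binaryForMachine(machine string) (string, error) {
	switch machine {
	case "x86_64", "amd64":
		return "dc_util-linux-amd64", nil
	case "aarch64", "arm64":
		return "dc_util-linux-arm64", nil
	default:
		return "", fmt.Errorf("unsupported architecture: %q", machine)
	}
}

func main() {
	name, err := binaryForMachine(runtime.GOARCH)
	fmt.Println(name, err)
}
```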

Changed

Routing Allocation Logic: Enhanced PreStop process with PostStart hook detection
- Routing allocation changes now only occur when corresponding PostStart hook exists
- Prevents permanent cluster misconfiguration in deployments without PostStart hooks
- More intelligent decision making based on actual StatefulSet configuration
Replica Count Handling: Improved logic for different cluster sizes
- Zero replicas (scaled down): Skips decommission with clear logging
- Single replica: Skips decommission to prevent failures
- Multiple replicas: Proceeds with normal decommission process
- Better log messages explaining the decision for each scenario
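The three replica-count cases above amount to a small decision function. The following is a sketch under the assumption that the replica count is already known as an `int32` (as in the Kubernetes StatefulSet spec); the function name and log wording are hypothetical.

```go
package main

import "fmt"

// decommissionDecision (hypothetical helper) maps a StatefulSet replica count
// to whether decommission should run, plus a reason suitable for logging.
func decommissionDecision(replicas int32) (run bool, reason string) {
	switch {
	case replicas <= 0:
		return false, "StatefulSet is scaled down to zero replicas; skipping decommission"
	case replicas == 1:
		return false, "single node cluster detected; skipping decommission"
	default:
		return true, fmt.Sprintf("%d replicas; proceeding with decommission", replicas)
	}
}

func main() {
	for _, n := range []int32{0, 1, 3} {
		run, reason := decommissionDecision(n)
		fmt.Printf("replicas=%d run=%t reason=%s\n", n, run, reason)
	}
}
```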
Function Signatures: Updated internal functions to support configurable paths
- createLockFile() now accepts lock file path parameter
- removeLockFile() now accepts lock file path parameter
- lockFileExists() now accepts lock file path parameter
- handleResetRouting() now accepts lock file path parameter
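To make the signature change concrete, here is a sketch of path-parameterized lock file helpers. Only the function names come from the list above; the return types, the use of `O_EXCL` to refuse an existing lock, and the choice to ignore a missing file on removal are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// createLockFile creates the lock file at path and fails if it already exists
// (assumed behavior, via O_EXCL).
func createLockFile(path string) error {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	return f.Close()
}

// removeLockFile deletes the lock file; a missing file is treated as success.
func removeLockFile(path string) error {
	err := os.Remove(path)
	if errors.Is(err, fs.ErrNotExist) {
		return nil
	}
	return err
}

// lockFileExists reports whether the lock file can be stat'ed at path.
func lockFileExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	// Hypothetical demo path; the PR's default is /resource/heapdump/dc_util.lock.
	path := filepath.Join(os.TempDir(), "dc_util_example.lock")
	_ = removeLockFile(path)
	fmt.Println(createLockFile(path), lockFileExists(path))
	fmt.Println(removeLockFile(path), lockFileExists(path))
}
```

Passing the path explicitly, rather than reading a package-level constant, is also what lets tests point these helpers at a temporary directory.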

Improved

Logging Experience: Comprehensive logging improvements
- All log messages now appear in both STDOUT and persistent file
- Better visibility into hook execution for debugging
- Historical logs available even after pod restarts
- Easier troubleshooting and operations monitoring
Documentation: Extensively updated README.md
- Added "Recent Updates" section highlighting new features
- New "Replica Count Logic" section with examples
- Updated CLI parameter table with new flags
- Enhanced "PostStart Hook Detection" documentation
- Added complete "Persistent Logging" section with usage examples
- Updated sample logs sections to reflect new capabilities
- All hook configuration examples now include automatic architecture detection
- Clear separation between basic (preStop only) and complete (both hooks) configurations
Testing: Comprehensive test coverage for all new features
- TestHasPostStartHookWithResetRouting: PostStart hook detection with various scenarios
- TestPostStopRoutingAllocationIntegration: Integration tests for routing allocation logic
- TestLoggingIntegration: Dual logging functionality verification
- TestLogRotation: File rotation behavior validation
- TestSingleNodeClusterBehavior: Single node cluster detection tests
- TestReplicaCountBehavior: Comprehensive replica count handling tests
- All existing tests updated to work with new function signatures

Checklist

Link to issue this PR refers to: https://github.com/crate/cloud/issues/2755
Relevant changes are reflected in CHANGES.rst
Added or changed code is covered by tests
Documentation has been updated if necessary
Changed code does not contain any breaking changes (or this is a major version change)

tomach

👍 nice work! maybe in the tests, it might be safer to call extractNodeName() directly instead of re-implementing the parsing logic? then the tests will break if the function ever changes unexpectedly.

seut

Looks nice! But I'm not that deep into this topic, especially code-wise, so my review is very high-level. I added just one testing-related suggestion and am leaving the approval up to @tomach.

seut · 2025-10-10T10:36:42Z

utils/dc_util/decommission_test.go

+	}
+}
+
+func TestSplitHostname(t *testing.T) {


Isn't this already tested by the test for extractNodeName as the splitHostname function is used there already?

Good point. splitHostname() is indeed called within extractNodeName(), so it is covered indirectly. I'd still lean toward keeping it for clarity and explicitly testing the helper in isolation.

Shouldn't splitHostname() NOT be called anymore, as it's superseded by extractNodeName()? If so, a dedicated test for it may not make sense. Not sure about Go, but could this method then even be declared private?

Actually, I'd remove the splitHostname method altogether and inline the split call. Any test of this method tests the library's split function (which doesn't make sense, as that is already tested elsewhere) rather than any application logic.

agreed. it makes sense to inline the call and remove both the function and the dedicated test.
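The outcome of this thread (inline the split, drop the wrapper, and exercise `extractNodeName()` directly) might look like the sketch below. The body of `extractNodeName` here is a stand-in: it assumes the node name is the first dot-separated label of the hostname, which may not match the PR's real parsing rules, and the sample hostnames are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// extractNodeName is a stand-in for the PR's function. It ASSUMES the node name
// is the first dot-separated label of the hostname; the real rules may differ.
func extractNodeName(hostname string) string {
	// strings.Split is called inline, so no splitHostname wrapper is needed.
	return strings.Split(hostname, ".")[0]
}

func main() {
	// Checks call extractNodeName directly instead of re-implementing the parsing,
	// so they fail if the function's behavior ever changes.
	cases := map[string]string{
		"crate-data-0":                   "crate-data-0",
		"crate-data-0.svc.cluster.local": "crate-data-0",
	}
	for in, want := range cases {
		if got := extractNodeName(in); got != want {
			panic(fmt.Sprintf("extractNodeName(%q) = %q, want %q", in, got, want))
		}
	}
	fmt.Println("ok")
}
```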

tomach

nice work! 👍

utils/dc_util/README.md

tomach · 2025-10-10T13:54:23Z

utils/dc_util/decommission_test.go

+	}
+}
+
+func TestSplitHostname(t *testing.T) {


Good point. splitHostname() is indeed called within extractNodeName(), so it is covered indirectly. I'd still lean toward keeping it for clarity and explicitly testing the helper in isolation.

goat-ssh · 2025-10-17T16:29:39Z

utils/dc_util/README.md

+The tool can read configuration from StatefulSet labels, overriding CLI parameters. This is especially useful to dynamically adjust decommission settings, without restarting the POD
+to pick up settings changes in the statefulset - think "short-cut" when needed. The tool has sensible defaults and can be run without parameters (except the `--reset-routing`).
+
+As already mentioned `terminationGracePeriodSeconds` MUST be set larger then `--tiemout` otherwise kubelet will SIGKILL the container before decommission finished!


typo in `--timeout`

WalBeh requested a review from tomach September 23, 2025 16:00

tomach approved these changes Sep 25, 2025


WalBeh requested review from seut and tomach October 9, 2025 17:49

seut reviewed Oct 10, 2025


tomach approved these changes Oct 10, 2025


Fix hostname parsing and add tests

79cd722

WalBeh force-pushed the bw/fix-node-selection branch from d64c72e to 79cd722 Compare October 14, 2025 08:00

WalBeh mentioned this pull request Oct 14, 2025

Non-Blocking ALTER CLUSTER DECOMMISSION in Containerized Environments crate/crate#18510

Open

WalBeh added 5 commits October 16, 2025 13:13

Add disable label

574aff0

Add --dry-un

7156c37

Add --reset-routing as a Post Start Hook

64c72ad

Add persistent logging and PostStart hook detection

414b2b3

Add multi-architecture support and Makefile, updated README.md

85cb1c2

WalBeh changed the title ~~Fix hostname parsing and add tests~~ Fix hostname parsing and improve decommission process Oct 17, 2025

goat-ssh reviewed Oct 17, 2025


WalBeh added 3 commits October 20, 2025 10:08

use upload-artifact v4

b5288ca

fixup! use upload-artifact v4

6d45418

fixup! fixup! use upload-artifact v4

4fa5486
