Contributor

@WalBeh WalBeh commented Sep 23, 2025

Summary of changes

Added

  • Persistent Logging: Dual logging to both STDOUT and persistent file with automatic rotation

    • New --log-file CLI flag (default: /resource/heapdump/dc_util.log)
    • Automatic file rotation when approaching 1MB to prevent disk space issues
    • Failsafe design - continues STDOUT logging even if file logging fails
    • Essential for debugging Kubernetes lifecycle hooks where container logs may not be accessible
    • Creates directory structure if it doesn't exist
  • PostStart Hook Detection: Intelligent detection of StatefulSet PostStart hooks

    • Automatically scans StatefulSet containers for PostStart hooks with dc_util --reset-routing
    • Prevents routing allocation changes when no PostStart hook exists to reset them
    • Solves historical issue where NEW_PRIMARIES routing allocation could not be reliably reset
    • Supports both single dash (-reset-routing) and double dash (--reset-routing) flag formats
    • Precise word boundary matching prevents false positives from similar flag names
    • Logs clear messages when PostStart hooks are found or missing
  • Single Node Cluster Detection: Automatic detection and handling of single node clusters

    • Detects when StatefulSet has exactly 1 replica and skips decommission
    • Prevents unnecessary overhead and potential failures in single node deployments
    • Clear logging explains why decommission was skipped
    • Maintains existing behavior for multi-node clusters (≥2 replicas)
  • Configurable Lock File Path: New --lock-file CLI flag

    • Default: /resource/heapdump/dc_util.lock
    • Allows customization for different deployment scenarios
    • All lock file operations now use configurable path
  • Enhanced Flag Support: Improved command-line flag handling

    • Both -reset-routing and --reset-routing formats now supported
    • Maintains backward compatibility with existing deployments
    • Better error handling and validation
  • Multi-Architecture Support: Automatic CPU architecture detection in hook configurations

    • Hook examples now include automatic detection of x86_64/amd64 and aarch64/arm64 architectures
    • Downloads appropriate binary based on detected architecture (dc_util-linux-amd64 or dc_util-linux-arm64)
    • Eliminates need for separate configuration files for different node architectures
    • Graceful error handling for unsupported architectures
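The word-boundary matching described under PostStart Hook Detection could look roughly like the sketch below. It is not the code from this PR: the helper name `commandResetsRouting` and the regular expression are assumptions made for illustration. Only the `dc_util` binary name and the `-reset-routing` / `--reset-routing` flags come from the description above.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// resetRoutingFlag matches -reset-routing or --reset-routing as a whole
// argument: it must be preceded by start-of-string or whitespace and be
// followed by whitespace, '=', or end-of-string. (Assumed pattern.)
var resetRoutingFlag = regexp.MustCompile(`(^|\s)--?reset-routing(\s|=|$)`)

// commandResetsRouting is a hypothetical helper that reports whether a hook
// command invokes dc_util with the reset-routing flag.
func commandResetsRouting(command []string) bool {
	joined := strings.Join(command, " ")
	return strings.Contains(joined, "dc_util") && resetRoutingFlag.MatchString(joined)
}

func main() {
	// Double-dash form inside a shell command.
	fmt.Println(commandResetsRouting([]string{"/bin/sh", "-c", "./dc_util --reset-routing"})) // true
	// Single-dash form.
	fmt.Println(commandResetsRouting([]string{"dc_util", "-reset-routing"})) // true
	// A similar but different flag name must not match.
	fmt.Println(commandResetsRouting([]string{"dc_util", "--reset-routing-table"})) // false
}
```

Anchoring the pattern on whitespace rather than using a plain substring search is what keeps flags such as `--no-reset-routing` from being treated as a match.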

Changed

  • Routing Allocation Logic: Enhanced PreStop process with PostStart hook detection

    • Routing allocation changes now only occur when corresponding PostStart hook exists
    • Prevents permanent cluster misconfiguration in deployments without PostStart hooks
    • More intelligent decision making based on actual StatefulSet configuration
  • Replica Count Handling: Improved logic for different cluster sizes

    • Zero replicas (scaled down): Skips decommission with clear logging
    • Single replica: Skips decommission to prevent failures
    • Multiple replicas: Proceeds with normal decommission process
    • Better log messages explaining the decision for each scenario
  • Function Signatures: Updated internal functions to support configurable paths

    • createLockFile() now accepts lock file path parameter
    • removeLockFile() now accepts lock file path parameter
    • lockFileExists() now accepts lock file path parameter
    • handleResetRouting() now accepts lock file path parameter
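The replica count handling listed above amounts to a three-way decision. A minimal sketch, assuming a hypothetical helper `shouldDecommission` that receives the StatefulSet's replica count; the function name and the log wording are illustrative and not taken from the PR:

```go
package main

import "fmt"

// shouldDecommission (hypothetical) returns whether decommission should run
// for the given replica count, together with a reason suitable for logging.
func shouldDecommission(replicas int32) (bool, string) {
	switch {
	case replicas <= 0:
		return false, "StatefulSet is scaled down to 0 replicas, skipping decommission"
	case replicas == 1:
		return false, "single node cluster (1 replica), skipping decommission"
	default:
		return true, fmt.Sprintf("%d replicas, proceeding with decommission", replicas)
	}
}

func main() {
	for _, r := range []int32{0, 1, 3} {
		ok, reason := shouldDecommission(r)
		fmt.Printf("replicas=%d decommission=%t (%s)\n", r, ok, reason)
	}
}
```

Returning the reason alongside the decision keeps the "clear logging" requirement in one place, so every caller logs the same explanation.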

Improved

  • Logging Experience: Comprehensive logging improvements

    • All log messages now appear in both STDOUT and persistent file
    • Better visibility into hook execution for debugging
    • Historical logs available even after pod restarts
    • Easier troubleshooting and operations monitoring
  • Documentation: Extensively updated README.md

    • Added "Recent Updates" section highlighting new features
    • New "Replica Count Logic" section with examples
    • Updated CLI parameter table with new flags
    • Enhanced "PostStart Hook Detection" documentation
    • Added complete "Persistent Logging" section with usage examples
    • Updated sample logs sections to reflect new capabilities
    • All hook configuration examples now include automatic architecture detection
    • Clear separation between basic (preStop only) and complete (both hooks) configurations
  • Testing: Comprehensive test coverage for all new features

    • TestHasPostStartHookWithResetRouting: PostStart hook detection with various scenarios
    • TestPostStopRoutingAllocationIntegration: Integration tests for routing allocation logic
    • TestLoggingIntegration: Dual logging functionality verification
    • TestLogRotation: File rotation behavior validation
    • TestSingleNodeClusterBehavior: Single node cluster detection tests
    • TestReplicaCountBehavior: Comprehensive replica count handling tests
    • All existing tests updated to work with new function signatures

Checklist

  • Link to issue this PR refers to: https://github.com/crate/cloud/issues/2755
  • Relevant changes are reflected in CHANGES.rst
  • Added or changed code is covered by tests
  • Documentation has been updated if necessary
  • Changed code does not contain any breaking changes (or this is a major version change)

@WalBeh WalBeh requested a review from tomach September 23, 2025 16:00
Contributor

@tomach tomach left a comment

👍 nice work! maybe in the tests, it might be safer to call extractNodeName() directly instead of re-implementing the parsing logic? then the tests will break if the function ever changes unexpectedly.

@WalBeh WalBeh requested review from seut and tomach October 9, 2025 17:49
Member

@seut seut left a comment

Looks nice! But I'm not that much into this topic, especially code wise. So my review is very high level. Added just a testing related suggestion, leaving the approval up to @tomach.

}
}

func TestSplitHostname(t *testing.T) {
Member

Isn't this already tested by the test for extractNodeName as the splitHostname function is used there already?

Contributor

Good point. splitHostname() is indeed called within extractNodeName(), so it is covered indirectly. I'd still lean toward keeping it for clarity and explicitly testing the helper in isolation.

Member

Shouldn't splitHostname() no longer be called at all, since it is superseded by extractHostName? If so, a dedicated test for it may not make sense. Not sure about Go, but could this function then even be declared private?

Member

@seut seut Oct 10, 2025

Actually, I'd remove the splitHostname function altogether and inline the split call. Any test for this function only tests the library's split method (which doesn't make sense, as that is already tested elsewhere) rather than any application logic.

Contributor

agreed. it makes sense to inline the call and remove both the function and the dedicated test.
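A rough sketch of what inlining could look like. Note that `extractNodeName` here is only a stand-in: it assumes the node name is the first DNS label of the pod hostname, which may not match the parsing in this PR, and the sample hostnames are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// extractNodeName is a stand-in for the function under review. It ASSUMES
// the node name is the first DNS label of the hostname; the real function
// may parse hostnames differently.
func extractNodeName(hostname string) string {
	// Inlined split call, no separate splitHostname helper needed.
	return strings.SplitN(hostname, ".", 2)[0]
}

func main() {
	// Tests would call extractNodeName directly instead of re-implementing
	// the parsing, so they break if the function's behavior changes.
	cases := []struct{ hostname, want string }{
		{"crate-0.crate.default.svc.cluster.local", "crate-0"},
		{"crate-0", "crate-0"},
	}
	for _, c := range cases {
		got := extractNodeName(c.hostname)
		fmt.Printf("%q -> %q (ok=%t)\n", c.hostname, got, got == c.want)
	}
}
```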

Contributor

@tomach tomach left a comment

nice work! 👍

@WalBeh WalBeh changed the title from "Fix hostname parsing and add tests" to "Fix hostname parsing and improve decommission process" Oct 17, 2025

The tool can read configuration from StatefulSet labels, overriding CLI parameters. This is especially useful to dynamically adjust decommission settings, without restarting the POD
to pick up settings changes in the statefulset - think "short-cut" when needed. The tool has sensible defaults and can be run without parameters (except the `--reset-routing`).
As already mentioned `terminationGracePeriodSeconds` MUST be set larger then `--tiemout` otherwise kubelet will SIGKILL the container before decommission finished!
Contributor

typo in `--timeout`
