HDFS-17840. Fix incorrect Nodes-in-Service count in NameNode UI StorageType stats#8326
Open
balodesecurity wants to merge 1 commit intoapache:trunkfrom
Open
HDFS-17840. Fix incorrect Nodes-in-Service count in NameNode UI StorageType stats#8326balodesecurity wants to merge 1 commit intoapache:trunkfrom
balodesecurity wants to merge 1 commit intoapache:trunkfrom
Conversation
…geType stats. The StorageType stats map maintained a nodesInService counter using increments/decrements (via StorageTypeStats.addNode / subtractNode). When nodesInService dropped to 0, the entry for that storage type was removed from the map — even when decommissioning nodes still used the storage type and still contributed capacity data. When the entry was later recreated by an addStorage call, it started fresh with nodesInService = 0. Subsequent in-service node heartbeats then performed subtract (no-op, entry was gone) followed by add (creates entry, nodesInService = 1), which was correct. But any in-service node whose subtract ran against the freshly-created entry saw nodesInService decrement past 0 to -1, and then add brought it back to 0 — so that node's in-service contribution was lost for the rest of the session. Fix: add a totalNodes counter to StorageTypeStats that tracks ALL nodes using a storage type (in-service + decommissioning + maintenance). Change the map-entry removal condition from nodesInService == 0 to totalNodes == 0. An entry is now removed only when no node of any admin state still uses that storage type, preventing the premature removal that caused the count corruption. Added TestStorageTypeStatsMap with 4 unit tests covering: - Basic add/remove correctness - Entry survival when a decommissioning node still uses the storage type - nodesInService stability after the last in-service node decommissions - Entry removal only when all nodes (including decommissioning) are gone Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
💔 -1 overall
This message was automatically generated. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The Nodes in Service count displayed per storage type in the NameNode UI (DFS Storage Types section) could become grossly incorrect — e.g. showing 3 nodes when the cluster has 26.
Root Cause
DatanodeStats.StorageTypeStatsMapmaintains aStorageTypeStatsentry per storage type with an incrementalnodesInServicecounter. The map entry was removed whenevernodesInServicedropped to 0 — even when decommissioning/maintenance nodes still used the same storage type.The premature removal caused a cascade:
nodesInServicedrops to 0 → DISK entry removed.nodesInService = 0).subtract(B)runs against the fresh entry →nodesInService: 0→-1. Thenadd(B)→nodesInService: -1→0. B's in-service contribution is lost.Fix
Add a
totalNodescounter toStorageTypeStatsthat tracks all nodes using a storage type (in-service + decommissioning + maintenance). Change the map-entry removal condition fromnodesInService == 0tototalNodes == 0. An entry is now only removed when no node of any admin state still uses that storage type.Changed files:
StorageTypeStats.java— newtotalNodesfield;addNode/subtractNodealways update it; newgetTotalNodes()accessorDatanodeStats.java— removal condition updated togetTotalNodes() == 0TestStorageTypeStatsMap.java— 4 new unit tests (new file)Test plan
TestStorageTypeStatsMap(4 tests) — PASStestBasicAddRemove— basic correctnesstestEntryNotRemovedWhenDecommissioningNodeRemains— entry survives when a decommissioning node still uses the storage type; nodesInService stays correcttestEntryNotRemovedWhenLastInServiceDecommissions— entry survives when the last in-service node decommissions; new in-service node is counted correctlytestEntryRemovedOnlyWhenAllNodesGone— entry removed only after all nodes (including decommissioning) are gone