
HDFS-17840. Fix incorrect Nodes-in-Service count in NameNode UI StorageType stats#8326

Open
balodesecurity wants to merge 1 commit into apache:trunk from balodesecurity:HDFS-17840

Conversation

@balodesecurity

Summary

The Nodes in Service count displayed per storage type in the NameNode UI (DFS Storage Types section) could become grossly incorrect — e.g. showing 3 nodes when the cluster has 26.

Root Cause

DatanodeStats.StorageTypeStatsMap maintains a StorageTypeStats entry per storage type with an incremental nodesInService counter. The map entry was removed whenever nodesInService dropped to 0 — even when decommissioning/maintenance nodes still used the same storage type.

The premature removal caused a cascade:

  1. Node A (last in-service node for DISK) starts decommissioning → nodesInService drops to 0 → DISK entry removed.
  2. Next heartbeat from any node recreates the entry fresh (nodesInService = 0).
  3. When in-service node B heartbeats: subtract(B) runs against the fresh entry → nodesInService: 0→-1. Then add(B) runs → nodesInService: -1→0. B's in-service contribution is lost.
  4. After enough such cycles the reported count is far below the real number.
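The cascade can be replayed against a minimal stand-in for the stats map. This is an illustrative model only — the method names and the single-counter map are simplifications, not the real StorageTypeStats API:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal model of the buggy counter lifecycle. The map holds one
// nodesInService counter per storage type, and the removal condition
// mirrors the pre-fix behavior: drop the entry when nodesInService == 0,
// ignoring decommissioning/maintenance nodes that still use the type.
public class UnderflowDemo {
    public static Map<String, int[]> stats = new HashMap<>();

    public static void add(String type, boolean inService) {
        int[] s = stats.computeIfAbsent(type, k -> new int[]{0});
        if (inService) s[0]++;
    }

    public static void subtract(String type, boolean inService) {
        int[] s = stats.get(type);
        if (s == null) return;
        if (inService) s[0]--;
        if (s[0] == 0) stats.remove(type); // buggy: ignores other admin states
    }

    public static void main(String[] args) {
        add("DISK", true);       // in-service node A: nodesInService = 1
        add("DISK", false);      // decommissioning node C also uses DISK
        subtract("DISK", true);  // A decommissions: 0 -> entry removed
        add("DISK", false);      // C's next heartbeat recreates entry at 0
        // In-service node B heartbeats: subtract then add on the fresh entry.
        subtract("DISK", true);  // 0 -> -1 (entry stays: -1 != 0)
        add("DISK", true);       // -1 -> 0: B's contribution is lost
        System.out.println(stats.get("DISK")[0]); // prints 0, not 1
    }
}
```

Node B really is in service, yet the counter reads 0 — exactly the corruption described above.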

Fix

Add a totalNodes counter to StorageTypeStats that tracks all nodes using a storage type (in-service + decommissioning + maintenance). Change the map-entry removal condition from nodesInService == 0 to totalNodes == 0. An entry is now only removed when no node of any admin state still uses that storage type.

Changed files:

  • StorageTypeStats.java — new totalNodes field; addNode/subtractNode always update it; new getTotalNodes() accessor
  • DatanodeStats.java — removal condition updated to getTotalNodes() == 0
  • TestStorageTypeStatsMap.java — 4 new unit tests (new file)
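The shape of the fix can be sketched on the same simplified model. Field and method names mirror the description above (totalNodes, addNode, subtractNode, getTotalNodes), but the class is a stand-in, not the actual HDFS StorageTypeStats:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the fixed bookkeeping: totalNodes counts every node using the
// storage type regardless of admin state, and entry removal keys off
// totalNodes instead of nodesInService.
public class FixedStatsDemo {
    public static class StorageTypeStats {
        public int nodesInService;
        public int totalNodes; // new: in-service + decommissioning + maintenance

        void addNode(boolean inService) {
            totalNodes++;
            if (inService) nodesInService++;
        }

        void subtractNode(boolean inService) {
            totalNodes--;
            if (inService) nodesInService--;
        }

        int getTotalNodes() { return totalNodes; }
    }

    public static Map<String, StorageTypeStats> map = new HashMap<>();

    public static void add(String type, boolean inService) {
        map.computeIfAbsent(type, k -> new StorageTypeStats()).addNode(inService);
    }

    public static void subtract(String type, boolean inService) {
        StorageTypeStats s = map.get(type);
        if (s == null) return;
        s.subtractNode(inService);
        if (s.getTotalNodes() == 0) map.remove(type); // fixed removal condition
    }

    public static void main(String[] args) {
        add("DISK", true);       // in-service node A
        add("DISK", false);      // decommissioning node C
        subtract("DISK", true);  // A leaves service; entry survives (totalNodes = 1)
        add("DISK", true);       // in-service node B counted correctly
        System.out.println(map.get("DISK").nodesInService); // prints 1
    }
}
```

With the entry kept alive by the decommissioning node, node B's add lands on consistent state and the count stays correct.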

Test plan

  • TestStorageTypeStatsMap (4 tests) — PASS
    • testBasicAddRemove — basic correctness
    • testEntryNotRemovedWhenDecommissioningNodeRemains — entry survives when a decommissioning node still uses the storage type; nodesInService stays correct
    • testEntryNotRemovedWhenLastInServiceDecommissions — entry survives when the last in-service node decommissions; new in-service node is counted correctly
    • testEntryRemovedOnlyWhenAllNodesGone — entry removed only after all nodes (including decommissioning) are gone
  • Full blockmanagement test suite (CI)
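The four scenarios above can be replayed compactly against a simplified stand-in for the stats map (the real tests in TestStorageTypeStatsMap.java use the actual HDFS classes; the counters here are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Each array holds {nodesInService, totalNodes} per storage type, with the
// fixed removal condition (totalNodes == 0). The four blocks in main()
// mirror the four test scenarios listed above.
public class ScenarioDemo {
    public static Map<String, int[]> map = new HashMap<>();

    public static void add(String type, boolean inService) {
        int[] s = map.computeIfAbsent(type, k -> new int[2]);
        if (inService) s[0]++;
        s[1]++;
    }

    public static void subtract(String type, boolean inService) {
        int[] s = map.get(type);
        if (s == null) return;
        if (inService) s[0]--;
        s[1]--;
        if (s[1] == 0) map.remove(type); // fixed removal condition
    }

    public static void main(String[] args) {
        // Basic add/remove: entry appears and disappears cleanly.
        add("SSD", true);
        subtract("SSD", true);
        assert !map.containsKey("SSD");

        // Entry survives while a decommissioning node remains.
        add("DISK", true);       // in-service A
        add("DISK", false);      // decommissioning C
        subtract("DISK", true);  // A decommissions
        assert map.containsKey("DISK") && map.get("DISK")[0] == 0;

        // A new in-service node is counted correctly afterwards.
        add("DISK", true);       // in-service B
        assert map.get("DISK")[0] == 1;

        // Entry removed only when all nodes (including decommissioning) are gone.
        subtract("DISK", true);
        subtract("DISK", false);
        assert !map.containsKey("DISK");
        System.out.println("all scenarios hold");
    }
}
```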

…geType stats.

The StorageType stats map maintained a nodesInService counter using
increments/decrements (via StorageTypeStats.addNode / subtractNode).
When nodesInService dropped to 0, the entry for that storage type was
removed from the map — even when decommissioning nodes still used the
storage type and still contributed capacity data.

When the entry was later recreated by an addStorage call, it started
fresh with nodesInService = 0.  Subsequent in-service node heartbeats
then performed subtract (no-op, entry was gone) followed by add (creates
entry, nodesInService = 1), which was correct.  But any in-service node
whose subtract ran against the freshly-created entry saw nodesInService
decrement past 0 to -1, and then add brought it back to 0 — so that
node's in-service contribution was lost for the rest of the session.

Fix: add a totalNodes counter to StorageTypeStats that tracks ALL nodes
using a storage type (in-service + decommissioning + maintenance).
Change the map-entry removal condition from nodesInService == 0 to
totalNodes == 0.  An entry is now removed only when no node of any
admin state still uses that storage type, preventing the premature
removal that caused the count corruption.

Added TestStorageTypeStatsMap with 4 unit tests covering:
- Basic add/remove correctness
- Entry survival when a decommissioning node still uses the storage type
- nodesInService stability after the last in-service node decommissions
- Entry removal only when all nodes (including decommissioning) are gone

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 14m 8s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 43m 52s trunk passed
+1 💚 compile 1m 41s trunk passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04
+1 💚 compile 1m 47s trunk passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1
+1 💚 checkstyle 1m 45s trunk passed
+1 💚 mvnsite 1m 48s trunk passed
+1 💚 javadoc 2m 48s trunk passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04
+1 💚 javadoc 1m 27s trunk passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1
+1 💚 spotbugs 4m 24s trunk passed
+1 💚 shadedclient 33m 6s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 1m 21s the patch passed
+1 💚 compile 1m 16s the patch passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04
+1 💚 javac 1m 16s the patch passed
+1 💚 compile 1m 20s the patch passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1
+1 💚 javac 1m 20s the patch passed
+1 💚 blanks 0m 1s The patch has no blanks issues.
+1 💚 checkstyle 1m 16s the patch passed
+1 💚 mvnsite 1m 28s the patch passed
+1 💚 javadoc 0m 59s the patch passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04
+1 💚 javadoc 1m 0s the patch passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1
+1 💚 spotbugs 3m 56s the patch passed
+1 💚 shadedclient 33m 23s patch has no errors when building and testing our client artifacts.
_ Other Tests _
-1 ❌ unit 217m 27s /patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt hadoop-hdfs in the patch passed.
+1 💚 asflicense 0m 45s The patch does not generate ASF License warnings.
368m 44s
Reason Tests
Failed junit tests hadoop.hdfs.TestBlockRecoveryCauseStandbyNameNodeCrash
Subsystem Report/Notes
Docker ClientAPI=1.54 ServerAPI=1.54 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8326/1/artifact/out/Dockerfile
GITHUB PR #8326
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux c2b8f121b3d5 5.15.0-168-generic #178-Ubuntu SMP Fri Jan 9 19:05:03 UTC 2026 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 6ddc72c
Default Java Ubuntu-17.0.18+8-Ubuntu-124.04.1
Multi-JDK versions /usr/lib/jvm/java-21-openjdk-amd64:Ubuntu-21.0.10+7-Ubuntu-124.04 /usr/lib/jvm/java-17-openjdk-amd64:Ubuntu-17.0.18+8-Ubuntu-124.04.1
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8326/1/testReport/
Max. process+thread count 3080 (vs. ulimit of 5500)
modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8326/1/console
versions git=2.43.0 maven=3.9.11 spotbugs=4.9.7
Powered by Apache Yetus 0.14.1 https://yetus.apache.org

This message was automatically generated.
