Cluster: Enhance debugging in logs #2815

zhijun42 · 2025-11-07T06:58:15Z

This PR makes a few small changes to make cluster logs easier to debug.

[1] Add human-readable node names in cluster tests so that it is easy to tell which node a log line refers to:

pong packet received from: 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) from client: :0
node 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) announces that it is a primary in shard c6d1152caee49a5e70cb4b77d1549386078be603
Reconfiguring node 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) as primary for shard c6d1152caee49a5e70cb4b77d1549386078be603
Configuration change detected. Reconfiguring myself as a replica of node 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) in shard c6d1152caee49a5e70cb4b77d1549386078be603

[2] Currently there are logs showing the address of incoming connections:

Accepting cluster node connection from 127.0.0.1:59956
Accepting cluster node connection from 127.0.0.1:59957
Accepting cluster node connection from 127.0.0.1:59958
Accepting cluster node connection from 127.0.0.1:59959

but they give no indication of which node each connection belongs to. I added a log statement at the point where a node is bound to its inbound link connection:

Bound cluster node 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) to connection of client 127.0.0.1:59956

[3] Add a debug log when processing a packet to show the packet type, sender node name, and sender client port (this also has the benefit of telling us whether this is an inbound or outbound link).

pong packet received from: 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) from client: :0
ping packet received from: 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) from client: 127.0.0.1:59973
fail packet received from: 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) from client: 127.0.0.1:59973
auth-req packet received from: 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) from client: 127.0.0.1:59973
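
As a rough illustration of [3], the sketch below builds a line in the same shape as the examples above and skips the formatting work when debug output is disabled (the commit list mentions wrapping the logging in a verbosity check). The function name, the level constants, and the signature are hypothetical and are not taken from src/cluster_legacy.c.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Illustrative log levels (assumption): a lower value means more verbose. */
enum { LOG_DEBUG = 0, LOG_VERBOSE = 1, LOG_NOTICE = 2 };

/* Hypothetical helper, not the Valkey implementation.
 * Formats the packet line only when the configured verbosity admits
 * debug output. Returns 1 if a line was written to buf, 0 if skipped. */
int format_packet_debug_line(int verbosity, char *buf, size_t len,
                             const char *type, const char *node_id,
                             const char *human_name, const char *client_addr) {
    assert(buf != NULL && len > 0);
    if (verbosity > LOG_DEBUG) return 0; /* avoid the formatting cost */
    snprintf(buf, len, "%s packet received from: %s (%s) from client: %s",
             type, node_id, human_name, client_addr);
    return 1;
}
```

Checking the level before formatting keeps the per-packet cost close to zero when debug logging is off, which matters because this path runs for every cluster bus packet.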

codecov · 2025-11-07T07:18:14Z

Codecov Report

❌ Patch coverage is 58.33333% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.42%. Comparing base (dd2827a) to head (f030cd8).
⚠️ Report is 2 commits behind head on unstable.

Files with missing lines	Patch %	Lines
src/cluster_legacy.c	58.33%	5 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #2815      +/-   ##
============================================
- Coverage     72.43%   72.42%   -0.02%     
============================================
  Files           128      128              
  Lines         70428    70439      +11     
============================================
- Hits          51017    51016       -1     
- Misses        19411    19423      +12

Files with missing lines	Coverage Δ
src/cluster_legacy.c	87.35% <58.33%> (+0.23%)	⬆️

... and 14 files with indirect coverage changes

JimB123

Not sure about the .gitignore file, but otherwise looks ok.

hpatro

[3] Add a debug log when processing a packet to show the packet type, sender node name, and sender client port (this also has the benefit of telling us whether this is an inbound or outbound link).

Not really sure about the benefit of adding the ip/port to the nodename (the human nodename seems valuable). And wouldn't it always be the inbound link from another node, as this is under clusterProcessPacket?

zhijun42 · 2025-11-19T00:35:58Z

@hpatro

Not really sure about the benefit of ip/port addition to the nodename / (human nodename seems valuable)

Node A can open multiple connections to node B. For example, when I was investigating a tricky scenario in #2811, I needed to see which connection node A was using to send packets to node B, and why that connection was created or closed. That helped me work out the ordering of packets. A single TCP/TLS connection guarantees in-order delivery, but with multiple connections it is not clear which packets are received first, and I needed to know that.
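
To make the per-connection reasoning concrete, here is a small sketch that pulls the client address out of a packet log line so that lines can be grouped by connection. It assumes the "from client: " format shown in the PR description and lines without a trailing newline; the function names are made up for illustration and are not part of Valkey.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns a pointer to the client address at the end of a packet log line,
 * or NULL when the line does not contain the "from client: " marker. */
const char *packet_log_client_addr(const char *line) {
    static const char marker[] = "from client: ";
    assert(line != NULL);
    const char *p = strstr(line, marker);
    return p ? p + (sizeof marker - 1) : NULL;
}

/* Two log lines are attributed to the same connection when both carry a
 * client address and the addresses are identical. */
int same_connection(const char *a, const char *b) {
    const char *ca = packet_log_client_addr(a);
    const char *cb = packet_log_client_addr(b);
    return ca != NULL && cb != NULL && strcmp(ca, cb) == 0;
}
```

Ordering is only guaranteed between lines for which same_connection returns true; lines from different connections may interleave in any order.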

and wouldn't it be always the inbound link from another node as this is under clusterProcessPacket.

No, it can be either inbound or outbound.

Say node A sends a PING to node B over node A's outbound link. Node B replies with a PONG over the same connection, so node A runs clusterProcessPacket with an outbound link.
Say node B broadcasts PONGs to all other nodes, meaning node B proactively sends packets rather than passively replying. Then node A runs clusterProcessPacket with an inbound link.
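
The two cases reduce to one rule: from the receiver's point of view, the link direction depends on who opened the connection the packet arrived on. A minimal sketch of that rule, using hypothetical enum and function names rather than the actual clusterLink fields:

```c
#include <assert.h>

enum link_dir { LINK_INBOUND, LINK_OUTBOUND };
enum conn_initiator { INITIATED_BY_RECEIVER, INITIATED_BY_SENDER };

/* Hypothetical model, not Valkey code. A packet arrives on the receiver's
 * outbound link if the receiver opened the connection (e.g. a PONG reply
 * to its own PING), and on its inbound link if the sender opened it
 * (e.g. a proactive PONG broadcast). */
enum link_dir receiving_link(enum conn_initiator who) {
    assert(who == INITIATED_BY_RECEIVER || who == INITIATED_BY_SENDER);
    return who == INITIATED_BY_RECEIVER ? LINK_OUTBOUND : LINK_INBOUND;
}
```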

If you set the log level to verbose/debug and read the logs carefully, this pattern is clearly visible.

zhijun42 · 2025-11-23T23:42:28Z

@hpatro Could you approve if this looks good to you? ☺️

zhijun42 · 2025-11-25T14:06:30Z

Looks like the known flaky test is still failing:

*** [err]: Primaries will not time out then they are elected in the same epoch in tests/unit/cluster/failover2.tcl
expected message found in log file: *Failover attempt expired*

github-actions bot assigned zhijun42 Nov 7, 2025

JimB123 suggested changes Nov 13, 2025

src/connection.h (outdated)

JimB123 suggested changes Nov 17, 2025

src/cluster_legacy.c (outdated, 3 threads)

zhijun42 added 6 commits November 18, 2025 11:33

Cluster: Enhance debugging in logs

286659a

Signed-off-by: Zhijun <dszhijun@gmail.com>

Fix errors

7197288

Signed-off-by: Zhijun <dszhijun@gmail.com>

Clean up

85b13c7

Signed-off-by: Zhijun <dszhijun@gmail.com>

Use connAddr instead

904f684

Signed-off-by: Zhijun <dszhijun@gmail.com>

Replace connAddr with connAddrPeerName; Wrap logging with verbosity check

e442181

Signed-off-by: Zhijun <dszhijun@gmail.com>

Resolve conflict leftovers

018365b

Signed-off-by: Zhijun <dszhijun@gmail.com>

zhijun42 force-pushed the cluster-enhance-debuging branch from 496bf37 to 018365b Compare November 18, 2025 03:41

JimB123 reviewed Nov 18, 2025

.gitignore (outdated)

JimB123 approved these changes Nov 18, 2025

hpatro reviewed Nov 19, 2025

Remove ignore dir

3396288

Signed-off-by: Zhijun <dszhijun@gmail.com>

hpatro reviewed Nov 24, 2025

src/cluster_legacy.c

.gitignore (outdated)

zhijun42 and others added 2 commits November 25, 2025 21:39

Recover gitignore

5e9d84b

Signed-off-by: Zhijun <dszhijun@gmail.com>

Merge branch 'valkey-io:unstable' into cluster-enhance-debuging

f030cd8

hpatro approved these changes Nov 26, 2025

hpatro merged commit da3c43d into valkey-io:unstable Nov 26, 2025
55 of 56 checks passed

zhijun42 deleted the cluster-enhance-debuging branch November 26, 2025 02:18
