Skip to content

Conversation

@zhijun42
Copy link
Contributor

@zhijun42 zhijun42 commented Nov 7, 2025

Make a couple of small changes to enhance debugging.

[1] Add human node names in cluster tests so that we can easily understand which nodes we are interacting with:

pong packet received from: 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) from client: :0
node 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) announces that it is a primary in shard c6d1152caee49a5e70cb4b77d1549386078be603
Reconfiguring node 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) as primary for shard c6d1152caee49a5e70cb4b77d1549386078be603
Configuration change detected. Reconfiguring myself as a replica of node 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) in shard c6d1152caee49a5e70cb4b77d1549386078be603

[2] Currently there are logs showing the address of incoming connections:

Accepting cluster node connection from 127.0.0.1:59956
Accepting cluster node connection from 127.0.0.1:59957
Accepting cluster node connection from 127.0.0.1:59958
Accepting cluster node connection from 127.0.0.1:59959

but we have no idea which nodes these connections refer to. I added a logging statement when the node is set to the inbound link connection.

Bound cluster node 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) to connection of client 127.0.0.1:59956

[3] Add a debug log when processing a packet to show the packet type, sender node name, and sender client port (this also has the benefit of telling us whether this is an inbound or outbound link).

pong packet received from: 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) from client: :0
ping packet received from: 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) from client: 127.0.0.1:59973
fail packet received from: 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) from client: 127.0.0.1:59973
auth-req packet received from: 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) from client: 127.0.0.1:59973

@codecov
Copy link

codecov bot commented Nov 7, 2025

Codecov Report

❌ Patch coverage is 58.33333% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.42%. Comparing base (dd2827a) to head (f030cd8).
⚠️ Report is 2 commits behind head on unstable.

Files with missing lines Patch % Lines
src/cluster_legacy.c 58.33% 5 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2815      +/-   ##
============================================
- Coverage     72.43%   72.42%   -0.02%     
============================================
  Files           128      128              
  Lines         70428    70439      +11     
============================================
- Hits          51017    51016       -1     
- Misses        19411    19423      +12     
Files with missing lines Coverage Δ
src/cluster_legacy.c 87.35% <58.33%> (+0.23%) ⬆️

... and 14 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Signed-off-by: Zhijun <dszhijun@gmail.com>
Signed-off-by: Zhijun <dszhijun@gmail.com>
Signed-off-by: Zhijun <dszhijun@gmail.com>
Signed-off-by: Zhijun <dszhijun@gmail.com>
…heck

Signed-off-by: Zhijun <dszhijun@gmail.com>
Signed-off-by: Zhijun <dszhijun@gmail.com>
@zhijun42 zhijun42 force-pushed the cluster-enhance-debuging branch from 496bf37 to 018365b Compare November 18, 2025 03:41
Copy link
Member

@JimB123 JimB123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about the .gitignore file, but otherwise looks ok.

Copy link
Collaborator

@hpatro hpatro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[3] Add a debug log when processing a packet to show the packet type, sender node name, and sender client port (this also has the benefit of telling us whether this is an inbound or outbound link).

Not really sure about the benefit of ip/port addition to the nodename / (human nodename seems valuable) and wouldn't it be always the inbound link from another node as this is under clusterProcessPacket.

Signed-off-by: Zhijun <dszhijun@gmail.com>
@zhijun42
Copy link
Contributor Author

@hpatro

Not really sure about the benefit of ip/port addition to the nodename / (human nodename seems valuable)

Node A could fire up multiple connections to the other node B. For example, when I was investigating a tricky scenario in #2811, I needed to look at which connection node A was using to send packets to node B and why that connection was created/closed, which helped me figure out the ordering of packets. A single TCP/TLS connection can guarantee the delivery ordering of packets, but with multiple connections it's not clear which packets would be received first, and I needed to know that.

and wouldn't it be always the inbound link from another node as this is under clusterProcessPacket.

No, it can be either inbound or outbound.

  • Say node A sends a PING to node B via node A's outbound link, node B will reply with a PONG via the same connection. So node A will run clusterProcessPacket with link being outbound.
  • Say node B performs PONG broadcast to send PONGs to all other nodes, meaning node B proactively sends packets out (in contrast to passively replying messages). Then node A will run clusterProcessPacket with link being inbound.

If you turn on verbose/debug logging level and look at the logs carefully you will clearly see this pattern.

@zhijun42
Copy link
Contributor Author

@hpatro Could you approve if this looks good to you? ☺️

@zhijun42
Copy link
Contributor Author

Looks like the known flakiness still exists

*** [err]: Primaries will not time out then they are elected in the same epoch in tests/unit/cluster/failover2.tcl
expected message found in log file: *Failover attempt expired*

@hpatro hpatro merged commit da3c43d into valkey-io:unstable Nov 26, 2025
55 of 56 checks passed
@zhijun42 zhijun42 deleted the cluster-enhance-debuging branch November 26, 2025 02:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants