performance #58

Open
bburas opened this issue May 23, 2023 · 5 comments

Comments

@bburas

bburas commented May 23, 2023

I am using the RDMA benchmark code to run a send/recv latency test between two directly connected 100 Gb/s Mellanox cards.
I am seeing latencies of 100-700 us for a small string (12 bytes),
whereas qperf rc_lat reports about 4 us.
Is this what I should expect?

    sendBuf.asCharBuffer().put(msg);
    sendBuf.clear();
    postSend.getWrMod(0).getSgeMod(0).setLength(msg.length());
    postSend.execute();
    endpoint.getWcEvents().take();
    postRecv.execute();
    endpoint.getWcEvents().take();
    recvBuf.clear();
@PepperJo
Contributor

You should expect latency slightly higher than native, with maybe 5-10 us of overhead at most.
The code snippet you posted is problematic: you should post the receive request before the send is issued. In the worst case, the send from the other side is issued before the receive is posted, and you will run into retries or, on some transports, your connection will be aborted.
I recommend running the benchmarks provided here: https://github.com/zrlio/disni/tree/master/src/test/java/com/ibm/disni/benchmarks
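
The ordering problem can be illustrated without any RDMA hardware. The sketch below is a toy model, not DiSNI code: `ToyReceiveQueue`, `postRecv`, `deliver`, and `receiverNotReadyCount` are hypothetical names made up for this example. It assumes that an incoming message consumes one previously posted receive buffer, and that an arrival with no buffer posted counts as a receiver-not-ready event, which is the situation that leads to the retries or aborts described above.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model (hypothetical names, not the DiSNI API): an incoming message
// needs a receive buffer that was posted beforehand.
final class ToyReceiveQueue {
    private final Deque<byte[]> postedBuffers = new ArrayDeque<>();
    private int receiverNotReady = 0;

    // Make one receive buffer available for a future incoming message.
    void postRecv(int size) {
        postedBuffers.addLast(new byte[size]);
    }

    // Simulate a message arriving from the peer.
    // Returns true if a posted buffer was available to consume.
    boolean deliver(byte[] msg) {
        byte[] buf = postedBuffers.pollFirst();
        if (buf == null) {
            receiverNotReady++; // real hardware would retry or abort here
            return false;
        }
        System.arraycopy(msg, 0, buf, 0, Math.min(msg.length, buf.length));
        return true;
    }

    int receiverNotReadyCount() {
        return receiverNotReady;
    }

    public static void main(String[] args) {
        byte[] reply = "hello world!".getBytes();

        // Wrong order: the reply arrives before a receive is posted.
        ToyReceiveQueue late = new ToyReceiveQueue();
        System.out.println("recv posted late:  delivered=" + late.deliver(reply));

        // Right order: the receive is posted before the send goes out.
        ToyReceiveQueue early = new ToyReceiveQueue();
        early.postRecv(64);
        System.out.println("recv posted early: delivered=" + early.deliver(reply));
    }
}
```

In a ping-pong loop this translates to posting the receive for the reply first, then issuing the send, then waiting for both completions.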

@bburas
Author

bburas commented May 24, 2023

I refactored the original benchmark code you referenced into two separate steps:

  1. init and connect
  2. send msg to server and recv response from server

I still use the CustomClientEndpoint, though. CustomClientEndpoint::init() pre-posts a recv request in its last lines:
    System.out.println("SimpleClient::initiated recv");
    this.postRecv(wrList_recv).execute().free();
As you suggested, I also ran the benchmark code directly with the default 1000 loops, and I see an average of about 40 us, which is not bad:
java com.ibm.disni.benchmarks.RDMAvsTcpBenchmarkClient -a 192.168.0.37 -p 20886 -s 1024

SimpleClient::initiated recv
RDMAvsTcpBenchmarkClient::client channel set up
RDMA result:
Total time: 38.721645 ms
Bidirectional bandwidth: 0.04925794430511669 Gbytes/s
Bidirectional bandwidth: 0.3940635544409335 Gbits/s
Bidirectional average latency: 0.038721645 ms
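
For reference, the printed figures can be reproduced with simple arithmetic. The sketch below is an inference from the numbers in this output, not a copy of the benchmark source: it assumes the byte count is loops * size * 2 (one message in each direction per loop) and that 1 Gbyte means 1024^3 bytes. The class and method names are hypothetical. Under those assumptions, 38.721645 ms over 1000 loops gives 0.038721645 ms (about 38.7 us) per round trip.

```java
// Reproduces the figures printed above under two assumptions that are
// inferred from the output, not confirmed from the benchmark source:
//   bytes moved = loops * size * 2, and 1 Gbyte = 1024^3 bytes.
final class BenchmarkArithmetic {

    // Average round-trip time per loop, in milliseconds.
    static double avgLatencyMs(double totalMs, int loops) {
        return totalMs / loops;
    }

    // Bidirectional bandwidth in Gbytes/s (binary gigabytes, by assumption).
    static double gbytesPerSec(double totalMs, int loops, int size) {
        double bytes = (double) loops * size * 2;
        double seconds = totalMs / 1000.0;
        return bytes / seconds / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        double totalMs = 38.721645; // "Total time" from the run above
        int loops = 1000;           // default loop count
        int size = 1024;            // value passed via -s

        double gbytes = gbytesPerSec(totalMs, loops, size);
        System.out.println("latency ms: " + avgLatencyMs(totalMs, loops));
        System.out.println("Gbytes/s:   " + gbytes);
        System.out.println("Gbits/s:    " + gbytes * 8);
    }
}
```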

@PepperJo
Contributor

I recommend running more than just 1000 loops. A total runtime of 38 ms is probably not enough to get stable, representative numbers. Keep in mind that in Java it takes a while until all code paths are JIT-compiled, so initially there is a lot more overhead.
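
A minimal sketch of the warm-up idea, using only the standard library: run the operation unmeasured for a while so the JIT can compile the hot path, then time a separate batch. `WarmupTimer`, `averageNanos`, and `work` are hypothetical names, and the workload is a stand-in for one send/recv round trip. The absolute numbers depend on the JVM and machine; for serious measurements a harness such as JMH is the safer choice.

```java
// Sketch only: separates unmeasured warm-up iterations from measured ones.
final class WarmupTimer {
    // Written to after each call so the JIT cannot discard the work.
    static volatile long sink;

    // Stand-in workload; in a real test this would be one round trip.
    static void work() {
        long sum = 0;
        for (int i = 0; i < 1_000; i++) {
            sum += i * 31L;
        }
        sink = sum;
    }

    // Runs op `warmup` times without timing, then `measured` times with
    // timing, and returns the average nanoseconds per measured call.
    static double averageNanos(Runnable op, int warmup, int measured) {
        for (int i = 0; i < warmup; i++) {
            op.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < measured; i++) {
            op.run();
        }
        long elapsed = System.nanoTime() - start;
        return (double) elapsed / measured;
    }

    public static void main(String[] args) {
        double cold = averageNanos(WarmupTimer::work, 0, 1_000);
        double warm = averageNanos(WarmupTimer::work, 200_000, 1_000);
        System.out.println("no warm-up:   " + cold + " ns/op");
        System.out.println("with warm-up: " + warm + " ns/op");
    }
}
```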

@ShiningChuang

ShiningChuang commented Jul 20, 2023

> I recommend running more than just 1000 loops. 38ms total runtime is probably not enough to get stable/good performance. You have to keep in mind that in Java it takes a while until all code paths are JIT compiled so initially there is a lot more overhead.

In fact, I have increased the loop count to 1,000,000 and the buffer size to 32 * 64 in the RDMAvsTcpBenchmarkClient and RDMAvsTcpBenchmarkServer tests, but the throughput and latency of DiSNI are not what I expected: they are close to those of TCP. Is this reasonable?

RDMA result:
Total time: 60238.736352 ms
Bidirectional bandwidth: 0.06332631619850285 Gb/s
Bidirectional average latency: 0.060238736352 ms
TCP result:
Total time: 63491.424003 ms
Bidirectional bandwidth: 0.060082087077535914 Gb/s
Bidirectional average latency: 0.063491424003 ms

@PepperJo
Contributor

I do see a difference when I run it:

RDMA result:
Total time: 2836.468479 ms
Bidirectional bandwidth: 0.13448756063631195 Gb/s
Bidirectional average latency: 0.02836468479 ms
TCP result:
Total time: 5699.473495 ms
Bidirectional bandwidth: 0.06693069577341723 Gb/s
Bidirectional average latency: 0.05699473495 ms

That said, this benchmark is not a good basis for comparison, as it only uses one outstanding posted receive for RDMA (it is more of a ping-pong test than a real benchmark). I recommend using SendRecvClient/Server if you are interested in send/recv numbers. While it doesn't let you set the number of preposted receives independently of the sends, it at least gives you an idea of what performance can look like with higher queue depths (QDs). If you want a "real" RDMA benchmark, i.e. one that uses one-sided operations such as RDMA read, use ReadClient/Server instead of send/recv. I see around 3 us read latency with that benchmark.
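
To see why queue depth matters, a rough upper bound can be sketched with Little's law: with QD operations in flight and a fixed per-operation latency, at most QD / latency operations complete per second. This is an idealised model, not a measurement. It assumes latency does not grow as more requests are outstanding, which real hardware will not honour indefinitely. `QueueDepthModel` and `maxOpsPerSec` are hypothetical names.

```java
// Idealised bound from Little's law: throughput <= queueDepth / latency.
// Assumes per-operation latency stays constant as queue depth increases.
final class QueueDepthModel {

    // Upper bound on completed operations per second.
    static double maxOpsPerSec(int queueDepth, double latencyMicros) {
        return queueDepth / (latencyMicros * 1e-6);
    }

    public static void main(String[] args) {
        double pingPongMicros = 28.36; // average latency from the RDMA run above
        for (int qd : new int[] {1, 8, 32}) {
            System.out.println("QD " + qd + ": at most "
                    + maxOpsPerSec(qd, pingPongMicros) + " ops/s");
        }
    }
}
```

With a single outstanding request, as in the ping-pong test, the bound is simply 1 / latency, so the link bandwidth is never the limiting factor.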
