router: poor performance over real NICs, despite decent performance over veth #4593

Open
Tracked by #4413
jiceatscion opened this issue Aug 6, 2024 · 3 comments
Labels
c/router SCION Router workitem Something needs doing

Comments

@jiceatscion
Contributor

The router code itself can demonstrably forward 800K small packets per second and 10Gb/s of traffic in larger (2K) packets when benchmarked over veth. However, the observed performance is less than half of that when using real NICs, including 10GigE NICs.

Since the processing code isn't the bottleneck, it has to be either the effect of the real NICs' activity on the overall system (e.g. the interrupt processing overhead), or the impact of the API used by the router (regular UDP socket) on real I/O versus virtual I/O.
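
To put the two veth figures on the same scale, the bulk throughput can be converted into a packet rate. This is a minimal sketch, assuming "2K" means 2048-byte packets and ignoring header overhead; `packetsPerSecond` is a hypothetical helper, not router code.

```go
package main

import "fmt"

// packetsPerSecond converts a throughput in bits per second into a packet
// rate, given a fixed packet size in bytes.
func packetsPerSecond(bitsPerSecond float64, packetBytes int) float64 {
	return bitsPerSecond / float64(packetBytes*8)
}

func main() {
	// Assumption: "2K packets" means 2048 bytes per packet.
	pps := packetsPerSecond(10e9, 2048)
	fmt.Printf("10 Gb/s in 2048-byte packets = %.0f packets/s\n", pps)
}
```

Under those assumptions the bulk benchmark corresponds to roughly 610K packets per second, which is the same order of magnitude as the 800K small packets per second.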

Creating this work item to track investigation and resolution.

@jiceatscion jiceatscion added the workitem Something needs doing label Aug 6, 2024
@jiceatscion jiceatscion added the c/router SCION Router label Aug 6, 2024
@jiceatscion
Contributor Author

jiceatscion commented Oct 1, 2024

The ultimate performance would come from using some XDP-based approach. The DPDK framework seems less work than doing it from scratch (https://doc.dpdk.org/guides/index.html). Using it from Go code isn't trivial either, but it is apparently feasible (https://pkg.go.dev/github.com/yerden/go-dpdk, https://pkg.go.dev/github.com/millken/dpdk-go).

This article makes a series of less involved suggestions that do not require XDP/DPDK: https://medium.com/@pavel.odintsov/capturing-packets-in-linux-at-a-speed-of-millions-of-packets-per-second-without-using-third-party-ef782fe8959d Maybe that's a worthy first step. There is a significant drawback: the ring API is meant for traffic sniffing, so packets whose destination matches the local host are duplicated: one copy goes to the sniffer's ring and the other to the kernel's network stack. To prevent that we'd need to play games; possibly games that nullify the benefits, but maybe not. One way is to filter the traffic with an ingress qdisc: https://tldp.org/HOWTO/Adv-Routing-HOWTO/lartc.adv-qdisc.ingress.html (here's a tutorial: https://www.dasblinkenlichten.com/working-with-tc-on-linux-systems/)

If my understanding is correct, the following sequence would suppress all traffic from eth0 to the kernel stack, leaving only the ring's copy:

$ tc qdisc add dev eth0 ingress
$ tc filter add dev eth0  parent ffff: matchall action drop

... most likely the packet still gets copied before being dropped. Too bad.

@jiceatscion
Contributor Author

jiceatscion commented Oct 18, 2024

I did a little more research to clarify the (fast-changing) state of affairs with the Linux kernel and high-performance networking. There seem to be, at the moment, three main avenues:

  1. The regular AF_PACKET API, whose most advanced feature set stopped at the so-called V3.
  2. The AF_XDP API, which replaced AF_PACKET V4 and adds various improvements by leaning on eBPF.
  3. DPDK, which is a framework that uses any of the above PLUS has the ability to bypass even more of the kernel by leaning on more eBPF trickery.

The transition from AF_PACKET v4 to AF_XDP is originally described by its author (Björn Töpel) here: https://lwn.net/Articles/745934/ or, in a more polished version, here: https://www.kernel.org/doc/html/v4.18/networking/af_xdp.html
Strangely enough, the article by Pavel Odintsov linked above contradicts Björn Töpel's results by showing much better performance with AF_PACKET v3 than Björn Töpel mentions (so at least one of them was mistaken).

The AF_XDP API seems reasonably accessible. It has several levels of acceleration available depending on the driver's capabilities and will make the best of them. It does require loading an eBPF program, it seems. However, that program is a canned, well-known one that some libraries (including Go ones: https://pkg.go.dev/github.com/liheng562653799/xdp) can deal with for you (or maybe that's been automated away by now; that was on the to-do list).

The DPDK framework will also use AF_XDP behind the scenes, unless it has a user-mode driver for the device that's being used, in which case it will switch to a very different approach and practically map the device into user space. Using the framework seems a lot more involved, and so is only worth it if we want the performance of the user-mode drivers. There are Go libraries that can expose the DPDK framework, though, as mentioned previously.

I found that this doc provided some interesting insight into performance expectations: https://fast.dpdk.org/doc/perf/DPDK_23_03_Intel_NIC_performance_report.pdf

The relative performance of the various approaches is the subject of contradictory reports, but roughly:

  • AF_PACKET V3: 750 kpps to 9 Mpps, depending on whom you ask. Numbers to be verified independently.
  • AF_XDP: 3 Mpps to 20 Mpps, depending on the level of support in the driver.
  • DPDK user-mode driver: 36 Mpps to 128 Mpps (absolute best, reported by Intel for one of their own flagship NICs).
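
For context, these figures can be compared with what a 10GigE link can carry at all. The sketch below assumes standard Ethernet framing, where each frame costs an extra 20 bytes on the wire (preamble, start-of-frame delimiter, and inter-frame gap). The helper names are hypothetical and the per-packet budget is simply the inverse of the packet rate.

```go
package main

import "fmt"

// Per-frame overhead on the wire that is not part of the Ethernet frame:
// 7-byte preamble + 1-byte start-of-frame delimiter + 12-byte inter-frame gap.
const wireOverheadBytes = 20

// lineRatePPS returns the maximum number of frames per second that fit on a
// link of the given speed, for frames of frameBytes (header and FCS included).
func lineRatePPS(linkBitsPerSecond float64, frameBytes int) float64 {
	return linkBitsPerSecond / float64((frameBytes+wireOverheadBytes)*8)
}

// budgetNanos returns the average time available per packet, in nanoseconds,
// when sustaining the given packet rate.
func budgetNanos(pps float64) float64 {
	return 1e9 / pps
}

func main() {
	// 64 and 1518 bytes are the minimum and maximum standard frame sizes.
	for _, frame := range []int{64, 1518} {
		pps := lineRatePPS(10e9, frame)
		fmt.Printf("10GbE, %4d-byte frames: %.2f Mpps, %.1f ns per packet\n",
			frame, pps/1e6, budgetNanos(pps))
	}
}
```

With minimum-size frames a 10GigE link tops out at about 14.88 Mpps, so only the upper end of the AF_XDP range and the DPDK user-mode drivers would saturate such a link with small packets.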

@jiceatscion
Contributor Author

jiceatscion commented Oct 25, 2024

In order to preserve portability to non-Linux platforms, we'll need to preserve the plain IP/UDP socket code. Likewise, it is possible that the AF_XDP code isn't as portable as the AF_PACKET-with-rings code, so we probably want to keep both even if AF_XDP is much better. Action plan:

  • Address generic I/O efficiency issues (e.g. interrupt load-balancing).
  • Refactor code to enable multiple packet I/O implementations. Cut just high enough to move the plain-socket related code below that interface.
  • Add a second implementation that uses AF_PACKET without attempting to use RX/TX rings.
  • Improve the second implementation to use RX/TX rings.
  • Add a third implementation that uses AF_XDP.
  • Add checksum offloading to AF_PACKET and AF_XDP implementations if possible.
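
The refactoring step in the plan above could look roughly like the following. This is a minimal sketch and not the router's actual design: `PacketIO`, `udpSocketIO`, and `loopbackEcho` are hypothetical names, and a real interface may well need batched reads and writes. It only illustrates hiding the plain UDP socket behind an interface so that AF_PACKET or AF_XDP backends could be added next to it.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// PacketIO is a hypothetical abstraction over one way of moving packets in
// and out of the router. The name and method set are illustrative only.
type PacketIO interface {
	// ReadPacket copies the next received packet into buf and returns its length.
	ReadPacket(buf []byte) (int, error)
	// WritePacket sends pkt to dst and returns the number of bytes written.
	WritePacket(pkt []byte, dst *net.UDPAddr) (int, error)
	Close() error
}

// udpSocketIO is the portable implementation: a regular UDP socket.
type udpSocketIO struct {
	conn *net.UDPConn
}

// newUDPSocketIO binds a UDP socket on the given local address.
func newUDPSocketIO(local *net.UDPAddr) (*udpSocketIO, error) {
	conn, err := net.ListenUDP("udp", local)
	if err != nil {
		return nil, err
	}
	return &udpSocketIO{conn: conn}, nil
}

func (u *udpSocketIO) ReadPacket(buf []byte) (int, error) {
	n, _, err := u.conn.ReadFromUDP(buf)
	return n, err
}

func (u *udpSocketIO) WritePacket(pkt []byte, dst *net.UDPAddr) (int, error) {
	return u.conn.WriteToUDP(pkt, dst)
}

func (u *udpSocketIO) Close() error { return u.conn.Close() }

// localAddr returns the address the socket is actually bound to.
func (u *udpSocketIO) localAddr() *net.UDPAddr {
	return u.conn.LocalAddr().(*net.UDPAddr)
}

// Compile-time check that udpSocketIO satisfies PacketIO.
var _ PacketIO = (*udpSocketIO)(nil)

// loopbackEcho sends msg between two PacketIO instances over the loopback
// interface and returns what the receiver read.
func loopbackEcho(msg []byte) ([]byte, error) {
	lo := &net.UDPAddr{IP: net.IPv4(127, 0, 0, 1), Port: 0} // port 0: any free port
	rx, err := newUDPSocketIO(lo)
	if err != nil {
		return nil, err
	}
	defer rx.Close()
	tx, err := newUDPSocketIO(lo)
	if err != nil {
		return nil, err
	}
	defer tx.Close()

	// Do not block forever if the packet never arrives.
	if err := rx.conn.SetReadDeadline(time.Now().Add(2 * time.Second)); err != nil {
		return nil, err
	}

	// From here on, only the interface is used.
	var in, out PacketIO = rx, tx
	if _, err := out.WritePacket(msg, rx.localAddr()); err != nil {
		return nil, err
	}
	buf := make([]byte, 2048)
	n, err := in.ReadPacket(buf)
	if err != nil {
		return nil, err
	}
	return buf[:n], nil
}

func main() {
	got, err := loopbackEcho([]byte("ping"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("received %q through the PacketIO interface\n", got)
}
```

Code above the interface never touches `net.UDPConn` directly, so a second implementation only has to provide the same three methods.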

@jiceatscion jiceatscion added this to the High performance router milestone Oct 25, 2024
@jiceatscion jiceatscion changed the title from "router: poor performance over real NICs, despite descent performance over veth" to "router: poor performance over real NICs, despite decent performance over veth" Nov 11, 2024