Accelerating Federated Learning by moving gradient aggregation directly into programmable network switches using P4.
This project demonstrates the concept of In-Network Intelligence, where the gradient aggregation phase of Federated Learning (FL) is executed directly inside a programmable switch instead of a centralized server.
By performing the aggregation within the network data plane, we reduce network congestion, minimize server workload, and enable ultra-low latency distributed training, following the principles of the SwitchML architecture.
## Table of Contents

- Overview
- The Problem
- The Solution
- Tech Stack
- Handling P4 Limitations
- Project Architecture
- Network Topology
- How to Run
- Expected Output
- Key Advantages
- Future Improvements
## Overview

Federated Learning (FL) allows multiple distributed clients to collaboratively train a machine learning model without sharing their raw data.
In traditional FL systems:
- Each worker trains locally.
- Workers send gradients to a Parameter Server.
- The server aggregates gradients.
- The updated model is broadcast back to workers.
However, this architecture creates a severe network bottleneck when model sizes become large.
This project solves that problem by moving gradient aggregation into the network switch itself using P4.
## The Problem

In traditional Federated Learning:
- Each worker sends large gradient vectors to the server.
- The server must wait for N packets from N workers.
- It then sums the gradients and redistributes the result.
- ❌ Heavy network congestion
- ❌ Server CPU overhead
- ❌ High latency
- ❌ NIC bandwidth saturation
For modern neural networks containing millions of parameters, this approach becomes inefficient.
## The Solution

Using programmable switches with P4, we intercept worker packets and perform gradient aggregation inside the network switch.
- Workers compute gradients.
- Gradients are quantized and packed into custom packets.
- Packets pass through the P4 programmable switch.
- The switch:
  - Reads the gradient values
  - Aggregates them using stateful registers
  - Drops intermediate packets
- After receiving contributions from all workers, the switch sends only ONE aggregated packet to the server.
**Traditional FL**

```
Workers → Switch → Server      (N packets received)
```

**In-Network Aggregation**

```
Workers → P4 Switch (Aggregation) → Server      (1 packet received)
```
This dramatically reduces network load and improves efficiency.
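The reduction is easy to quantify with a back-of-envelope calculation in Python. The worker count and model size below are illustrative numbers, not taken from the project; the 4 weights per packet matches the `fl_update` header described later.

```python
# Packets the server must receive per training round.
# 3 workers and 1,000,000 parameters are illustrative numbers;
# 4 weights per packet matches the fl_update header used in this project.
workers = 3
params = 1_000_000
weights_per_packet = 4
chunks = params // weights_per_packet

traditional = workers * chunks  # every worker's packet reaches the server
in_network = chunks             # the switch forwards one aggregated packet per chunk

print(traditional, in_network)  # server-side packet count drops by a factor of `workers`
```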
## Tech Stack

| Technology | Purpose |
|---|---|
| P4 (p4-16) | Data plane programming language |
| BMv2 simple_switch | Software switch architecture |
| Mininet | Virtual network topology |
| Scapy (Python) | Custom packet creation & sniffing |
| Docker | Containerized networking environment |
## Handling P4 Limitations

### No Floating-Point Support

P4 switches support only integer ALU operations, but neural network weights are floating point.
Workers quantize gradients:
```
quantized_weight = float_weight * 1000
```
- Converted to 32-bit signed integers
- Sent inside the packet
- Server dequantizes after aggregation
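The quantization round trip above can be sketched as follows. This is a minimal illustration: the scale factor of 1000 comes from the formula above, while the function names are mine.

```python
# Workers scale floats to 32-bit signed ints for the switch's integer ALU;
# the server divides the aggregated sums by the same factor.
SCALE = 1000

def quantize(gradients):
    """Worker side: float gradients -> signed integers."""
    return [int(round(g * SCALE)) for g in gradients]

def dequantize(sums):
    """Server side: aggregated integer sums -> float gradients."""
    return [s / SCALE for s in sums]

print(quantize([0.123, -0.456, 0.789]))
```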
### Stateful Aggregation

Switches must maintain running sums across multiple packets.
P4 register arrays store partial sums:
```
registers[param_index] += incoming_weight
```
This allows the switch to maintain persistent aggregation state.
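The stateful behavior can be modeled in Python as follows. This is a host-side sketch of the register logic, not the P4 code itself; the worker count of 3 matches the topology used in this project.

```python
# Model of the switch's stateful registers: per-chunk running sums plus a
# contribution counter that decides when to emit the aggregated packet.
NUM_WORKERS = 3

sums = {}    # param_index -> running sums for w1..w4 (register array)
counts = {}  # param_index -> contributions seen so far

def on_packet(param_index, weights):
    """Aggregate one fl_update packet; return the sums once all workers arrive."""
    acc = sums.setdefault(param_index, [0, 0, 0, 0])
    for i, w in enumerate(weights):
        acc[i] += w
    counts[param_index] = counts.get(param_index, 0) + 1
    if counts[param_index] == NUM_WORKERS:
        return acc   # forward ONE aggregated packet to the server
    return None      # drop the intermediate packet
```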
## Project Architecture

### Custom Packet Header

The custom header `fl_update_t` is encapsulated inside:

```
Ethernet → IPv4 → UDP → FL Header
```
It contains:
| Field | Description |
|---|---|
| worker_id | ID of the worker sending gradients |
| param_index | Index of parameter chunk |
| w1, w2, w3, w4 | Quantized gradient values |
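The project builds and sniffs this header with Scapy; as a dependency-free sketch, the same byte layout can be expressed with the stdlib `struct` module. The field widths (8-bit ids and four 32-bit signed big-endian weights) are my assumption, not taken from the P4 source.

```python
# Byte layout of fl_update_t as described in the table above.
# "!BB4i" = network byte order, two unsigned bytes, four signed 32-bit ints.
import struct

FL_FMT = "!BB4i"  # worker_id, param_index, w1..w4

def pack_fl_update(worker_id, param_index, w1, w2, w3, w4):
    """Build the raw payload a worker would place after the UDP header."""
    return struct.pack(FL_FMT, worker_id, param_index, w1, w2, w3, w4)

def unpack_fl_update(payload):
    """Parse a payload back into its six fields."""
    return struct.unpack(FL_FMT, payload)
```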
### Workers

- Generate random gradient vectors
- Quantize weights
- Send packets using Scapy
### P4 Switch

Responsibilities:
- Parse custom packet header
- Read current weight sums
- Perform integer addition
- Maintain aggregation state
- Forward only the final aggregated packet
### Parameter Server

- Listens on UDP port 5555
- Receives aggregated packet
- Dequantizes weights
- Uses them for model update
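The server's receive path can be sketched like this, assuming an 8-bit id / 32-bit signed weight payload layout and the scale factor of 1000; the actual project parses packets with Scapy.

```python
# Minimal Parameter Server sketch: bind UDP 5555 and dequantize each
# aggregated chunk. The "!BB4i" layout (two ids + four signed 32-bit weights)
# is an assumption; the real server parses the header with Scapy.
import socket
import struct

SCALE = 1000
FL_FMT = "!BB4i"

def handle_datagram(data):
    """Dequantize one aggregated fl_update payload into float gradients."""
    _, param_index, *weights = struct.unpack(FL_FMT, data[:struct.calcsize(FL_FMT)])
    return param_index, [w / SCALE for w in weights]

def serve(port=5555):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", port))
    while True:
        data, _ = sock.recvfrom(2048)
        idx, grads = handle_datagram(data)
        print(f"Chunk {idx} -> {grads}")
```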
## Network Topology

```
h1 (Worker)
           \
h2 (Worker) ---- s1 (P4 Switch) ---- h4 (Server)
           /
h3 (Worker)
```
Workers send gradients → switch aggregates → server receives a single packet.
## How to Run

The entire environment is fully containerized, so there is no need to install P4, Mininet, or BMv2 locally.
1. Build and start the container:

   ```bash
   docker-compose up -d --build
   ```

2. Compile the P4 program and start Mininet:

   ```bash
   docker exec -it p4_fl_aggregator_container python3 run_mininet.py
   ```

3. Open a new terminal and start the Parameter Server:

   ```bash
   docker exec -it p4_fl_aggregator_container mx h4 python3 client/server.py
   ```

4. Open another terminal and start Worker 1:

   ```bash
   docker exec -it p4_fl_aggregator_container mx h1 python3 client/worker.py --id 1
   ```

5. Open another terminal and start Workers 2 and 3:

   ```bash
   docker exec -it p4_fl_aggregator_container /bin/bash -c "mx h2 python3 client/worker.py --id 2 & mx h3 python3 client/worker.py --id 3"
   ```

6. Press Enter in the Worker 1 terminal, then immediately press Enter in the Worker 2/3 terminal.
This simulates concurrent gradient updates.
## Expected Output

On the Parameter Server terminal, you will observe:
- Multiple worker packets entering the switch
- Switch aggregating gradients
- Single aggregated packet received per parameter chunk
Example:

```
Received aggregated gradients:
Chunk 1 → [10234, -3422, 8765, 2231]
Chunk 2 → [4521, 2234, -1234, 7642]
```
This proves the aggregation occurred inside the P4 switch.
## Key Advantages

- ✅ Reduces network congestion
- ✅ Minimizes server computation load
- ✅ Enables line-rate aggregation
- ✅ Demonstrates in-network computing
- ✅ Inspired by the SwitchML architecture
## Future Improvements

- Support more workers
- Implement vectorized gradient aggregation
- Integrate with real ML frameworks
- Deploy on hardware programmable switches
- Add secure aggregation mechanisms
## Acknowledgment

This project is inspired by the research system:

**SwitchML: Scaling Distributed Machine Learning with In-Network Aggregation**

## Author

Sadvik Kumar

B.Tech CSE (AI & ML)

⭐ If you found this project interesting, consider starring the repository!