Accelerating Federated Learning by moving gradient aggregation directly into programmable network switches using P4.
This project demonstrates the concept of In-Network Intelligence, where the gradient aggregation phase of Federated Learning (FL) is executed directly inside a programmable switch instead of a centralized server.
By performing the aggregation within the network data plane, we reduce network congestion, minimize server workload, and enable ultra-low latency distributed training, following the principles of the SwitchML architecture.
## Table of Contents

- Overview
- The Problem
- The Solution
- Tech Stack
- Handling P4 Limitations
- Project Architecture
- Network Topology
- How to Run
- Expected Output
- Key Advantages
- Future Improvements
## Overview

Federated Learning (FL) allows multiple distributed clients to collaboratively train a machine learning model without sharing their raw data.
In traditional FL systems:
- Each worker trains locally.
- Workers send gradients to a Parameter Server.
- The server aggregates gradients.
- The updated model is broadcast back to workers.
However, this architecture creates a severe network bottleneck when model sizes become large.
This project solves that problem by moving gradient aggregation into the network switch itself using P4.
## The Problem

In traditional Federated Learning:
- Each worker sends large gradient vectors to the server.
- The server must wait for N packets from N workers.
- It then sums the gradients and redistributes the result.
- ❌ Heavy network congestion
- ❌ Server CPU overhead
- ❌ High latency
- ❌ NIC bandwidth saturation
For modern neural networks containing millions of parameters, this approach becomes inefficient.
## The Solution

Using programmable switches with P4, we intercept worker packets and perform gradient aggregation inside the network switch.
- Workers compute gradients.
- Gradients are quantized and packed into custom packets.
- Packets pass through the P4 programmable switch.
- The switch:
  - Reads the gradient values
  - Aggregates them using stateful registers
  - Drops intermediate packets
- After receiving contributions from all workers, the switch sends only ONE aggregated packet to the server.
**Traditional FL**

```
Workers → Switch → Server      (N packets received)
```

**In-Network Aggregation**

```
Workers → P4 Switch (Aggregation) → Server      (1 packet received)
```
This dramatically reduces network load and improves efficiency.
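The reduction is easy to quantify with a back-of-envelope calculation in Python. The worker count and model size below are illustrative numbers, not taken from the project; the 4 weights per packet matches the `fl_update` header described later.

```python
# Packets the server must receive per training round.
# 3 workers and 1,000,000 parameters are illustrative numbers;
# 4 weights per packet matches the fl_update header used in this project.
workers = 3
params = 1_000_000
weights_per_packet = 4
chunks = params // weights_per_packet

traditional = workers * chunks  # every worker's packet reaches the server
in_network = chunks             # the switch forwards one aggregated packet per chunk

print(traditional, in_network)  # server-side packet count drops by a factor of `workers`
```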
## Tech Stack

| Technology | Purpose |
|---|---|
| P4 (p4-16) | Data plane programming language |
| BMv2 simple_switch | Software switch architecture |
| Mininet | Virtual network topology |
| Scapy (Python) | Custom packet creation & sniffing |
| Docker | Containerized networking environment |
## Handling P4 Limitations

### No Floating-Point Support

P4 switches support only integer ALU operations, but neural network weights are floating point.
Workers quantize gradients:
```
quantized_weight = float_weight * 1000
```
- Converted to 32-bit signed integers
- Sent inside the packet
- Server dequantizes after aggregation
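The quantization round trip above can be sketched as follows. This is a minimal illustration: the scale factor of 1000 comes from the formula above, while the function names are mine.

```python
# Workers scale floats to 32-bit signed ints for the switch's integer ALU;
# the server divides the aggregated sums by the same factor.
SCALE = 1000

def quantize(gradients):
    """Worker side: float gradients -> signed integers."""
    return [int(round(g * SCALE)) for g in gradients]

def dequantize(sums):
    """Server side: aggregated integer sums -> float gradients."""
    return [s / SCALE for s in sums]

print(quantize([0.123, -0.456, 0.789]))
```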
### Stateful Aggregation

Switches must maintain running sums across multiple packets.
P4 register arrays store partial sums:
```
registers[param_index] += incoming_weight
```
This allows the switch to maintain persistent aggregation state.
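The stateful behavior can be modeled in Python as follows. This is a host-side sketch of the register logic, not the P4 code itself; the worker count of 3 matches the topology used in this project.

```python
# Model of the switch's stateful registers: per-chunk running sums plus a
# contribution counter that decides when to emit the aggregated packet.
NUM_WORKERS = 3

sums = {}    # param_index -> running sums for w1..w4 (register array)
counts = {}  # param_index -> contributions seen so far

def on_packet(param_index, weights):
    """Aggregate one fl_update packet; return the sums once all workers arrive."""
    acc = sums.setdefault(param_index, [0, 0, 0, 0])
    for i, w in enumerate(weights):
        acc[i] += w
    counts[param_index] = counts.get(param_index, 0) + 1
    if counts[param_index] == NUM_WORKERS:
        return acc   # forward ONE aggregated packet to the server
    return None      # drop the intermediate packet
```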
## Project Architecture

### Custom Packet Header

The custom header `fl_update_t` is encapsulated inside:

```
Ethernet → IPv4 → UDP → FL Header
```
It contains:
| Field | Description |
|---|---|
| worker_id | ID of the worker sending gradients |
| param_index | Index of parameter chunk |
| w1, w2, w3, w4 | Quantized gradient values |
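The project builds and sniffs this header with Scapy; as a dependency-free sketch, the same byte layout can be expressed with the stdlib `struct` module. The field widths (8-bit ids and four 32-bit signed big-endian weights) are my assumption, not taken from the P4 source.

```python
# Byte layout of fl_update_t as described in the table above.
# "!BB4i" = network byte order, two unsigned bytes, four signed 32-bit ints.
import struct

FL_FMT = "!BB4i"  # worker_id, param_index, w1..w4

def pack_fl_update(worker_id, param_index, w1, w2, w3, w4):
    """Build the raw payload a worker would place after the UDP header."""
    return struct.pack(FL_FMT, worker_id, param_index, w1, w2, w3, w4)

def unpack_fl_update(payload):
    """Parse a payload back into its six fields."""
    return struct.unpack(FL_FMT, payload)
```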
### Workers

- Generate random gradient vectors
- Quantize weights
- Send packets using Scapy
### P4 Switch

Responsibilities:
- Parse custom packet header
- Read current weight sums
- Perform integer addition
- Maintain aggregation state
- Forward only the final aggregated packet
### Parameter Server

- Listens on UDP port 5555
- Receives aggregated packet
- Dequantizes weights
- Uses them for model update
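The server's receive path can be sketched like this, assuming an 8-bit id / 32-bit signed weight payload layout and the scale factor of 1000; the actual project parses packets with Scapy.

```python
# Minimal Parameter Server sketch: bind UDP 5555 and dequantize each
# aggregated chunk. The "!BB4i" layout (two ids + four signed 32-bit weights)
# is an assumption; the real server parses the header with Scapy.
import socket
import struct

SCALE = 1000
FL_FMT = "!BB4i"

def handle_datagram(data):
    """Dequantize one aggregated fl_update payload into float gradients."""
    _, param_index, *weights = struct.unpack(FL_FMT, data[:struct.calcsize(FL_FMT)])
    return param_index, [w / SCALE for w in weights]

def serve(port=5555):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", port))
    while True:
        data, _ = sock.recvfrom(2048)
        idx, grads = handle_datagram(data)
        print(f"Chunk {idx} -> {grads}")
```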
## Network Topology

```
h1 (Worker)
           \
h2 (Worker) ---- s1 (P4 Switch) ---- h4 (Server)
           /
h3 (Worker)
```
Workers send gradients → switch aggregates → server receives a single packet.
## How to Run

The entire environment is fully containerized, so there is no need to install P4, Mininet, or BMv2 locally.
1. Build and start the container:

   ```bash
   docker-compose up -d --build
   ```

2. Compile the P4 program and start Mininet:

   ```bash
   docker exec -it p4_fl_aggregator_container python3 run_mininet.py
   ```

3. Open a new terminal and start the Parameter Server:

   ```bash
   docker exec -it p4_fl_aggregator_container mx h4 python3 client/server.py
   ```

4. Open another terminal and start Worker 1:

   ```bash
   docker exec -it p4_fl_aggregator_container mx h1 python3 client/worker.py --id 1
   ```

5. Open another terminal and start Workers 2 and 3:

   ```bash
   docker exec -it p4_fl_aggregator_container /bin/bash -c "mx h2 python3 client/worker.py --id 2 & mx h3 python3 client/worker.py --id 3"
   ```

6. Press Enter in the Worker 1 terminal, then immediately press Enter in the Worker 2/3 terminal.
This simulates concurrent gradient updates.
## Expected Output

On the Parameter Server terminal, you will observe:
- Multiple worker packets entering the switch
- Switch aggregating gradients
- Single aggregated packet received per parameter chunk
Example:

```
Received aggregated gradients:
Chunk 1 → [10234, -3422, 8765, 2231]
Chunk 2 → [4521, 2234, -1234, 7642]
```
This proves the aggregation occurred inside the P4 switch.
## Key Advantages

- ✅ Reduces network congestion
- ✅ Minimizes server computation load
- ✅ Enables line-rate aggregation
- ✅ Demonstrates in-network computing
- ✅ Inspired by the SwitchML architecture
## Future Improvements

- Support more workers
- Implement vectorized gradient aggregation
- Integrate with real ML frameworks
- Deploy on hardware programmable switches
- Add secure aggregation mechanisms
## Acknowledgment

This project is inspired by the research system:

**SwitchML: Scaling Distributed Machine Learning with In-Network Aggregation**

## Author

Sadvik Kumar

B.Tech CSE (AI & ML)

⭐ If you found this project interesting, consider starring the repository!