
🧠 P4 In-Network Aggregation for Federated Learning

Accelerating Federated Learning by moving gradient aggregation directly into programmable network switches using P4.

This project demonstrates the concept of In-Network Intelligence, where the gradient aggregation phase of Federated Learning (FL) is executed directly inside a programmable switch instead of a centralized server.

By performing the aggregation within the network data plane, we reduce network congestion, minimize server workload, and enable ultra-low latency distributed training, following the principles of the SwitchML architecture.



๐ŸŒ Overview

Federated Learning (FL) allows multiple distributed clients to collaboratively train a machine learning model without sharing their raw data.

In traditional FL systems:

  1. Each worker trains locally.
  2. Workers send gradients to a Parameter Server.
  3. The server aggregates gradients.
  4. The updated model is broadcast back to workers.

However, this architecture creates a severe network bottleneck when model sizes become large.

This project solves that problem by moving gradient aggregation into the network switch itself using P4.


🚨 The Problem

In traditional Federated Learning:

  • Each worker sends large gradient vectors to the server.
  • The server must wait for N packets from N workers.
  • It then sums the weights and redistributes the result.

Issues

โŒ Heavy network congestion
โŒ Server CPU overhead
โŒ High latency
โŒ NIC bandwidth saturation

For modern neural networks containing millions of parameters, this approach becomes inefficient.


💡 The Solution

Using programmable switches with P4, we intercept worker packets and perform gradient aggregation inside the network switch.

Workflow

  1. Workers compute gradients.
  2. Gradients are quantized and packed into custom packets.
  3. Packets pass through the P4 programmable switch.
  4. The switch:
    • Reads the gradient values
    • Aggregates them using stateful registers
    • Drops intermediate packets
  5. After receiving contributions from all workers, the switch sends only ONE aggregated packet to the server.

Result

Traditional FL
Workers → Switch → Server
N packets received

In-Network Aggregation
Workers → P4 Switch (Aggregation) → Server
1 packet received

This dramatically reduces network load and improves efficiency.


🛠 Tech Stack

| Technology | Purpose |
|---|---|
| P4 (P4_16) | Data plane programming language |
| BMv2 `simple_switch` | Software switch target |
| Mininet | Virtual network topology |
| Scapy (Python) | Custom packet creation & sniffing |
| Docker | Containerized networking environment |

⚠️ Handling P4 Limitations

1️⃣ No Floating-Point Support

P4 targets support integer ALU operations only, while neural network weights are floating point.

Solution

Workers quantize gradients before sending:

quantized_weight = int(float_weight * 1000)

  • Converted to 32-bit signed integers
  • Sent inside the packet
  • The server dequantizes (divides by 1000) after aggregation
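The quantization scheme above can be sketched as follows. The scale factor of 1000 comes from the README; the rounding and the saturation to the int32 range are my additions (a raw multiply can overflow the 32-bit field the switch adds):

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
SCALE = 1000  # quantization factor used by the workers

def quantize(weight: float) -> int:
    """Map a float weight to a 32-bit signed integer the switch can add."""
    q = int(round(weight * SCALE))
    return max(INT32_MIN, min(INT32_MAX, q))  # saturate to the int32 range

def dequantize(q: int) -> float:
    """Invert the mapping on the parameter server after aggregation."""
    return q / SCALE
```

For example, `quantize(0.1234)` yields `123`, and the server recovers `0.123`, i.e. quantization loses precision beyond three decimal places.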

2๏ธโƒฃ Stateful Memory Management

Switches must maintain running sums across multiple packets.

Solution

P4 register arrays store partial sums (expressed here as pseudocode; P4_16 has no compound-assignment operator):

registers[param_index] = registers[param_index] + incoming_weight

This allows the switch to maintain persistent aggregation state.
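A minimal Python model of this stateful logic, not the P4 program itself: the worker count of 3 matches the topology below, and the emit-on-last-packet policy is an assumption based on the workflow description:

```python
NUM_WORKERS = 3  # h1, h2, h3

# Stand-ins for the P4 register arrays: partial sums and arrival
# counts, both indexed by parameter chunk.
sums: dict[int, list[int]] = {}
counts: dict[int, int] = {}

def on_packet(param_index: int, weights: list[int]):
    """Model of the switch pipeline: accumulate each packet's weights,
    and forward only once every worker has contributed."""
    acc = sums.setdefault(param_index, [0, 0, 0, 0])
    for i, w in enumerate(weights):
        acc[i] += w                      # stateful register update
    counts[param_index] = counts.get(param_index, 0) + 1
    if counts[param_index] == NUM_WORKERS:
        del sums[param_index], counts[param_index]  # reset for next round
        return acc                       # forward ONE aggregated packet
    return None                          # drop the intermediate packet
```

The first two calls for a chunk return `None` (the packet is dropped); the third returns the element-wise sum, mirroring the single aggregated packet the server sees.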


๐Ÿ— Project Architecture

Custom FL Packet Header

The custom header fl_update_t is encapsulated inside:

Ethernet → IPv4 → UDP → FL Header

It contains:

| Field | Description |
|---|---|
| worker_id | ID of the worker sending gradients |
| param_index | Index of the parameter chunk |
| w1, w2, w3, w4 | Quantized gradient values |
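One plausible on-the-wire layout for this header, sketched with Python's struct module. The field widths here are assumptions; the authoritative definition is the P4 header in the repo:

```python
import struct

# Assumed layout: worker_id (8-bit), param_index (32-bit unsigned),
# w1..w4 (32-bit signed quantized gradients), network byte order.
FL_FMT = "!BIiiii"  # 1 + 4 + 4*4 = 21 bytes

def pack_fl_update(worker_id, param_index, w1, w2, w3, w4):
    """Serialize one fl_update_t payload as a worker would."""
    return struct.pack(FL_FMT, worker_id, param_index, w1, w2, w3, w4)

def unpack_fl_update(payload):
    """Parse the payload back into its six fields."""
    return struct.unpack(FL_FMT, payload[:struct.calcsize(FL_FMT)])
```

In the actual project the workers build this payload with Scapy and carry it above UDP, so the switch parser sees it right after the UDP header.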

Components

๐Ÿ‘จโ€๐Ÿ’ป Workers (h1, h2, h3)

  • Generate random gradient vectors
  • Quantize weights
  • Send packets using Scapy

🔀 P4 Switch (s1)

Responsibilities:

  • Parse custom packet header
  • Read current weight sums
  • Perform integer addition
  • Maintain aggregation state
  • Forward only the final aggregated packet

🖥 Parameter Server (h4)

  • Listens on UDP port 5555
  • Receives aggregated packet
  • Dequantizes weights
  • Uses them for model update
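The server's receive path can be sketched like this. Port 5555 and the dequantization factor of 1000 come from the README; the payload layout follows the assumed 21-byte header above, and `decode_update`/`serve` are illustrative names, not the repo's actual code:

```python
import socket
import struct

SCALE = 1000.0  # must match the workers' quantization factor

def decode_update(payload: bytes):
    """Unpack an aggregated FL payload and dequantize its weights."""
    worker_id, param_index, *weights = struct.unpack("!BIiiii", payload[:21])
    return param_index, [w / SCALE for w in weights]

def serve(port: int = 5555):
    """Listen for aggregated packets, as the parameter server h4 does."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    while True:
        data, _ = sock.recvfrom(2048)
        chunk, weights = decode_update(data)
        print(f"Chunk {chunk} -> {weights}")
```

Because the switch already summed the contributions, the server does no per-worker bookkeeping: each received packet is one finished chunk.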

๐ŸŒ Network Topology

        h1 (Worker)
           \
        h2 (Worker) ---- s1 (P4 Switch) ---- h4 (Server)
           /
        h3 (Worker)

Workers send gradients → switch aggregates → server receives a single packet.


๐Ÿณ How to Run (Using Docker)

The entire environment is fully containerized.

No need to install P4, Mininet, or BMv2 locally.


1๏ธโƒฃ Build and Start Container

docker-compose up -d --build

2๏ธโƒฃ Start the Network Topology

Compile the P4 program and start Mininet.

docker exec -it p4_fl_aggregator_container python3 run_mininet.py

⚠️ Keep this terminal open.


3๏ธโƒฃ Start the Parameter Server

Open a new terminal:

docker exec -it p4_fl_aggregator_container mx h4 python3 client/server.py

4๏ธโƒฃ Start Worker 1

Open another terminal:

docker exec -it p4_fl_aggregator_container mx h1 python3 client/worker.py --id 1

5๏ธโƒฃ Start Workers 2 & 3

Open another terminal:

docker exec -it p4_fl_aggregator_container /bin/bash -c "mx h2 python3 client/worker.py --id 2 & mx h3 python3 client/worker.py --id 3"

6๏ธโƒฃ Execute Workers

Press Enter in:

  1. Worker 1 terminal
  2. Immediately press Enter in Worker 2/3 terminal

This simulates concurrent gradient updates.


🎯 Expected Output

On the Parameter Server terminal, you will observe:

  • Multiple worker packets entering the switch
  • Switch aggregating gradients
  • Single aggregated packet received per parameter chunk

Example:

Received aggregated gradients:
Chunk 1 → [10234, -3422, 8765, 2231]
Chunk 2 → [4521, 2234, -1234, 7642]

This confirms that aggregation happened inside the P4 switch: the server receives one summed packet per parameter chunk instead of one packet per worker.


⚡ Key Advantages

✔ Reduces network congestion
✔ Minimizes server computation load
✔ Enables line-rate aggregation
✔ Demonstrates in-network computing
✔ Inspired by the SwitchML architecture


🔮 Future Improvements

  • Support more workers
  • Implement vectorized gradient aggregation
  • Integrate with real ML frameworks
  • Deploy on hardware programmable switches
  • Add secure aggregation mechanisms

📚 Inspiration

This project is inspired by the research system:

SwitchML: "Scaling Distributed Machine Learning with In-Network Aggregation" (NSDI '21)


๐Ÿ‘จโ€๐Ÿ’ป Author

Sadvik Kumar
B.Tech CSE (AI & ML)


⭐ If you found this project interesting, consider starring the repository!
