@zeeshanlakhani commented Sep 25, 2025

Implements end-to-end multicast networking across Omicron's control plane and sled-agent, integrated with IP pool extensions from #9084.

Closes #8242.

TL;DR:

Implements fleet-wide multicast groups across the control plane and sled-agent, integrated with IP pool extensions (#9084). Adds a reconciliation worker (RPW), inventory-based sled→switch-port mapping, a multi-switch multicast dataplane trait, and paired external/underlay groups for NAT and Source-Specific Multicast (SSM). Introduces fleet-scoped auth and a 3-state membership lifecycle; requires schema v209 and sled-agent API v7; feature is disabled by default.

Highlights:

  • An RPW for reconciling groups and instance members (ensuring dataplane state matches DB)
    • Inventory-based sled→switch-port mapping with validation tests
  • A multicast-focused dataplane trait separating control plane logic from Dendrite/DPD; works across multiple switches
  • Bifurcated architecture with paired external/underlay groups for NAT-based forwarding
  • 3-state instance member lifecycle ("Joining" → "Joined" → "Left") with reactivation support
  • Fleet-scoped authorization model allowing cross-project multicast
  • New DB tables: multicast_group, underlay_multicast_group, multicast_group_member
  • External groups: Customer-facing IPv4/IPv6 addresses from IP pools with SSM support
  • Underlay groups: Admin-scoped IPv6 (ff04::/16); default allocation from fixed ff04::/64 for internal rack forwarding
  • Feature flag and reconciler/cache settings exist and default to disabled/safe values
  • Member states: "Joining"/"Joined"/"Left" with soft-delete/mark-for-removal for instance lifecycle
  • Group states: "Creating"/"Active"/"Deleting"/"Deleted" for RPW processing
  • sled-agent: API v7 with multicast join/leave endpoints
  • Inventory / Port correlation
    • Validates baseboard identifiers match between sleds and SPs
    • Required for multicast reconciler to map sled_id → rear switch-ports (backplane) for instances
  • mvlan: External groups support an optional Multicast VLAN for (eventual) upstream egress
  • Updates to instance sagas as Nexus passes memberships to sled-agent via InstanceSledLocalConfig.multicast_groups
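
To make the lifecycle concrete, here is a rough Rust sketch of the two state machines described above (type and method names are illustrative stand-ins, not the actual Omicron definitions):

```rust
/// Hypothetical sketch of the member lifecycle ("Joining" -> "Joined" -> "Left");
/// the real types live in Nexus's DB model and may be named differently.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MemberState {
    Joining, // membership recorded in DB; dataplane not yet programmed
    Joined,  // switch acknowledged programming; traffic flows
    Left,    // soft-deleted / marked for removal; eligible for reactivation
}

/// Group states driving RPW processing.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GroupState {
    Creating,
    Active,
    Deleting,
    Deleted,
}

impl MemberState {
    /// Legal transitions, including "Left" -> "Joining" reactivation.
    fn can_transition_to(self, next: MemberState) -> bool {
        use MemberState::*;
        matches!(
            (self, next),
            (Joining, Joined) | (Joining, Left) | (Joined, Left) | (Left, Joining)
        )
    }
}

fn main() {
    // Reactivation: a member that left can rejoin.
    assert!(MemberState::Left.can_transition_to(MemberState::Joining));
    let _ = GroupState::Active; // group states are consumed by the RPW
}
```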

API Endpoints:

  • GET /v1/multicast-groups: List fleet multicast groups
  • POST /v1/multicast-groups: Create multicast group
  • GET /v1/multicast-groups/{group}: View group details
  • PUT /v1/multicast-groups/{group}: Update group (name, sources)
  • DELETE /v1/multicast-groups/{group}: Delete group
  • GET /v1/multicast-groups/{group}/members: List group members
  • POST /v1/multicast-groups/{group}/members: Add instance to group
  • DELETE /v1/multicast-groups/{group}/members/{instance}: Remove instance from group
  • GET /v1/instances/{instance}/multicast-groups: List groups for an instance
  • PUT /v1/instances/{instance}/multicast-groups/{group}: Join instance to group
  • DELETE /v1/instances/{instance}/multicast-groups/{group}: Leave group
  • GET /v1/system/multicast-groups/by-ip/{address}: Lookup group by IP address

The instance-scoped endpoints are an alternative interface to the same join/leave operations; the system-scoped endpoint supports group lookup by IP address.
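
For illustration only, a minimal client-side sketch of the create-then-join flow (assuming reqwest with the json feature plus tokio; the base URL, omitted auth, and request-body fields are assumptions here rather than the actual OpenAPI shapes):

```rust
use reqwest::Client;
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let client = Client::new();
    let base = "https://nexus.example.com"; // hypothetical Nexus endpoint

    // Create a multicast group; body fields are illustrative.
    let group: serde_json::Value = client
        .post(format!("{base}/v1/multicast-groups"))
        .json(&json!({
            "name": "video-feed",
            "description": "example group",
            // For SSM groups, permitted sources would be supplied here.
            "source_ips": ["203.0.113.7"],
        }))
        .send()
        .await?
        .json()
        .await?;
    println!("created: {group}");

    // Join an instance using the instance-scoped alias endpoint.
    client
        .put(format!("{base}/v1/instances/my-instance/multicast-groups/video-feed"))
        .send()
        .await?;
    Ok(())
}
```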

New Sagas:

  • multicast_group_dpd_ensure: Ties together external/underlay group creation on all switches
  • multicast_group_dpd_update: Updates group configuration across switches

Breaking Changes:

  • sled-agent API version bump from v6 to v7
  • New required configuration in Nexus (multicast.enabled flag, reconciler period, and cache TTL settings; a sketch follows this list)
  • Schema migration required (v208 → v209)
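
As a sketch of what the new configuration section might deserialize into (field names are assumptions inferred from the settings listed above; the authoritative shape is in nexus_config), assuming serde with the derive feature:

```rust
use serde::Deserialize;
use std::time::Duration;

/// Illustrative only: mirrors the settings named above (enable flag,
/// reconciler period, cache TTL), not the actual nexus_config types.
#[derive(Debug, Deserialize)]
struct MulticastConfig {
    /// Feature flag; off by default for safe rollout.
    #[serde(default)]
    enabled: bool,
    /// How often the RPW reconciler runs, in seconds.
    reconciler_period_secs: u64,
    /// TTL for the sled -> switch-port mapping cache, in seconds.
    cache_ttl_secs: u64,
}

impl MulticastConfig {
    fn reconciler_period(&self) -> Duration {
        Duration::from_secs(self.reconciler_period_secs)
    }

    fn cache_ttl(&self) -> Duration {
        Duration::from_secs(self.cache_ttl_secs)
    }
}
```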

Migration Notes:

  • Multicast as a feature is disabled by default for safe rollout
  • Multicast endpoints are marked as "experimental"

References:

This work introduces multicast IP pool capabilities to support external
multicast traffic routing through the rack's switching infrastructure.

Includes:
  - Add IpPoolType enum (unicast/multicast) with unicast as default
  - Add multicast pool fields: switch_port_uplinks (UUID[]), mvlan (VLAN ID)
  - Add database migration (multicast-support/up01.sql) with new columns and indexes
  - Add ASM/SSM range validation for multicast pools to prevent mixing
  - Add pool type-aware resolution for IP allocation
  - Add custom deserializer for switch port uplinks with deduplication
  - Update external API params/views for multicast pool configuration
  - Add SSM constants (IPV4_SSM_SUBNET, IPV6_SSM_FLAG_FIELD) for validation

Database schema updates:
  - ip_pool table: pool_type, switch_port_uplinks, mvlan columns
  - Index on pool_type for efficient filtering
  - Migration preserves existing pools as unicast type by default

This provides the foundation for multicast group functionality while
maintaining full backward compatibility with existing unicast pools.
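
To illustrate the ASM/SSM separation, here is a sketch of the range check using the well-known SSM prefixes (232.0.0.0/8 for IPv4 per RFC 4607, and the ff3x:: flag-field pattern for IPv6); the actual validation sits behind the IPV4_SSM_SUBNET and IPV6_SSM_FLAG_FIELD constants mentioned above, and the function names here are hypothetical:

```rust
use std::net::{Ipv4Addr, Ipv6Addr};

/// IPv4 SSM range per RFC 4607 (232.0.0.0/8).
fn is_ipv4_ssm(addr: Ipv4Addr) -> bool {
    addr.octets()[0] == 232
}

/// IPv6 SSM addresses carry flag field 3 (ff3x::, RFC 4607).
fn is_ipv6_ssm(addr: Ipv6Addr) -> bool {
    let o = addr.octets();
    o[0] == 0xff && (o[1] >> 4) == 0x3
}

/// Reject pool ranges that mix ASM and SSM addresses.
fn validate_ipv4_range(first: Ipv4Addr, last: Ipv4Addr) -> Result<(), String> {
    match (is_ipv4_ssm(first), is_ipv4_ssm(last)) {
        (true, true) | (false, false) => Ok(()),
        _ => Err("range mixes ASM and SSM addresses".into()),
    }
}

fn main() {
    assert!(is_ipv4_ssm(Ipv4Addr::new(232, 1, 2, 3)));
    assert!(!is_ipv4_ssm(Ipv4Addr::new(224, 0, 0, 1)));
    assert!(is_ipv6_ssm("ff3e::1".parse().unwrap()));
    assert!(validate_ipv4_range(
        Ipv4Addr::new(232, 0, 0, 1),
        Ipv4Addr::new(232, 0, 0, 255),
    )
    .is_ok());
}
```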

References (for review):
  - RFD 488: https://rfd.shared.oxide.computer/rfd/488
  - Dendrite PRs (based on recency):
    * oxidecomputer/dendrite#132
    * oxidecomputer/dendrite#109
    * oxidecomputer/dendrite#14
@zeeshanlakhani changed the title from "Zl/mcast impl" to "[feat] Multicast Group Support" Sep 25, 2025
@zeeshanlakhani changed the base branch from main to zl/ip-pool-multicast-support September 25, 2025 16:04
@zeeshanlakhani self-assigned this Sep 25, 2025
@zeeshanlakhani changed the title from "[feat] Multicast Group Support" to "[feat, multicast] Multicast Group Support" Sep 25, 2025

Introduces end-to-end multicast group support across the control plane and sled-agent, integrated with the IP pool extensions required
for multicast workflows. This work enables project-scoped multicast groups with lifecycle-driven dataplane programming
and exposes an API for managing multicast groups and their instance memberships.

Highlights:
  - DB: new multicast_group tables; member lifecycle management
  - API: multicast group/member CRUD; source IP validation; VPC/project hierarchy integration with default VNI fallback
  - Control plane: RPW reconcilers for groups/members; sagas that apply dataplane updates atomically at the group level; instance lifecycle hooks and piggybacking
  - Dataplane: Dendrite DPD switch programming via trait abstraction; DPD client used in tests
  - Sled agent: multicast-aware instance management; network interface configuration for multicast traffic; cross-version testing; OPTE stubs present
  - Tests: comprehensive integration suites under nexus/tests/integration_tests/multicast/

Components:
  - Database schema: external and underlay multicast groups; member/instance association tables
  - Control plane modules: multicast group management, member lifecycle, dataplane abstraction; RPW reconcilers to ensure convergence
  - API layer: endpoints and validation; default-VNI semantics when VPC not provided
  - Sled agent: OPTE stubs and compatibility shims for older agents

Workflows Implemented:
  1. Instance lifecycle integration:

     - "Create" -> resolve VPC/VNI (or default), validate source IPs, create memberships, enqueue group ensure RPW
     - "Start" -> program dataplane via ensure/update sagas; activate member flows after switch ack
     - "Stop" -> deactivate dataplane membership; retain DB membership for fast restart
     - "Delete" -> remove instance memberships; group deletion is explicit
     - "Migrate" -> deactivate on source sled; activate on target; idempotent with ordering guarantees
     - Restart/recovery -> RPWs reconcile desired state; compensations clean up partial programming

  2. RPW reconciliation:

     - Ensure dataplane switches match database state
     - Handle sled migrations and state transitions
     - Eventual consistency with retry logic
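
In outline, each reconciliation pass is a level-triggered compare-and-repair loop like the sketch below (all types are stand-ins, not Omicron or DPD APIs; the real RPW runs inside Nexus's background-task framework):

```rust
// Sketch of the reconcile pass described above. `Dataplane` and
// `SwitchGroupState` are placeholders, not the actual trait or types.

#[derive(Clone, PartialEq)]
struct SwitchGroupState {
    // replicated group config + member ports would live here
}

trait Dataplane {
    /// What the switch currently reports for a group, if anything.
    fn current_state(&self, group: &str) -> Option<SwitchGroupState>;
    /// Program the switch toward the desired state.
    fn apply(&self, group: &str, desired: &SwitchGroupState) -> Result<(), String>;
}

fn reconcile<D: Dataplane>(
    dataplane: &D,
    desired: &[(String, SwitchGroupState)],
) -> Vec<String> {
    let mut failed = Vec::new();
    for (group, want) in desired {
        // Compare desired (DB) state against what the switch reports,
        // and reprogram on drift; retries give eventual consistency.
        if dataplane.current_state(group).as_ref() != Some(want) {
            if dataplane.apply(group, want).is_err() {
                failed.push(group.clone()); // retried on the next pass
            }
        }
    }
    failed
}
```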

Migrations:
  - Apply schema changes in schema/crdb/multicast-group-support/up01.sql (and update dbinit.sql)
  - Bump schema versions accordingly

API/Compatibility:
  - OpenAPI updated: openapi/nexus.json, openapi/sled-agent/sled-agent-5.0.0-89f1f7.json
  - Contains a version change (to v5), as InstanceEnsureBody now includes the multicast_groups associated with an instance in the underlying sled config
  - Regenerate clients where applicable

References:
  - RFD 488: https://rfd.shared.oxide.computer/rfd/488
  - IP Pool extensions: #9084
  - Dendrite PRs (based on recency):
    * oxidecomputer/dendrite#132
    * oxidecomputer/dendrite#109
    * oxidecomputer/dendrite#14

Follow-ups include:
  - OPTE integration
  - commtest extension
  - omdb commands are tracked in issues
  - pool and group stats

Since we still have OPTE and Maghemite updates to come for statically routed multicast,
we gate RPW and saga actions behind runtime configuration ("on" for tests). API calls
are tagged "experimental."
@zeeshanlakhani commented:

@internet-diglett, others, I added "feature-gating" to this PR, as well as "experimental" tagging for the new entrypoints.

@rcgoodfellow left a comment:

A few API questions to start out with.

Includes:
  * Documentation cleanup across the board
  * Schema+Model
    - Remove rack_id from ExternalMulticastGroup model and database schema
  * Reconciler -> Backplane Port Resolution + Refactor `handle` fns
    - Add sled → switch port mapping cache with TTL
    - Fetch backplane map from DPD for topology validation
    - Resolve sled_id → SP (via inventory collection call) → sp_slot → rear port
    - Validate sp_slot values against hardware backplane map
    - Cache mappings per-sled with automatic invalidation on topology
      changes
    - Refactor member state processing logic
  * Dataplane Client
    - Add fetch_backplane_map() for topology validation from DPD-client
    - Refactor drift detection and better logging
    - Extend member add/remove operations with port resolution
  * Simulation Infrastructure
    - Add FAKE_GIMLET_MODEL constant ("i86pc") in sp-sim
    - Update sled-agent-sim to use sp_sim::FAKE_GIMLET_MODEL
    - Add for_testing_with_baseboard() helper for custom baseboard configs
    - Enables inventory-based sled/SP matching in tests
  * Testing
    - Add integration_tests/inventory_matching.rs test
    - Update multicast tests for inventory-based port resolution
    - Add ensure_inventory_ready() helper for RPW reconciler tests
  * Config
    - nexus_config additions for cache TTLs, etc
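
A minimal sketch of the per-sled cache shape described above, with TTL expiry plus an explicit invalidation path (names are illustrative; assumes the uuid crate with the v4 feature):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};
use uuid::Uuid;

/// Cached sled -> rear switch-port mapping with TTL-based expiry.
struct PortMapCache {
    ttl: Duration,
    entries: HashMap<Uuid, (Instant, u8 /* rear port */)>,
}

impl PortMapCache {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Returns a mapping only while it is still fresh.
    fn get(&self, sled_id: &Uuid) -> Option<u8> {
        self.entries
            .get(sled_id)
            .filter(|(at, _)| at.elapsed() < self.ttl)
            .map(|&(_, port)| port)
    }

    fn insert(&mut self, sled_id: Uuid, port: u8) {
        self.entries.insert(sled_id, (Instant::now(), port));
    }

    /// Called when inventory reports a topology change.
    fn invalidate_all(&mut self) {
        self.entries.clear();
    }
}

fn main() {
    let mut cache = PortMapCache::new(Duration::from_secs(60));
    let sled = Uuid::new_v4();
    cache.insert(sled, 7);
    assert_eq!(cache.get(&sled), Some(7));
    cache.invalidate_all();
    assert_eq!(cache.get(&sled), None);
}
```
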
@zeeshanlakhani commented:

@rcgoodfellow @internet-diglett @FelixMcFelix ok, this is back up for review.

@rcgoodfellow left a comment:

Thanks @zeeshanlakhani. Changes address my primary concerns. A few questions here on cache invalidation.

Sorry if I missed them, but if they're not here, can we add some tests around the cases where we expect cache invalidation to kick in, e.g., inventory changes and TTL timeouts?

Let's also sync up with @askfongjojo on taking this on a lap through the product assurance test suite.

Includes:
  - Stale port cleanup: When cache invalidation occurs (manual or via
    topology changes), reconciler now removes members from old switch ports
    before programming new ones. Prevents stale forwarding state.
  - We now compute a union of active member ports across all "Joined"
    members to safely prune only stale ports
  - We also add a fallback removal path for when `sled_id` is
    unavailable on the verify path (e.g., member.sled_id is NULL or sled
    removed)
  - Wired up cache invalidation flag and inventory watchers:
    - Adds `AtomicBool` flag shared between reconciler and Nexus for manual
      cache invalidation signaling
    - Connects inventory collection/load watchers to reconciler to trigger
      automatic updates when topology changes
    - Reconciler clears invalidation flag after processing
  - Adds cache invalidation tests, better error handling, etc
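
The stale-port pruning described above reduces to a small set computation, sketched here with placeholder port types (the real code operates on DPD port identifiers):

```rust
use std::collections::HashSet;

// Sketch of the pruning rule above: keep the union of ports used by any
// "Joined" member, and prune only programmed ports outside that union.
fn stale_ports(
    programmed: &HashSet<u8>,
    joined_member_ports: &[HashSet<u8>],
) -> HashSet<u8> {
    // Union of active member ports across all Joined members.
    let active: HashSet<u8> = joined_member_ports
        .iter()
        .flat_map(|s| s.iter().copied())
        .collect();
    // Only ports no longer backed by any active member are removed.
    programmed.difference(&active).copied().collect()
}

fn main() {
    let programmed: HashSet<u8> = [1u8, 2, 3].into_iter().collect();
    let members: Vec<HashSet<u8>> = vec![
        [1u8].into_iter().collect(),
        [2u8].into_iter().collect(),
    ];
    let expected: HashSet<u8> = [3u8].into_iter().collect();
    assert_eq!(stale_ports(&programmed, &members), expected);
}
```
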
@zeeshanlakhani commented:

@rcgoodfellow, cache updates in. @askfongjojo, lmk when you want to sync up on testing this. I already noted the enablement flag as well (since this is still experimental).

@askfongjojo commented Nov 10, 2025

> Let's also sync up with @askfongjojo on taking this on a lap through the product assurance test suite.

I deployed the PR to a racklet and ran the same regression tests I did for an R17 release candidate, and haven't observed any difference in functional behavior or TCP network I/O perf. The tests cover mainly firewall and VPC custom routes, VM-to-VM VPC subnet throughput, and some storage benchmarks (which are more for exercising the propolis-to-crucible network path).

On this PR, I did see higher datagram loss % (>10%) under multiple threads, e.g.,

```
$ iperf3 -c 172.30.0.8 -u --bitrate 64M -P 8 -i 0
...
[SUM]   0.00-10.00  sec   610 MBytes   512 Mbits/sec  0.000 ms  0/441960 (0%)  sender
[SUM]   0.00-10.06  sec   449 MBytes   375 Mbits/sec  0.316 ms  116546/441955 (26%)  receiver
```

whereas on rack2, I typically get

```
[SUM]   0.00-10.00  sec   610 MBytes   512 Mbits/sec  0.000 ms  0/441752 (0%)  sender
[SUM]   0.00-10.01  sec   609 MBytes   511 Mbits/sec  0.099 ms  514/441752 (0.12%)  receiver
```

I'll re-test on another racklet to see if it's because of environment factors.

Update (11/10/2025): A re-test on the same racklet using an R17 release candidate yielded a similar datagram loss rate, so the issue is specific to the racklet environment, not this PR. Sorry about the confusion @zeeshanlakhani.

@rcgoodfellow left a comment:

Thanks @zeeshanlakhani. All my review comments have been addressed. Appreciate all the hard work in getting this initial infrastructure for multicast groups integrated into the control plane.

@zeeshanlakhani commented:

Thanks @rcgoodfellow. @internet-diglett, did you have any follow-ups or anything on the versioning? If not, I'll merge tonight/day.

@zeeshanlakhani merged commit 18058fc into main Nov 15, 2025 (18 checks passed)
@zeeshanlakhani deleted the zl/mcast-impl branch November 15, 2025 05:41
@zeeshanlakhani changed the title from "[feat, multicast] Multicast Group Support" to "[feat, multicast] Multicast Group+Member Support" Nov 15, 2025
zeeshanlakhani added a commit to oxidecomputer/oxnet that referenced this pull request Nov 18, 2025
This aligns better with the current Multicast work in Omicron: oxidecomputer/omicron#9091. 

Includes:
  * Handle IPv4 Local (and org-local mcast)
  * Better naming conventions