
manually cherry-pick from public master branch #20

Merged
prabhataravind merged 14 commits into Azure:202506 from zjswhhh:cp
Oct 1, 2025

@zjswhhh zjswhhh commented Oct 1, 2025

5a6e537 (HEAD -> 202506) Make deserializer more forgiven to parse dash_bfd_probe_state (#113)
9dc1357 Proper handling of entry not found in incoming vs decoding error (#114)
d44084c Unregister handlers after actor terminates (#115)
e4aa6be Add local nexthop ip (#116)
a444e77 Move vnet_tunnel_route_table to producer bridge (#110)
a0f1616 Implement cleanup logic for all the actors (#102)
54a099e Fix ha state. (#107)
54e7cd5 fix show hamgrd actor command (#108)
bc017ee Implement mark-delete in actor framework (#104)
b48373a Update dpu scope state table name. (#106)
4f62566 Route exchange (#55)
1455ee2 Convert Unspecified to Standby in DashHaScopeTable. (#97)
30e836e Fix producer bridge CI issue (#105)
e6e11ba Adjust path of debs in common-libs artifact (#103)

sign-off: Jing Zhang zhangjing@microsoft.com

yue-fred-gao and others added 14 commits October 1, 2025 20:16
### why
The PR checker failed because common-libs now packages bookworm debs instead
of bullseye debs.

### what this PR does
Change bullseye to bookworm.
### why
The producer_state_table_bridge_check_dup test failed. We suspect that
swss-common behaviour changed.

### what this PR does
Make the check stricter to ensure that no entries are received from the
consumer.
Translate the "Unspecified" DesiredHaState from the dash ha scope config to
"standby" in DashHaScopeTable.
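The translation can be sketched roughly as below. This is only an illustration: the enum, its variants other than Unspecified and Standby, the function name, and the lowercase string values are assumptions, not the types used in the repository.

```rust
/// Hypothetical mirror of the desired HA state in the HA scope config.
/// Only `Unspecified` and `Standby` come from the commit message; `Active`
/// is included just to show a state that passes through unchanged.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DesiredHaState {
    Unspecified,
    Active,
    Standby,
}

/// Returns the value written to DashHaScopeTable (assumed lowercase names).
pub fn ha_state_for_scope_table(state: DesiredHaState) -> &'static str {
    match state {
        // The translation described above: Unspecified is treated as standby.
        DesiredHaState::Unspecified | DesiredHaState::Standby => "standby",
        DesiredHaState::Active => "active",
    }
}
```

Folding both variants into one match arm keeps the mapping in a single place, so a later change to the default only touches one line.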
Implement the route exchange feature specified in the wiki. For the detailed
behaviour, see
https://github.com/sonic-net/sonic-dash-ha/wiki/SWBus-(Switch-Bus)#route-exchange-working-theory
### why
An actor has retry logic in its outgoing state. If a message is not acked,
the actor resends it to make sure the receiver has received it
successfully. When an actor is terminated, the retries are terminated
as well, so delivery to the receiver can no longer be guaranteed.

### what this PR does
Introduce the mark-delete concept.
1. When an actor is about to terminate, set the "mark_deleted" flag on the
driver of the actor.
2. The run loop of the actor, which is triggered each time the actor
receives a message, checks whether the actor is ready_for_delete.
3. ready_for_delete checks whether there are unacked messages in the outgoing
state. The run loop exits only when there are none.
4. While an actor is in the mark_deleted state, it stops processing incoming
requests but always replies OK, so two mark_deleted actors won't form a
dead loop.
5. Responses are processed normally so unacked messages can be ACKed.
6. management_request is processed normally so we can still dump actor
state using swbus-cli.
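The six steps above can be sketched as a small state machine. This is a simplified model, not the actual driver: `ActorDriver` here is a stand-in struct, and `Message`, `Outcome`, `send` and `handle` are hypothetical names introduced only for illustration.

```rust
use std::collections::HashMap;

/// Message kinds the run loop distinguishes (hypothetical names).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Message {
    Request { id: u64 },
    Response { acked_id: u64 },
    ManagementRequest,
}

/// What the driver did with one message.
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Processed,
    RepliedOkWithoutProcessing,
    Acked,
    StateDumped,
}

/// Simplified stand-in for the actor driver.
#[derive(Default)]
pub struct ActorDriver {
    mark_deleted: bool,
    /// Outgoing messages still waiting for an ack, keyed by message id.
    unacked: HashMap<u64, String>,
}

impl ActorDriver {
    /// Records an outgoing message that must be acked before the actor exits.
    pub fn send(&mut self, id: u64, payload: &str) {
        self.unacked.insert(id, payload.to_string());
    }

    /// Step 1: flag the actor for deletion instead of stopping it immediately.
    pub fn mark_deleted(&mut self) {
        self.mark_deleted = true;
    }

    /// Step 3: ready only when marked and nothing is left unacked.
    pub fn ready_for_delete(&self) -> bool {
        self.mark_deleted && self.unacked.is_empty()
    }

    /// One run-loop iteration; the bool says whether the loop should exit (step 2).
    pub fn handle(&mut self, msg: Message) -> (Outcome, bool) {
        let outcome = match msg {
            // Step 4: a mark_deleted actor replies OK without processing.
            Message::Request { .. } if self.mark_deleted => Outcome::RepliedOkWithoutProcessing,
            Message::Request { .. } => Outcome::Processed,
            // Step 5: responses still clear unacked messages.
            Message::Response { acked_id } => {
                self.unacked.remove(&acked_id);
                Outcome::Acked
            }
            // Step 6: management requests are still served.
            Message::ManagementRequest => Outcome::StateDumped,
        };
        (outcome, self.ready_for_delete())
    }
}
```

In this model the exit check runs after every message, so the loop ends on the response that acks the last outstanding message.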
### why
The show hamgrd actor command has been broken since the route_exchange PR was
merged. The log shows the error below:
Sep 5 12:11:25 ott-ss-010 swbusd: 2025-09-05T16:11:25.820184Z ERROR
ConnWorker{conn_id="swbs-from://127.0.0.1:39642"}: 96: Failed to process
the incoming message: Input:InvalidArgs - Invalid management request:
ManagementRequest { request: HamgrdGetActorState, arguments: [] }

This is because swbusd incorrectly intercepts all ManagementRequests.
### What this PR does
1. Check whether the ManagementRequest has swbusd's service path as its
destination. If not, route the message.
2. Fix some misc issues exposed by the above code change.
3. Use init_logger_for_test from Logger and remove the proprietary
implementation.
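The destination check in item 1 amounts to something like the sketch below. Service paths are modelled as plain strings here, and `Disposition` and `dispatch_management_request` are hypothetical names; the real swbusd types are not shown in this PR description.

```rust
/// What swbusd should do with an incoming ManagementRequest (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Disposition {
    /// The request is addressed to swbusd itself.
    HandleLocally,
    /// The request is for someone else (for example hamgrd) and must be routed.
    Route,
}

/// Decides between local handling and routing by comparing the destination
/// with swbusd's own service path. Paths are plain strings in this sketch.
pub fn dispatch_management_request(destination: &str, own_service_path: &str) -> Disposition {
    if destination == own_service_path {
        Disposition::HandleLocally
    } else {
        Disposition::Route
    }
}
```

With this check, a request such as HamgrdGetActorState that targets another service path is forwarded instead of being rejected as an invalid management request.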
local_ha_state was being set to ha_role. Update it so that it is correctly
set to the DPU's ha_state.

This fixes issue #91
### why

This addresses issue #100. When upstream deletes the DB entry that
originated the actor, the actor should clean up all the DB entries
it has created before terminating itself. For example, deleting a
DashHaSetConfig entry should trigger cleanup in the
corresponding HaSetActor, which includes removing the DASH_HA_SET_TABLE entry
it created in DPU_APPL_DB and the VNET_ROUTE_TUNNEL_TABLE entry in APPL_DB.

### what this PR does
1. Implement cleanup for all the actors.

- DpuActor: remove entries in DPU_APPL_DB/BFD_SESSION_TABLE
- VDpuActor: unregister from DpuActor
- HaSetActor: remove entry from DPU_APPL_DB/DASH_HA_SET_TABLE, remove
entry from APPL_DB/VNET_ROUTE_TUNNEL_TABLE, unregister from VDpuActor
- HaScopeActor: remove entry from DPU_APPL_DB/DASH_HA_SCOPE_TABLE,
remove entry from STATE_DB/DASH_HA_SCOPE_STATE and unregister from
VDpuActor and HaSetActor

2. Extend the ChkDb macro to check that a db entry doesn't exist
3. Extend Internal state with deleting an entry from db
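As an illustration of the HaSetActor case, the sketch below uses an in-memory stand-in for the databases. `FakeDb`, `entry`, `create_entries` and `cleanup` are hypothetical names, and using the same key in both tables is a simplification; only the DB and table names come from the description above.

```rust
use std::collections::BTreeSet;

/// Builds the (db, table, key) tuple used to identify an entry.
fn entry(db: &str, table: &str, key: &str) -> (String, String, String) {
    (db.to_string(), table.to_string(), key.to_string())
}

/// In-memory stand-in for the SONiC databases (not the real DB client).
#[derive(Default)]
pub struct FakeDb {
    entries: BTreeSet<(String, String, String)>,
}

impl FakeDb {
    pub fn set(&mut self, db: &str, table: &str, key: &str) {
        self.entries.insert(entry(db, table, key));
    }
    pub fn del(&mut self, db: &str, table: &str, key: &str) {
        self.entries.remove(&entry(db, table, key));
    }
    pub fn exists(&self, db: &str, table: &str, key: &str) -> bool {
        self.entries.contains(&entry(db, table, key))
    }
}

/// Simplified stand-in for the HA set actor.
pub struct HaSetActor {
    pub id: String,
}

impl HaSetActor {
    /// Entries the actor creates while it is alive.
    pub fn create_entries(&self, db: &mut FakeDb) {
        db.set("DPU_APPL_DB", "DASH_HA_SET_TABLE", &self.id);
        db.set("APPL_DB", "VNET_ROUTE_TUNNEL_TABLE", &self.id);
    }

    /// Cleanup run before the actor terminates: remove everything it created.
    pub fn cleanup(&self, db: &mut FakeDb) {
        db.del("DPU_APPL_DB", "DASH_HA_SET_TABLE", &self.id);
        db.del("APPL_DB", "VNET_ROUTE_TUNNEL_TABLE", &self.id);
    }
}
```

The `exists` helper plays the role that the extended ChkDb check plays in tests: after cleanup, the entries should no longer be found.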
### why
vnet_tunnel_route_table needs to be updated via ProducerStateTable to
properly trigger the orchagent handlers.

### what this PR does
Move the table from internal state to outgoing state via the producer bridge.
Add local_nexthop_ip so that the correct nexthop IP is used as the endpoint in
VNET_ROUTE_TUNNEL_TABLE when the DPU is local.
### why
When an actor terminates itself, its handler in SwbusEdgeRuntime is not
removed. When a new actor is spawned with the same service path, it is
ignored because the ActorCreator relies on "NoRoute" to discover new
actors.

### what this PR does
When ActorDriver exits from the run loop, the SimpleSwbusEdgeClient it owns
is dropped, and its destructor removes the handler.
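The pattern is ordinary Rust `Drop`-based cleanup. The sketch below models it with stand-in types: `Runtime` and `EdgeClient` are hypothetical simplifications of SwbusEdgeRuntime and SimpleSwbusEdgeClient, and the handler table is reduced to a set of service path strings.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// Stand-in for the edge runtime: tracks which service paths have a handler.
#[derive(Default)]
pub struct Runtime {
    handlers: Mutex<HashSet<String>>,
}

impl Runtime {
    pub fn has_handler(&self, service_path: &str) -> bool {
        self.handlers.lock().unwrap().contains(service_path)
    }
}

/// Stand-in for the edge client owned by the actor driver.
pub struct EdgeClient {
    runtime: Arc<Runtime>,
    service_path: String,
}

impl EdgeClient {
    /// Registers a handler for `service_path` when the client is created.
    pub fn new(runtime: Arc<Runtime>, service_path: &str) -> Self {
        runtime.handlers.lock().unwrap().insert(service_path.to_string());
        EdgeClient { runtime, service_path: service_path.to_string() }
    }
}

impl Drop for EdgeClient {
    /// Removes the handler when the client goes out of scope, so a later actor
    /// with the same service path is no longer shadowed by a stale handler.
    fn drop(&mut self) {
        // Skip removal if the lock is poisoned rather than panic inside drop.
        if let Ok(mut handlers) = self.runtime.handlers.lock() {
            handlers.remove(&self.service_path);
        }
    }
}
```

Tying the removal to `Drop` means it happens on every exit path from the run loop, including early returns.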
This addresses issue #111
### why
Currently Incoming::get returns an error if the entry is not found, and the
caller typically propagates the error further. Sometimes it is normal
for an entry not to exist, and that case needs to be treated differently from
a message decode error.

### what this PR does
Incoming::get returns an Option. The caller needs to handle a None return,
which means the entry was not found, accordingly.
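A caller can then separate the two cases along the lines of the sketch below. The `Incoming` struct here is a simplified stand-in backed by a string map, and `DecodeError`, `insert` and `read_counter` are hypothetical; the real signature of Incoming::get is not shown in this description.

```rust
use std::collections::HashMap;

/// Error for a value that exists but cannot be decoded (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct DecodeError(pub String);

/// Simplified stand-in for the incoming state: raw values keyed by name.
#[derive(Default)]
pub struct Incoming {
    entries: HashMap<String, String>,
}

impl Incoming {
    pub fn insert(&mut self, key: &str, raw: &str) {
        self.entries.insert(key.to_string(), raw.to_string());
    }

    /// Returns None when the entry is not found, instead of an error.
    pub fn get(&self, key: &str) -> Option<&str> {
        self.entries.get(key).map(String::as_str)
    }
}

/// Example caller: a missing entry is Ok(None), a bad value is Err.
pub fn read_counter(incoming: &Incoming, key: &str) -> Result<Option<u32>, DecodeError> {
    match incoming.get(key) {
        None => Ok(None),
        Some(raw) => raw
            .parse::<u32>()
            .map(Some)
            .map_err(|e| DecodeError(e.to_string())),
    }
}
```

The caller decides whether a missing entry is acceptable, while decode failures still propagate as errors.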
### why
Currently the deserializer for dash_bfd_probe_state has strict
requirements on the format of the fields and rejects any value that does not
follow them. Specifically, the timestamp field is enclosed
in double quotes, which caused a parsing error.

### what this PR does
Make the deserializer more forgiving about format. If a value has double
or single quotes or whitespace, remove them first. If the value of
v4_bfd_up_sessions or v6_bfd_up_sessions has quotes or spaces between the
commas, remove them.
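The normalization can be sketched with two small helpers. `clean_value` and `clean_session_list` are hypothetical names, and the sample values in the usage are made up; the actual deserializer code is not part of this description.

```rust
/// Strips surrounding whitespace and single or double quotes from a raw value.
pub fn clean_value(raw: &str) -> &str {
    raw.trim_matches(|c: char| c == '"' || c == '\'' || c.is_whitespace())
}

/// Splits a comma-separated list (as in v4_bfd_up_sessions) and cleans each
/// item, dropping items that are empty after cleaning.
pub fn clean_session_list(raw: &str) -> Vec<String> {
    raw.split(',')
        .map(clean_value)
        .filter(|item| !item.is_empty())
        .map(str::to_string)
        .collect()
}
```

After this step, a quoted timestamp such as `"1718053542000"` can be handed to the normal numeric parser as `1718053542000`.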
@zjswhhh zjswhhh requested a review from prabhataravind October 1, 2025 20:23
@prabhataravind prabhataravind merged commit fea3794 into Azure:202506 Oct 1, 2025
2 checks passed
@zjswhhh zjswhhh deleted the cp branch October 1, 2025 23:45
zjswhhh pushed a commit to zjswhhh/sonic-dash-ha.msft that referenced this pull request Nov 19, 2025
- swbusd is up with initial routes loaded from yaml
- swbusd instances can be interconnected and routing is working
- swbuscli is implemented for troubleshooting
- ping in swbuscli to a remote swbusd through the local swbusd is working
- show route in swbuscli displays the route table of the connected swbusd

TODOs:
- Client reconnect to remote swbusd is implemented, but without source
port randomization on each retry
- Change logging from println to trace
- Add unit tests
- Implement route update through route queries

---------

Co-authored-by: r12f <r12f.code@gmail.com>