@tiagolobocastro

In some cases the connection to a replica ends up in an odd state where the admin queue (adminq) keeps failing to send the keep-alive, and the nexus subsystem pause hangs.

The theory is that some IOs are stuck. Prior to 2.11, replica IO timeouts were not always handled correctly, which may be a contributing factor here.

To try to unstick the hang, we now wait up to the adminq keep-alive timeout and then, if the nexus is pausing, forcibly disconnect the bad replica.
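
As a rough illustration of the decision described above (not the actual change in this PR), the logic can be sketched as a small pure function. All names here (`NexusState`, `Action`, `on_keep_alive_failure`) are hypothetical and do not come from the codebase. The sketch assumes the caller records when keep-alives first started failing.

```rust
use std::time::{Duration, Instant};

/// Hypothetical view of the nexus pause state (not the real type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NexusState {
    Running,
    Pausing,
}

/// Hypothetical outcome of evaluating a failing replica connection.
#[derive(Debug, PartialEq, Eq)]
enum Action {
    KeepWaiting,
    ForceDisconnect,
}

/// Decide what to do with a replica whose adminq keep-alive keeps failing.
/// The replica is force-disconnected only once the keep-alive timeout has
/// elapsed since the first failure AND the nexus is in the middle of a pause.
fn on_keep_alive_failure(
    first_failure: Instant,
    now: Instant,
    keep_alive_timeout: Duration,
    nexus: NexusState,
) -> Action {
    // saturating_duration_since yields zero if `now` precedes `first_failure`.
    let waited = now.saturating_duration_since(first_failure);
    if waited >= keep_alive_timeout && nexus == NexusState::Pausing {
        Action::ForceDisconnect
    } else {
        Action::KeepWaiting
    }
}
```

Keeping the decision free of side effects makes the two conditions (timeout elapsed, nexus pausing) easy to test separately from the disconnect itself.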

Signed-off-by: Tiago Castro <tiagolobocastro@gmail.com>