Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor handling of orphaned resources in the coordinator #1580

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

jluebbe
Copy link
Member

@jluebbe jluebbe commented Jan 24, 2025

Labgrid currently uses bidirectional streams between coordinator and
client/exporter. For the client side this is a good fit, since the
client sends requests and the coordinator can answer directly. However
for the exporter we have a case where nested calls were used in the old
crossbar infrastructure, namely when re-acquiring a resource after the
exporter was offline but the place was kept acquired. We call these
orphaned resources. They replace the real resource on the coordinator
side until the resource can be reacquired on the respective exporter
after it has restarted.

With crossbar, when seeing the resource update, the coordinator could
directly call the exporter to acquire the resource for the specific
place. This was possible since crossbar did the RPC route handling and
arbitrary services connected to the crossbar could provide RPC calls to
the service. With GRPC, we are more constrained. Since we only have a
single Input/Output stream which needs to multiplex different objects,
nested calls are not directly supported, since the exporter side would
still wait for the coordinator to answer its own request.

A different approach to orphaned resource handling is required. The
coordinator now uses a loop where it checks the orphaned resources and
tries to reacquire them if the exporter reappears. This however
introduces another problem, the exporter can be under high load and thus
the acquire request from the coordinator can time out. In this case, we
need to abort the acquisition during a regular lock and in case of an
orphaned resource need to replace the orphaned resource with the
eventually acquired resource from the exporter.

We also need to handle the case where the exporter has an acquired
resource, but the place has been released in the meantime (perhaps due
to a timeout on a normal place acquire), the same poll loop handles this
in the coordinator as well.

All in all this means that the resource acquired state for each place is
not necessarily consistent on the coordinator, but will reach an
eventual consistent state. This should be sufficient, since exporter
restarts with orphaned resources should be relatively rare.

Emantor and others added 2 commits January 24, 2025 14:18
Check if a place acquire request tries to lock a resource that is marked
as orphaned. This means we no longer need to reacquire orphaned resource
before trying to acquire a places resources since we now refuse to
acquire the resources.

This avoids long delays on aquire calls if the exporter responsible for
an (unrelated) orphaned resource doesn't process commands quickly.

Signed-off-by: Rouven Czerwinski <r.czerwinski@pengutronix.de>
Labgrid currently uses bidirectional streams between coordinator and
client/exporter. For the client side this is a good fit, since the
client sends requests and the coordinator can answer directly. However
for the exporter we have a case where nested calls were used in the old
crossbar infrastructure, namely when re-acquiring a resource after the
exporter was offline but the place was kept acquired. We call these
orphaned resources. They replace the real resource on the coordinator
side until the resource can be reacquired on the respective exporter
after it has restarted.

With crossbar, when seeing the resource update, the coordinator could
directly call the exporter to acquire the resource for the specific
place. This was possible since crossbar did the RPC route handling and
arbitrary services connected to the crossbar could provide RPC calls to
the service. With GRPC, we are more constrained. Since we only have a
single Input/Output stream which needs to multiplex different objects,
nested calls are not directly supported, since the exporter side would
still wait for the coordinator to answer its own request.

A different approach to orphaned resource handling is required. The
coordinator now uses a loop where it checks the orphaned resources and
tries to reacquire them if the exporter reappears. This however
introduces another problem, the exporter can be under high load and thus
the acquire request from the coordinator can time out. In this case, we
need to abort the acquisition during a regular lock and in case of an
orphaned resource need to replace the orphaned resource with the
eventually acquired resource from the exporter.

We also need to handle the case where the exporter has an acquired
resource, but the place has been released in the meantime (perhaps due
to a timeout on a normal place acquire), the same poll loop handles this
in the coordinator as well.

All in all this means that the resource acquired state for each place is
not necessarily consistent on the coordinator, but will reach an
eventual consistent state. This should be sufficient, since exporter
restarts with orphaned resources should be relatively rare.

Signed-off-by: Jan Luebbe <jlu@pengutronix.de>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants