Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xds-client acknowledge race condition #4703

Closed
kkalin68 opened this issue Aug 25, 2021 · 7 comments
Closed

xds-client acknowledge race condition #4703

kkalin68 opened this issue Aug 25, 2021 · 7 comments

Comments

@kkalin68
Copy link

Looks like there is a race-condition that leads a partial services discovery by GRPC client. I see the same issue on discovering RouteConfiguration, Cluster or ClusterLoadAssignment

An example of race during ClusterLoadAssignment request - grpc.log

What version of gRPC are you using?

v1.40.0

What version of Go are you using (go version)?

1.16.7

What operating system (Linux, Windows, …) and version?

Linux

What did you do?

I use envoy/go-control-plane to build xDS server. GRPC clients connect to it to discover endpoints. Everything works fine if there is only one GRPC client. When I need to have more than one GRPC client (different destinations) within same application I see the issue.

What did you expect to see?

GRPC should complete the discovery when multiple clients are created

What did you see instead?

GRPC doesn't finish the discovery due to a race (I think it is a race).
Here is what I think is going on

  • GRPC sends a discovery request with one resource name
INFO: 2021/08/25 03:52:26 [xds] [xds-client 0xc0002def00] ADS request sent: {
  "node": {
    "id": "25386355-c3e2-41f5-ad44-2b003c3386f5",
    "metadata": {
        "TEST_PROJECT_ID": "DE455tt"
      },
    "userAgentName": "gRPC Go",
    "userAgentVersion": "1.40.0",
    "clientFeatures": [
      "envoy.lb.does_not_support_overprovisioning"
    ]
  },
  "resourceNames": [
    "howdy-service"
  ],
  "typeUrl": "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment"
}
  • Before XDS server replies on the request GRPC sends another discovery request of same type with two resource names
INFO: 2021/08/25 03:52:26 [xds] [xds-client 0xc0002def00] ADS request sent: {
  "node": {
    "id": "25386355-c3e2-41f5-ad44-2b003c3386f5",
    "metadata": {
        "TEST_PROJECT_ID": "DE455tt"
      },
    "userAgentName": "gRPC Go",
    "userAgentVersion": "1.40.0",
    "clientFeatures": [
      "envoy.lb.does_not_support_overprovisioning"
    ]
  },
  "resourceNames": [
    "howdy-service",
    "hello-service"
  ],
  "typeUrl": "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment"
}
  • XDS server replies on the first request and provides nonce at same moment when GRPC is sending the second discovery request
INFO: 2021/08/25 03:52:26 [xds] [xds-client 0xc0002def00] ADS response received: {
  "versionInfo": "1.0",
  "resources": [
    {
      "@type": "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment",
      "clusterName": "howdy-service",
      "endpoints": [
        {
          "locality": {
            "region": "local",
            "zone": "local"
          },
          "lbEndpoints": [
            {
              "endpoint": {
                "address": {
                  "socketAddress": {
                    "address": "10.42.0.19",
                    "portValue": 50052
                  }
                }
              }
            }
          ],
          "loadBalancingWeight": 1000
        },
        {
          "locality": {
            "region": "local",
            "zone": "local"
          },
          "lbEndpoints": [
            {
              "endpoint": {
                "address": {
                  "socketAddress": {
                    "address": "10.42.1.22",
                    "portValue": 50052
                  }
                }
              }
            }
          ],
          "loadBalancingWeight": 1000
        }
      ]
    }
  ],
  "typeUrl": "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment",
  "nonce": "4"
}
  • GRPC sees XDS response having resource configuration for one resource (first request) and acknowledges the message using responceNonce
INFO: 2021/08/25 03:52:26 [xds] [xds-client 0xc0002def00] Sending ACK for response type: EndpointsResource, version: 1.0, nonce: 4
INFO: 2021/08/25 03:52:26 [xds] [xds-cluster-resolver-lb 0xc0004df0e0] resource update: [
  {}
]
INFO: 2021/08/25 03:52:26 [xds] [priority-lb 0xc000565d00] Created
INFO: 2021/08/25 03:52:26 [xds] [xds-client 0xc0002def00] ADS request sent: {
  "versionInfo": "1.0",
  "node": {
    "id": "25386355-c3e2-41f5-ad44-2b003c3386f5",
    "metadata": {
        "TEST_PROJECT_ID": "DE455tt"
      },
    "userAgentName": "gRPC Go",
    "userAgentVersion": "1.40.0",
    "clientFeatures": [
      "envoy.lb.does_not_support_overprovisioning"
    ]
  },
  "resourceNames": [
    "howdy-service",
    "hello-service"
  ],
  "typeUrl": "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment",
  "responseNonce": "4"
}
  • XDS server creates a watch and never bothers to respond back on the second request
@easwars
Copy link
Contributor

easwars commented Aug 25, 2021

This is not a race in gRPC. This is an issue in the go-control-plane.

In gRPC-Go, we use the go-control-plane as our management server in our e2e tests, and we have seen the same issue. The way we work around this issue is to make sure that the management server has all resources before the gRPC client starts querying for a subset of them.

We had filed some bugs on the go-control-plane in the past, but there hasn't been a fix yet.
envoyproxy/go-control-plane#426
envoyproxy/go-control-plane#431

Please let us know if you still think the issue is in gRPC.

@kkalin68
Copy link
Author

kkalin68 commented Aug 25, 2021

@easwars XDS control plane already knows all resources at the moment. GRPC client requests for a single resource (see logs) and almost immediately sends another discovery requests for two resources. XDS responds back with single resource on the first request and didn't respond back on second request because GRPC client ACK the first response. For me it looks like GRPC client shouldn't ACK the first response because it contains only 1 resource when GRPC client knows that it has requested a discovery for two. Please let me know if I am missing something there ?

I have ads flag set to false since all my GRPC clients cannot know all services (and they don't need to know)

@easwars
Copy link
Contributor

easwars commented Aug 25, 2021

I still think gRPC is behaving correctly here.

From the logs, this is the sequence of events:

  • gRPC requests for resource foo: V=, R={foo}, N=, T=EDS
  • Before the server responds, gRPC requests for resource bar as well: V=, R={foo, bar}, N=, T=EDS
  • server responds with only foo: V=X, R={foo...}, N=A, T=EDS
  • gRPC acks the version and nonce, but still requests for foo and bar: V=X, R={foo, bar}, N=A, T=EDS. Based on this request, the server should know that gRPC has only received and acked foo, because it knows the response that it sent for V=X and N=A contained only foo. So, at this point, it is still the server's responsibility to send bar.

See https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol#knowing-when-a-requested-resource-does-not-exist, which talks about why gRPC cannot NACK the above response because it cannot confidently say that the server does not have have bar. It can only do so after waiting for the timeout to expire, at which point you do see the following log:

WARNING: 2021/08/25 03:52:41 [xds] [xds-cluster-resolver-lb 0xc0005a10e0] Watch error from xds-client 0x1740ea0: xds: EndpointsResource target hello-service not found, watcher timeout

I have ads flag set to false since all my GRPC clients cannot know all services (and they don't need to know)

What do you mean by this? gRPC only support ADS.

@kkalin68
Copy link
Author

@easwars Just to clarify - does the log below means that GRPC requests the resources ? I thought it is ACK. I see in xDS logs that xDS server creates a watch upon receiving the request .

INFO: 2021/08/25 03:52:26 [xds] [xds-client 0xc0002def00] ADS request sent: {
  "versionInfo": "1.0",
  "node": {
    "id": "25386355-c3e2-41f5-ad44-2b003c3386f5",
    "metadata": {
        "TEST_PROJECT_ID": "DE455tt"
      },
    "userAgentName": "gRPC Go",
    "userAgentVersion": "1.40.0",
    "clientFeatures": [
      "envoy.lb.does_not_support_overprovisioning"
    ]
  },
  "resourceNames": [
    "howdy-service",
    "hello-service"
  ],
  "typeUrl": "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment",
  "responseNonce": "4"
}

xDS logs

2021-08-25 03:52:26.434 DBG OnStreamRequest 1  Request[url=type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment; responseNonce=4; resources=[howdy-service hello-service]]
2021-08-25 03:52:26.435 DBG open watch 4 for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment[howdy-service hello-service] from nodeID "25386355-c3e2-41f5-ad44-2b003c3386f5", version "1.0" group=xds-cache

What do you mean by this? gRPC only support ADS.

ads flag in NewSnapshotCache
cache := cachev3.NewSnapshotCache(false, cachev3.IDHash{}, tools.NewLogger("xds-cache"))

@easwars
Copy link
Contributor

easwars commented Aug 25, 2021

@easwars Just to clarify - does the log below means that GRPC requests the resources ? I thought it is ACK. I see in xDS logs that xDS server creates a watch upon receiving the request .

ACK's are piggybacked. A client is expected to ACK or NACK every response that it receives from the server. When the version number matches the version number of the response sent by the server, it is considered an ACK. See: https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol#ack

ads flag in NewSnapshotCache
cache := cachev3.NewSnapshotCache(false, cachev3.IDHash{}, tools.NewLogger("xds-cache"))

I think the ads flag needs to be set to true here. According to the documentation on NewSnapshotCache:

// ADS flag forces a delay in responding to streaming requests until all
// resources are explicitly named in the request. This avoids the problem of a
// partial request over a single stream for a subset of resources which would
// require generating a fresh version for acknowledgement. ADS flag requires
// snapshot consistency. For non-ADS case (and fetch), mutliple partial
// requests are sent across multiple streams and re-using the snapshot version
// is OK.

We set to it to true in our uses in tests.

@kkalin68
Copy link
Author

I see. So the control plane ignores somehow the second discoveryRequest containing responseNonce in my case and just creates a watch. I will dig more there. Thank you for the confirmation. You are probably right that it is not grpc client issue.

I think the ads flag needs to be set to true here. According to the documentation on NewSnapshotCache

If you set the flag = true it means every GRPC client needs to know all available services registered on xDS server regardless if the client needs it or not. I use xDS for service discovery and grpc clients communicate to only services they need. Requirement to list all services kinda defeats the purpose of discovery.

Also I can confirm if I set ads flag to true and force grpc client to send a discovery for all resources, everything works as expected on my setup. But it doesn't work if a client needs only one or couple resources of the full set. xDS control plane ignores such requests.

@easwars
Copy link
Contributor

easwars commented Aug 25, 2021

If you end up opening an issue with the go-control-plane, I'd appreciate it you can tag me there. Just curious to see how this shapes up. Thanks.

@easwars easwars closed this as completed Aug 25, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 22, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants