
Map NO_ERROR h2 code to Code.Internal #1113

Closed · wants to merge 2 commits
Conversation

srikrsna-buf
Member

Map NO_ERROR h2 code to Code.Internal.

Signed-off-by: Sri Krishna Paritala <skrishna@buf.build>
Member

@jhump left a comment

LGTM. Just one question, but nothing blocking for this change.

Comment on lines +331 to +332
stream.rstCode === http2.constants.NGHTTP2_NO_ERROR &&
!receivedData
Member

What happens if we've received data but no status with the gRPC protocol (which requires a "grpc-status" trailer)? Ideally, that would also result in an internal error, so that if we receive message data but the stream is aborted before we get the final trailers, we can still report an error.

When we were both looking at this, we spent a good bit of time also looking at the "GOAWAY" case. IIUC, this doesn't address that -- if there is a "GOAWAY" frame with a "NO_ERROR" status, any streams that were created before the frame was received but with a stream ID higher than the server-indicated max ID will just hang until the connection is closed (in which case they get a network error), right? I know that's a separate bug that doesn't need to be addressed here. Just confirming that my understanding is correct and that it is indeed a separate bug. I think where we left it is that it needed some more testing to see exactly what the Node HTTP/2 implementation would do. If there is indeed another semi-related issue with that, we should make sure there is an issue to track it.

Member Author

For the GOAWAY case we need testing to confirm the behavior.

What happens if we've received data but no status with the gRPC protocol (which requires a "grpc-status" trailer)?

If we have already started receiving data, will the server send a NO_ERROR? I understand it can do so when lame-ducking, in which case it is not an error for the current stream but rather an indication of no new connections?

Member

The lame ducking uses GOAWAY frames, which are connection-wide, not stream-specific. I thought this chunk of code was called in the face of a RST_STREAM frame, which is for terminating a single stream/operation, not the whole connection.

A server could send a RST_STREAM frame with NO_ERROR at any time, even if data has already been sent. It may not be common/expected in practice, but I thought that was actually the reported bug -- that it does happen and the client doesn't handle it correctly. I think the whole reason it gets classified as "internal" in the gRPC spec is because it is unexpected and shouldn't really happen in practice.

There is one "expected" scenario where this happens which shouldn't (hopefully) trigger this code. It should actually be ignored by the client because the client should have already reported a status from the trailers/end of body. It happens if the server ends the response stream (and, in gRPC, sends back trailers with a status), but the client never half-closed the request side of the stream. In such an instance, the RST_STREAM frame tells the client that the stream is complete and to not bother trying to send anything else for the stream, even to end/close the request side.

A Google search suggests that some HTTP/2 proxies may also do this to selectively cancel streams for which the downstream connection (to backend) has been closed or experienced some network failure (though they should really use INTERNAL_ERROR instead; but either way it would get reported to the client as an "internal" RPC error).

So with the current logic, where we have to just ignore the error if data has been sent (since we can't distinguish the framework doing this to signal the stream is done vs. an actual RST_STREAM frame with NO_ERROR), we would hopefully detect that this is an error because we'd observe that the response was not complete (either incomplete/malformed body and/or missing trailers/status).
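
A minimal sketch of that fallback using Node's http2 API; the receivedStatus flag and the event wiring below are hypothetical illustrations of the idea, not the connect-es implementation:

```ts
import * as http2 from "node:http2";

// Illustrative only: ignore RST_STREAM with NO_ERROR once data has arrived,
// but still flag the call as failed if the stream closes before a status
// (gRPC trailers / end-of-body status) was seen. "receivedStatus" is a
// hypothetical flag the protocol layer would set after parsing the status.
function watchStream(stream: http2.ClientHttp2Stream): void {
  let receivedData = false;
  let receivedStatus = false;

  stream.on("response", () => {
    receivedData = true;
  });
  stream.on("trailers", () => {
    // Simplification: a real check would parse grpc-status from the trailers.
    receivedStatus = true;
  });
  stream.on("close", () => {
    if (
      stream.rstCode === http2.constants.NGHTTP2_NO_ERROR &&
      receivedData &&
      !receivedStatus
    ) {
      // The response ended without a status: report an internal error to the caller.
    }
  });
}
```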

Member

if there is a "GOAWAY" frame with a "NO_ERROR" status, any streams that were created before the frame was received but with a stream ID higher than the server-indicated max ID will just hang until the connection is closed (in which case they get a network error), right?

To clarify: For a graceful shutdown (GOAWAY frame with NO_ERROR), the client keeps any open streams open until they finish, and opens a new connection for new streams. So any unfinished requests continue as normal, and new requests work as normal from the user's perspective.

But you observed different behavior with a last stream ID that does not include one or more open streams, and the streams in question hang, as in they no longer receive or transmit data?

If this is a situation we can expect in regular operation, this does indeed look like a bug. It would be helpful to have a reproducible example.

Member

To clarify: For a graceful shutdown (GOAWAY frame with NO_ERROR), the client keeps any open streams open until they finish, and opens a new connection for new streams. So any unfinished requests continue as normal, and new requests work as normal from the user's perspective.

This isn't entirely correct. The GOAWAY frame includes a max stream ID. So any stream created by the client whose ID is greater than that should not be allowed to finish but should instead be retried on a different connection. With the current implementation in connect-es, the max stream ID is not inspected and nothing is done with those streams, so it is possible/likely that they will simply hang until the connection is actually closed. However, this warrants further testing since it is unclear whether the Node HTTP/2 implementation does some amount of handling of GOAWAY transparently 🤷. Seems unlikely, but we should try to repro first. It is likely that fixing this in the connect-es code could require a backwards-incompatible change to this transport interface, so we should verify whether it's a bug before finalizing v2.
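
For reference, Node's http2 session surfaces the GOAWAY parameters through the "goaway" event. A minimal sketch of how a client could fail streams above the advertised last stream ID, assuming a hypothetical openStreams registry; this is not what connect-es does today:

```ts
import * as http2 from "node:http2";

// Illustrative only: on a graceful GOAWAY (NO_ERROR), fail streams whose ID is
// above the server-advertised last stream ID so they can be retried on a new
// connection instead of hanging. "openStreams" is a hypothetical registry of
// in-flight streams; the retry mechanism itself is out of scope here.
const session = http2.connect("https://example.com");
const openStreams = new Set<http2.ClientHttp2Stream>();

session.on("goaway", (errorCode: number, lastStreamID: number) => {
  if (errorCode === http2.constants.NGHTTP2_NO_ERROR) {
    for (const stream of openStreams) {
      if (stream.id !== undefined && stream.id > lastStreamID) {
        // The server never processed this stream; close it locally so the
        // caller can retry on a fresh connection.
        stream.close(http2.constants.NGHTTP2_REFUSED_STREAM);
      }
    }
  }
});
```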

Member

So any stream created by the client whose ID is greater than that

FWIW, this only happens when there is a race condition. For example, the client creates a stream and sends the headers frame. The client is responsible for assigning stream IDs, so it assigns ID X. The server sends the GOAWAY before processing the headers frame for stream X, so it may indicate a max stream ID of X - 2 (minus two, instead of one, since client-initiated streams are always odd).

@srikrsna-buf marked this pull request as ready for review on July 11, 2024 20:03
Comment on lines +321 to +323
stream.on("response", () => {
receivedData = true;
});
Member

It's sensible to raise an error for a RST_STREAM with NO_ERROR received before response headers. But I don't think it fully addresses #1108. The issue mentions an open streaming RPC, so the situation is much more likely to occur after response headers have already been received.

Ideally, the logic would be to raise an error for any RST_STREAM with NO_ERROR, unless we already received a status in headers or trailers.

This is impossible to implement in the universal client, since it doesn't know the status (which could be encoded in the body, or in trailers, depending on the protocol).

But can't we verify that we receive a trailer with a grpc-status field here? https://github.com/connectrpc/connect-es/blob/v1.4.0/packages/connect/src/protocol-grpc/transport.ts#L248 It looks like we currently only raise an error from the function validateTrailer if we receive an error code, not if we're missing the field.

This will be more complex to test (we should also check unary, gRPC-web, and Connect), but I think this should address the issue exhaustively.
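
A rough sketch of the suggested check. Code and ConnectError are the public connect API, but validateTrailerStrict and the exact field handling below are hypothetical, not the actual validateTrailer implementation:

```ts
import { Code, ConnectError } from "@connectrpc/connect";

// Illustrative only: reject trailers that carry an error code *and* trailers
// that are missing grpc-status entirely, instead of only the former.
function validateTrailerStrict(trailer: Headers): void {
  const status = trailer.get("grpc-status");
  if (status === null) {
    // Missing status falls under "error parsing returned status" in the gRPC
    // spec; whether to use Unknown or Internal is the open question in this thread.
    throw new ConnectError("missing grpc-status trailer", Code.Unknown);
  }
  const code = parseInt(status, 10);
  if (Number.isNaN(code)) {
    throw new ConnectError(`invalid grpc-status: ${status}`, Code.Unknown);
  }
  if (code !== 0) {
    throw new ConnectError(trailer.get("grpc-message") ?? "", code as Code);
  }
}
```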

Member Author

So should we raise an error in every case of a missing trailer, irrespective of the RST_STREAM?

Member

Yes, definitely. And we should add conformance test cases to verify this since it's possible that we have similar issues in other client implementations.

Looking at the specs for error codes, I think this would fall into the category of "Error parsing returned status", which would map to an UNKNOWN code.

Member Author

grpcurl (as stated in the issue by the user) reports code Internal; grpc-js also uses Internal (grpc/grpc-node#2569). I wonder what Go does.

@srikrsna-buf
Member Author

Superseded by #1205

@timostamm deleted the sk/h2-no-error branch on September 19, 2024 11:19