
Conversation


@eicherseiji eicherseiji commented Jan 7, 2026

Why are these changes needed?

This PR adds gRPC-based inter-deployment communication for Ray Serve, allowing deployments to communicate with each other over gRPC transport instead of Ray actor calls. The benchmarks below show throughput and latency improvements for payloads under roughly 1 MB.

Key Changes

  1. gRPC Server on Replicas: Each replica now starts a gRPC server that can handle requests from other deployments.

  2. gRPC Replica Wrapper: A new gRPCReplicaWrapper class handles sending requests via gRPC and processing responses.

  3. Handle Options: The _by_reference option on handles controls whether to use Ray actor calls (True) or gRPC transport (False); see the usage sketch after this list.

  4. New Environment Variables:

    • RAY_SERVE_USE_GRPC_BY_DEFAULT: Master flag to enable gRPC transport by default for all inter-deployment communication
    • RAY_SERVE_PROXY_USE_GRPC: Controls whether the proxy uses gRPC transport (defaults to the master flag value)
    • RAY_SERVE_GRPC_MAX_MESSAGE_SIZE: Configures the maximum gRPC message size (default: 2GB-1)
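
A minimal usage sketch of the options above (illustrative only: the `Upstream`/`Downstream` deployment names are made up, and the final handle API may differ from this PR's implementation):

```python
import os

from ray import serve

# Optionally flip the master flag so inter-deployment traffic uses gRPC by
# default; normally this would be set in the cluster environment.
os.environ["RAY_SERVE_USE_GRPC_BY_DEFAULT"] = "1"


@serve.deployment
class Downstream:
    def __call__(self, payload: bytes) -> int:
        return len(payload)


@serve.deployment
class Upstream:
    def __init__(self, downstream):
        # _by_reference=False opts this handle into the gRPC transport
        # instead of Ray actor calls.
        self._downstream = downstream.options(_by_reference=False)

    async def __call__(self, payload: bytes) -> int:
        return await self._downstream.remote(payload)


app = Upstream.bind(Downstream.bind())
```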

Related issue number

N/A

Checks

  • I've signed all my commits
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      temporary testing hook, I've added it under the API Reference
      (Experimental) page.
  • I've made sure the tests are passing. Note that there might be a few flaky tests.

Test Plan

  • python/ray/serve/tests/test_grpc_e2e.py
  • python/ray/serve/tests/test_grpc_replica_wrapper.py
  • python/ray/serve/tests/unit/test_grpc_replica_result.py

Benchmarks

Script available here

Results show throughput and latency improvements with gRPC for message sizes below roughly 1 MB; for larger payloads, the existing Plasma (Ray actor call) path wins.

[benchmark_plot figure]

     ==============================================================================
       gRPC vs Plasma Benchmark Results
     ==============================================================================

       Payload  Metric               Plasma         gRPC          Δ     Winner
       ----------------------------------------------------------------------
       1 KB     Latency p50          2.63ms       1.89ms       +28%       gRPC
                Chain p50            4.11ms       3.02ms       +26%       gRPC
                Throughput            160/s        190/s       +16%       gRPC
       ----------------------------------------------------------------------
       10 KB    Latency p50          2.68ms       1.68ms       +37%       gRPC
                Chain p50            3.91ms       2.94ms       +25%       gRPC
                Throughput            167/s        185/s       +10%       gRPC
       ----------------------------------------------------------------------
       100 KB   Latency p50          2.74ms       2.02ms       +26%       gRPC
                Chain p50            4.28ms       3.06ms       +28%       gRPC
                Throughput            157/s        182/s       +13%       gRPC
       ----------------------------------------------------------------------
       500 KB   Latency p50          5.78ms       3.52ms       +39%       gRPC
                Chain p50            5.65ms       4.82ms       +15%       gRPC
                Throughput            114/s        144/s       +21%       gRPC
       ----------------------------------------------------------------------
       1 MB     Latency p50          6.31ms       5.18ms       +18%       gRPC
                Chain p50            5.96ms       6.20ms        -4%     Plasma
                Throughput            130/s        165/s       +21%       gRPC
       ----------------------------------------------------------------------
       2 MB     Latency p50          8.82ms       9.57ms        -9%     Plasma
                Chain p50            7.20ms      10.69ms       -48%     Plasma
                Throughput            123/s        106/s       -16%     Plasma
       ----------------------------------------------------------------------
       5 MB     Latency p50         15.20ms      23.72ms       -56%     Plasma
                Chain p50            8.90ms      23.25ms      -161%     Plasma
                Throughput             78/s         49/s       -58%     Plasma
       ----------------------------------------------------------------------
       10 MB    Latency p50         25.02ms      34.34ms       -37%     Plasma
                Chain p50            9.72ms      34.71ms      -257%     Plasma
                Throughput             38/s         31/s       -24%     Plasma
       ----------------------------------------------------------------------

@eicherseiji eicherseiji requested a review from a team as a code owner January 7, 2026 01:54

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant new feature to Ray Serve: inter-deployment communication via gRPC. This provides an alternative to the default Ray actor calls, which can be beneficial for performance and for certain network environments. The feature is enabled by a new _by_reference=False option on deployment handles.

The changes are comprehensive, touching multiple layers of the Serve stack:

  • A new InterDeploymentService is defined in the protobuf, with RPCs for unary and streaming requests, including support for backpressure.
  • A flexible serialization layer (RPCSerializer) is added, supporting cloudpickle, pickle, msgpack, orjson, and a noop for raw bytes. This allows users to choose the best serialization method for their use case.
  • Replicas now run a gRPC server to handle these incoming requests. The logic is nicely encapsulated in a decorator (_wrap_inter_deployment_grpc_call).
  • On the client side, a new gRPCReplicaWrapper and gRPCReplicaResult are introduced to handle sending requests and receiving responses over gRPC.
  • The DeploymentHandle.options() method is extended with _by_reference, _request_serialization, and _response_serialization to control this new communication channel.
  • New tests are added to validate the gRPC transport, including different serialization methods and streaming.

Overall, this is a well-designed and well-implemented feature. The code is clear and the changes are well-contained. I have one suggestion for refactoring to reduce code duplication.
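
For illustration, a sketch of how these handle options might be combined (hypothetical deployment/app names; the serializer strings are assumed to mirror the serializer names listed above):

```python
from ray import serve

# Hypothetical deployment and application names.
handle = serve.get_deployment_handle("Downstream", "app")

grpc_handle = handle.options(
    _by_reference=False,               # route calls over gRPC instead of Ray actor calls
    _request_serialization="msgpack",  # assumed value; serializers listed above: cloudpickle, pickle, msgpack, orjson, noop
    _response_serialization="cloudpickle",
)
```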

rpc HandleRequestWithRejection(InterDeploymentRequest) returns (InterDeploymentResponse);
// Streaming request with rejection support for backpressure.
rpc HandleRequestWithRejectionStreaming(InterDeploymentRequest) returns (stream InterDeploymentResponse);
}

Proto file modified requires RPC standards review (Bugbot Rules)

Medium Severity

⚠️ This PR modifies one or more .proto files.
Please review the RPC fault-tolerance & idempotency standards guide here:
https://github.com/ray-project/ray/tree/master/doc/source/ray-core/internals/rpc-fault-tolerance.rst


@eicherseiji eicherseiji marked this pull request as draft January 7, 2026 02:01
@eicherseiji eicherseiji changed the title from "[Serve] Add gRPC inter-deployment communication" to "[Serve][WIP][Do not merge] Add gRPC inter-deployment communication" Jan 7, 2026
@eicherseiji (author) left four inline review comments, each reading: Identical to parity implementation

Add support for deployments to communicate via gRPC instead of Ray actor
calls. This is enabled by setting `_by_reference=False` on a deployment
handle:

```python
handle = serve.get_deployment_handle("Downstream", "app")
grpc_handle = handle.options(_by_reference=False)
result = await grpc_handle.remote(data)
```

Changes:
- Add InterDeploymentService protobuf with HandleRequest, HandleRequestStreaming,
  HandleRequestWithRejection, and HandleRequestWithRejectionStreaming RPCs
- Add _by_reference, _request_serialization, _response_serialization handle options
- Add RPCSerializer supporting cloudpickle, pickle, msgpack, orjson, noop
- Add gRPCReplicaWrapper for client-side gRPC transport
- Add gRPCReplicaResult for handling gRPC responses
- Add gRPC server to Replica implementing InterDeploymentServiceServicer
- Add grpc_port field to RunningReplicaInfo

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
@eicherseiji eicherseiji force-pushed the grpc-inter-deployment branch 2 times, most recently from 27d99b1 to fda4b8e on January 7, 2026 21:49
logger = logging.getLogger(SERVE_LOGGER_NAME)


def _wrap_grpc_call(f):
@eicherseiji (author): Identical to parity implementation

with self._handle_errors_and_metrics(request_metadata) as status_code_callback:
yield status_code_callback

@_wrap_grpc_call
@eicherseiji (author): Following code block identical to parity implementation

return self._obj_ref_gen


class gRPCReplicaResult(ReplicaResult):
@eicherseiji (author): Identical to parity implementation

GRPC_CONTEXT_ARG_NAME,
HEALTH_CHECK_METHOD,
RAY_SERVE_COLLECT_AUTOSCALING_METRICS_ON_HANDLE,
RAY_SERVE_GRPC_MAX_MESSAGE_SIZE,
@eicherseiji (author): Divergence: Currently set to 2GB. Should we bump this to 4GB?

A reviewer (Contributor): let's bump it up

)


class gRPCReplicaWrapper(ReplicaWrapper):
@eicherseiji (author): Identical to parity implementation

_by_reference=handle_options._by_reference,
_on_separate_loop=init_options._run_router_in_separate_loop,
request_serialization=handle_options.request_serialization,
response_serialization=handle_options.response_serialization,
@eicherseiji (author): Additions identical to parity implementation

"RAY_SERVE_RUN_ROUTER_IN_SEPARATE_LOOP", "1"
)

# For now, this is used only for testing. In the suite of tests that
@eicherseiji (author): Logic identical to parity implementation

@eicherseiji (author) left three more inline review comments, each reading: Identical

@eicherseiji eicherseiji added the go (add ONLY when ready to merge, run all tests) label Jan 7, 2026
@eicherseiji eicherseiji changed the title from "[Serve][WIP][Do not merge] Add gRPC inter-deployment communication" to "[Serve] Add gRPC inter-deployment communication" Jan 7, 2026
@eicherseiji eicherseiji force-pushed the grpc-inter-deployment branch 2 times, most recently from 280d413 to 0ec7116 on January 7, 2026 23:04
MESSAGE_PACK_OFFSET = 9


def asyncio_grpc_exception_handler(loop, context):
@eicherseiji (author): Identical to parity implementation

@eicherseiji eicherseiji marked this pull request as ready for review January 8, 2026 18:25
@eicherseiji (author): Identical to parity implementation

@ray-gardener ray-gardener bot added the serve (Ray Serve Related Issue) label Jan 8, 2026
RAY_SERVE_GRPC_MAX_MESSAGE_SIZE,
)
]
)

Missing max_send_message_length in gRPC server configuration

High Severity

The gRPC server and client channel configurations only set grpc.max_receive_message_length but are missing grpc.max_send_message_length. The existing DEFAULT_GRPC_SERVER_OPTIONS in constants.py sets both options. Without max_send_message_length, the server cannot send responses larger than the default 4MB limit, and the client cannot send requests larger than 4MB. This will cause failures for inter-deployment communication with large payloads, which contradicts the PR's benchmarks testing up to 10MB messages.
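
For reference, a sketch of how both limits are typically configured with grpcio (illustrative only; the actual wiring and constant names in Serve may differ):

```python
import grpc

MAX_MESSAGE_SIZE = 2**31 - 1  # illustrative 2 GiB - 1 limit


async def make_server_and_channel():
    # Server side: set both receive and send limits, as DEFAULT_GRPC_SERVER_OPTIONS
    # in constants.py does.
    server = grpc.aio.server(
        options=[
            ("grpc.max_receive_message_length", MAX_MESSAGE_SIZE),
            ("grpc.max_send_message_length", MAX_MESSAGE_SIZE),
        ]
    )

    # Client side: the channel needs the same pair of options to send large
    # requests and receive large responses ("replica-host:9000" is hypothetical).
    channel = grpc.aio.insecure_channel(
        "replica-host:9000",
        options=[
            ("grpc.max_receive_message_length", MAX_MESSAGE_SIZE),
            ("grpc.max_send_message_length", MAX_MESSAGE_SIZE),
        ],
    )
    return server, channel
```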

