
Conversation


@eicherseiji eicherseiji commented Jan 7, 2026

Why are these changes needed?

This PR adds gRPC-based inter-deployment communication for Ray Serve, allowing deployments to communicate with each other over gRPC transport instead of Ray actor calls. The benchmarks below show throughput and latency improvements for payloads under roughly 1 MB.

Key Changes

  1. gRPC Server on Replicas: Each replica now starts a gRPC server that can handle requests from other deployments.

  2. gRPC Replica Wrapper: A new gRPCReplicaWrapper class handles sending requests via gRPC and processing responses.

  3. Handle Options: The _by_reference option on handles controls whether to use Ray actor calls (True) or gRPC transport (False); see the usage sketch after this list.

  4. New Environment Variables:

    • RAY_SERVE_USE_GRPC_BY_DEFAULT: Master flag to enable gRPC transport by default for all inter-deployment communication
    • RAY_SERVE_PROXY_USE_GRPC: Controls whether the proxy uses gRPC transport (defaults to the master flag value)
    • RAY_SERVE_GRPC_MAX_MESSAGE_SIZE: Configures the maximum gRPC message size (default: 2GB-1)
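
A minimal usage sketch of the options above (illustrative only: the `Upstream`/`Downstream` deployment names are made up, and the final handle API may differ from this PR's implementation):

```python
import os

from ray import serve

# Optionally flip the master flag so inter-deployment traffic uses gRPC by
# default; normally this would be set in the cluster environment.
os.environ["RAY_SERVE_USE_GRPC_BY_DEFAULT"] = "1"


@serve.deployment
class Downstream:
    def __call__(self, payload: bytes) -> int:
        return len(payload)


@serve.deployment
class Upstream:
    def __init__(self, downstream):
        # _by_reference=False opts this handle into the gRPC transport
        # instead of Ray actor calls.
        self._downstream = downstream.options(_by_reference=False)

    async def __call__(self, payload: bytes) -> int:
        return await self._downstream.remote(payload)


app = Upstream.bind(Downstream.bind())
```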

Related issue number

N/A

Checks

  • I've signed all my commits
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      temporary testing hook, I've added it under the API Reference
      (Experimental) page.
  • I've made sure the tests are passing. Note that there might be a few flaky tests.

Test Plan

  • python/ray/serve/tests/test_grpc_e2e.py
  • python/ray/serve/tests/test_grpc_replica_wrapper.py
  • python/ray/serve/tests/unit/test_grpc_replica_result.py

Benchmarks

Script available here

Results show throughput and latency improvements with gRPC for message sizes below roughly 1 MB; for larger payloads, the existing Plasma (Ray actor call) path wins.

[benchmark_plot figure]

     ==============================================================================
       gRPC vs Plasma Benchmark Results
     ==============================================================================

       Payload  Metric               Plasma         gRPC          Δ     Winner
       ----------------------------------------------------------------------
       1 KB     Latency p50          2.63ms       1.89ms       +28%       gRPC
                Chain p50            4.11ms       3.02ms       +26%       gRPC
                Throughput            160/s        190/s       +16%       gRPC
       ----------------------------------------------------------------------
       10 KB    Latency p50          2.68ms       1.68ms       +37%       gRPC
                Chain p50            3.91ms       2.94ms       +25%       gRPC
                Throughput            167/s        185/s       +10%       gRPC
       ----------------------------------------------------------------------
       100 KB   Latency p50          2.74ms       2.02ms       +26%       gRPC
                Chain p50            4.28ms       3.06ms       +28%       gRPC
                Throughput            157/s        182/s       +13%       gRPC
       ----------------------------------------------------------------------
       500 KB   Latency p50          5.78ms       3.52ms       +39%       gRPC
                Chain p50            5.65ms       4.82ms       +15%       gRPC
                Throughput            114/s        144/s       +21%       gRPC
       ----------------------------------------------------------------------
       1 MB     Latency p50          6.31ms       5.18ms       +18%       gRPC
                Chain p50            5.96ms       6.20ms        -4%     Plasma
                Throughput            130/s        165/s       +21%       gRPC
       ----------------------------------------------------------------------
       2 MB     Latency p50          8.82ms       9.57ms        -9%     Plasma
                Chain p50            7.20ms      10.69ms       -48%     Plasma
                Throughput            123/s        106/s       -16%     Plasma
       ----------------------------------------------------------------------
       5 MB     Latency p50         15.20ms      23.72ms       -56%     Plasma
                Chain p50            8.90ms      23.25ms      -161%     Plasma
                Throughput             78/s         49/s       -58%     Plasma
       ----------------------------------------------------------------------
       10 MB    Latency p50         25.02ms      34.34ms       -37%     Plasma
                Chain p50            9.72ms      34.71ms      -257%     Plasma
                Throughput             38/s         31/s       -24%     Plasma
       ----------------------------------------------------------------------

@eicherseiji eicherseiji requested a review from a team as a code owner January 7, 2026 01:54

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant new feature to Ray Serve: inter-deployment communication via gRPC. This provides an alternative to the default Ray actor calls, which can be beneficial for performance and for certain network environments. The feature is enabled by a new _by_reference=False option on deployment handles.

The changes are comprehensive, touching multiple layers of the Serve stack:

  • A new InterDeploymentService is defined in the protobuf, with RPCs for unary and streaming requests, including support for backpressure.
  • A flexible serialization layer (RPCSerializer) is added, supporting cloudpickle, pickle, msgpack, orjson, and a noop for raw bytes. This allows users to choose the best serialization method for their use case.
  • Replicas now run a gRPC server to handle these incoming requests. The logic is nicely encapsulated in a decorator (_wrap_inter_deployment_grpc_call).
  • On the client side, a new gRPCReplicaWrapper and gRPCReplicaResult are introduced to handle sending requests and receiving responses over gRPC.
  • The DeploymentHandle.options() method is extended with _by_reference, _request_serialization, and _response_serialization to control this new communication channel.
  • New tests are added to validate the gRPC transport, including different serialization methods and streaming.

Overall, this is a well-designed and well-implemented feature. The code is clear and the changes are well-contained. I have one suggestion for refactoring to reduce code duplication.
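
For illustration, a sketch of how these handle options might be combined (hypothetical deployment/app names; the serializer strings are assumed to mirror the serializer names listed above):

```python
from ray import serve

# Hypothetical deployment and application names.
handle = serve.get_deployment_handle("Downstream", "app")

grpc_handle = handle.options(
    _by_reference=False,               # route calls over gRPC instead of Ray actor calls
    _request_serialization="msgpack",  # assumed value; serializers listed above: cloudpickle, pickle, msgpack, orjson, noop
    _response_serialization="cloudpickle",
)
```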

rpc HandleRequestWithRejection(InterDeploymentRequest) returns (InterDeploymentResponse);
// Streaming request with rejection support for backpressure.
rpc HandleRequestWithRejectionStreaming(InterDeploymentRequest) returns (stream InterDeploymentResponse);
}

Proto file modified requires RPC standards review (Bugbot Rules)

Medium Severity

⚠️ This PR modifies one or more .proto files.
Please review the RPC fault-tolerance & idempotency standards guide here:
https://github.com/ray-project/ray/tree/master/doc/source/ray-core/internals/rpc-fault-tolerance.rst


@eicherseiji eicherseiji marked this pull request as draft January 7, 2026 02:01
@eicherseiji eicherseiji changed the title from "[Serve] Add gRPC inter-deployment communication" to "[Serve][WIP][Do not merge] Add gRPC inter-deployment communication" Jan 7, 2026
@eicherseiji (author) left four inline review comments, each reading: Identical to parity implementation

Add support for deployments to communicate via gRPC instead of Ray actor
calls. This is enabled by setting `_by_reference=False` on a deployment
handle:

```python
handle = serve.get_deployment_handle("Downstream", "app")
grpc_handle = handle.options(_by_reference=False)
result = await grpc_handle.remote(data)
```

Changes:
- Add InterDeploymentService protobuf with HandleRequest, HandleRequestStreaming,
  HandleRequestWithRejection, and HandleRequestWithRejectionStreaming RPCs
- Add _by_reference, _request_serialization, _response_serialization handle options
- Add RPCSerializer supporting cloudpickle, pickle, msgpack, orjson, noop
- Add gRPCReplicaWrapper for client-side gRPC transport
- Add gRPCReplicaResult for handling gRPC responses
- Add gRPC server to Replica implementing InterDeploymentServiceServicer
- Add grpc_port field to RunningReplicaInfo

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
@eicherseiji eicherseiji force-pushed the grpc-inter-deployment branch 2 times, most recently from 27d99b1 to fda4b8e on January 7, 2026 21:49
logger = logging.getLogger(SERVE_LOGGER_NAME)


def _wrap_grpc_call(f):
@eicherseiji (author): Identical to parity implementation

with self._handle_errors_and_metrics(request_metadata) as status_code_callback:
yield status_code_callback

@_wrap_grpc_call
@eicherseiji (author): Following code block identical to parity implementation

return self._obj_ref_gen


class gRPCReplicaResult(ReplicaResult):
@eicherseiji (author): Identical to parity implementation

GRPC_CONTEXT_ARG_NAME,
HEALTH_CHECK_METHOD,
RAY_SERVE_COLLECT_AUTOSCALING_METRICS_ON_HANDLE,
RAY_SERVE_GRPC_MAX_MESSAGE_SIZE,
@eicherseiji (author): Divergence: Currently set to 2GB. Should we bump this to 4GB?

A reviewer (Contributor): let's bump it up

)


class gRPCReplicaWrapper(ReplicaWrapper):
@eicherseiji (author): Identical to parity implementation

_by_reference=handle_options._by_reference,
_on_separate_loop=init_options._run_router_in_separate_loop,
request_serialization=handle_options.request_serialization,
response_serialization=handle_options.response_serialization,
@eicherseiji (author): Additions identical to parity implementation

"RAY_SERVE_RUN_ROUTER_IN_SEPARATE_LOOP", "1"
)

# For now, this is used only for testing. In the suite of tests that
@eicherseiji (author): Logic identical to parity implementation

@eicherseiji (author) left three more inline review comments, each reading: Identical

@eicherseiji eicherseiji added the go (add ONLY when ready to merge, run all tests) label Jan 7, 2026
@eicherseiji eicherseiji changed the title from "[Serve][WIP][Do not merge] Add gRPC inter-deployment communication" to "[Serve] Add gRPC inter-deployment communication" Jan 7, 2026
@eicherseiji eicherseiji force-pushed the grpc-inter-deployment branch 2 times, most recently from 280d413 to 0ec7116 on January 7, 2026 23:04
MESSAGE_PACK_OFFSET = 9


def asyncio_grpc_exception_handler(loop, context):
@eicherseiji (author): Identical to parity implementation

@eicherseiji eicherseiji marked this pull request as ready for review January 8, 2026 18:25
@eicherseiji (author): Identical to parity implementation

@ray-gardener ray-gardener bot added the serve (Ray Serve Related Issue) label Jan 8, 2026
RAY_SERVE_GRPC_MAX_MESSAGE_SIZE,
)
]
)

Missing max_send_message_length in gRPC server configuration

High Severity

The gRPC server and client channel configurations only set grpc.max_receive_message_length but are missing grpc.max_send_message_length. The existing DEFAULT_GRPC_SERVER_OPTIONS in constants.py sets both options. Without max_send_message_length, the server cannot send responses larger than the default 4MB limit, and the client cannot send requests larger than 4MB. This will cause failures for inter-deployment communication with large payloads, which contradicts the PR's benchmarks testing up to 10MB messages.
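
For reference, a sketch of how both limits are typically configured with grpcio (illustrative only; the actual wiring and constant names in Serve may differ):

```python
import grpc

MAX_MESSAGE_SIZE = 2**31 - 1  # illustrative 2 GiB - 1 limit


async def make_server_and_channel():
    # Server side: set both receive and send limits, as DEFAULT_GRPC_SERVER_OPTIONS
    # in constants.py does.
    server = grpc.aio.server(
        options=[
            ("grpc.max_receive_message_length", MAX_MESSAGE_SIZE),
            ("grpc.max_send_message_length", MAX_MESSAGE_SIZE),
        ]
    )

    # Client side: the channel needs the same pair of options to send large
    # requests and receive large responses ("replica-host:9000" is hypothetical).
    channel = grpc.aio.insecure_channel(
        "replica-host:9000",
        options=[
            ("grpc.max_receive_message_length", MAX_MESSAGE_SIZE),
            ("grpc.max_send_message_length", MAX_MESSAGE_SIZE),
        ],
    )
    return server, channel
```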

