[data][llm] Use numpy arrays for embeddings to avoid torch.Tensor serialization overhead #59919
Description
Currently, Ray Data LLM stores embeddings from pooling tasks (e.g., classify) as torch.Tensor in Ray dataset columns, which incurs redundant copies during serialization and deserialization. NumPy arrays avoid this overhead because Ray's shared-memory object store supports zero-copy deserialization for them.
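As a rough illustration of the change (the actual stage code isn't reproduced here; the to_numpy_embeddings function and the "pooled_output" / "embeddings" column names are hypothetical), a postprocessing step can hand Ray Data an np.ndarray instead of a torch.Tensor:

```python
import torch

def to_numpy_embeddings(batch: dict) -> dict:
    # "pooled_output" is a hypothetical column name for the model's pooled
    # embeddings; in the real stage they come from vLLM's pooling task.
    pooled: torch.Tensor = batch.pop("pooled_output")
    # Storing the tensor directly forces pickle-based copies; converting to
    # an ndarray lets Ray's object store deserialize it zero-copy.
    batch["embeddings"] = pooled.detach().cpu().numpy()
    return batch

# Example batch; shapes are illustrative only.
batch = {"pooled_output": torch.randn(4, 768)}
print(to_numpy_embeddings(batch)["embeddings"].dtype)  # float32
```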
When embeddings are stored in Ray Data columns:
- np.array: data is stored once in shared memory, and multiple workers read it via pointers (zero-copy).
- torch.Tensor: data is copied into the pickle stream, then copied again on deserialization (2x copies).
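This difference can be sanity-checked outside Ray Data with the core object store API (a minimal sketch assuming a local Ray installation; not code from this PR):

```python
import numpy as np
import ray
import torch

ray.init()

emb_np = np.random.rand(1024, 768).astype(np.float32)
emb_pt = torch.from_numpy(emb_np.copy())

# np.ndarray: ray.get returns a read-only view backed by the shared-memory
# object store, so deserialization copies no data.
restored_np = ray.get(ray.put(emb_np))
assert not restored_np.flags.writeable  # read-only -> zero-copy view

# torch.Tensor: the tensor's data travels through the pickle stream, so each
# ray.get materializes a fresh copy (the 2x-copy cost this PR avoids).
restored_pt = ray.get(ray.put(emb_pt))

ray.shutdown()
```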
Benchmark Result

Using np.array, classification tasks with vLLMEngineStage show a ~9% throughput improvement.

Related issues
Additional information