Fix: LiteLLM Streaming Content Duplication in Tool Call Responses #3698
Conversation
Fixes google#3665
Streaming responses from LiteLLM models (Claude, GPT, etc.) were not setting finish_reason on aggregated LlmResponse objects, causing agent runners to not properly recognize completion states. This fix mirrors the finish_reason mapping logic from the non-streaming path (lines 776-784) and applies it to both streaming code paths:
- Tool call responses (lines 1340-1368)
- Text-only responses (lines 1369-1390)
Without this fix, agents using Claude or GPT via LiteLLM would encounter stop conditions that couldn't be properly handled, leading to incomplete responses or unexpected agent behavior.
Tested with Claude Sonnet 4.5 and GPT-5 via Azure OpenAI in a production multi-agent system with MCP tools.
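For context, the mapping step this fix reuses translates LiteLLM's OpenAI-style finish_reason strings into the google.genai FinishReason enum, falling back to OTHER for unknown values. A minimal sketch follows; the entries of the real _FINISH_REASON_MAPPING in lite_llm.py are not shown in this PR and may differ.

from google.genai import types

# Illustrative only: maps LiteLLM finish_reason strings to the genai enum.
_FINISH_REASON_MAPPING = {
    "stop": types.FinishReason.STOP,
    "length": types.FinishReason.MAX_TOKENS,
    "content_filter": types.FinishReason.SAFETY,
    "tool_calls": types.FinishReason.STOP,
}

finish_reason = "tool_calls"  # e.g. as reported by LiteLLM for a streamed tool-call chunk
mapped = _FINISH_REASON_MAPPING.get(str(finish_reason).lower(), types.FinishReason.OTHER)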
Fixes google#3676
MCP tool responses arrive as JSON strings but were being double-serialized by _safe_json_serialize(), creating triple-nested JSON that prevented Claude and GPT from parsing tool results. Example of the bug:
'{"content": [{"type": "text", "text": "{\n \"type\"..."}]}'
This fix adds an isinstance(str) check before serialization. If the response is already a string (from MCP or other sources), it's used directly. Otherwise, it's serialized normally.
Impact: Without this fix, agents using LiteLLM with MCP tools would successfully call tools but fail to present results to users, appearing to hang or produce incomplete responses.
Tested with Claude Sonnet 4.5 and GPT-5 via Azure OpenAI with MCP tools (Google Drive, HubSpot CRM) in a production multi-agent system.
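A minimal sketch of the guard described above. serialize_tool_response is an illustrative wrapper, not the actual ADK call site, and _safe_json_serialize is simplified here; only the isinstance(str) check mirrors the fix.

import json
from typing import Any

def _safe_json_serialize(obj: Any) -> str:
  # Simplified stand-in: serialize to JSON, fall back to str() on failure.
  try:
    return json.dumps(obj)
  except (TypeError, ValueError):
    return str(obj)

def serialize_tool_response(response: Any) -> str:
  # MCP tool responses often arrive as JSON strings already; serializing them
  # again would wrap the JSON in another layer of quoting and escaping.
  if isinstance(response, str):
    return response
  return _safe_json_serialize(response)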
…plication
- Fixes content duplication where planning/reasoning text appears twice (once during streaming, again in the aggregated tool-call message)
- Aligns with OpenAI/LiteLLM conventions for tool-call messages
- Planning text is preserved in thought_parts and already streamed to users
- Resolves semantic confusion where tool-call messages contained text content
Fixes google#3697
Summary of Changes
Hello @thesynapses, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request resolves critical issues related to content handling and API compliance in LiteLLM streaming responses, especially when tool calls are involved. It refines how planning text and tool responses are processed during streaming to ensure a clean, accurate, and API-compliant conversation history, while also improving the agent runner's ability to interpret completion states from various models.
Summary of Changes
Hello @thesynapses, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the robustness and API compliance of LiteLLM integrations within ADK, particularly for streaming workflows involving tool calls. It resolves critical issues related to content duplication, incorrect serialization of tool responses, and inconsistent finish_reason handling.
Code Review
This pull request effectively resolves the content duplication issue in LiteLLM streaming responses by setting content=None for tool-call messages, which is a good fix that aligns with API conventions. The additional changes to prevent double-serialization of tool response content and to correctly map the finish_reason for streaming responses are also valuable improvements. I've noticed a small area of code duplication in the new finish_reason handling logic and have left a comment with a suggestion to refactor it for better maintainability. Overall, the changes are well-implemented and improve the robustness of the LiteLLM integration.
src/google/adk/models/lite_llm.py
if isinstance(finish_reason, types.FinishReason):
  aggregated_llm_response_with_tool_call.finish_reason = finish_reason
else:
  finish_reason_str = str(finish_reason).lower()
  aggregated_llm_response_with_tool_call.finish_reason = _FINISH_REASON_MAPPING.get(
      finish_reason_str, types.FinishReason.OTHER
  )
src/google/adk/models/lite_llm.py
if isinstance(finish_reason, types.FinishReason):
  aggregated_llm_response.finish_reason = finish_reason
else:
  finish_reason_str = str(finish_reason).lower()
  aggregated_llm_response.finish_reason = _FINISH_REASON_MAPPING.get(
      finish_reason_str, types.FinishReason.OTHER
  )
Code Review
This pull request effectively addresses the content duplication issue in LiteLLM streaming responses by correctly setting content=None for tool-call messages. The changes are well-documented and align with the OpenAI/LiteLLM specification. Additionally, the fixes for preventing double JSON serialization and for consistently mapping finish_reason in streaming mode are valuable improvements that enhance correctness and reliability.
My only suggestion is to refactor the duplicated logic for mapping finish_reason to improve code maintainability.
src/google/adk/models/lite_llm.py
# FIX: Map finish_reason to FinishReason enum for streaming text-only responses.
# Previously, streaming responses did not set finish_reason on aggregated
# LlmResponse objects, causing the ADK agent runner to not properly recognize
# completion states. This mirrors the logic from non-streaming path (lines 776-784)
# to ensure consistent behavior across both streaming and non-streaming modes.
# Without this, Claude and other models via LiteLLM would hit stop conditions
# that the agent couldn't properly handle.
if isinstance(finish_reason, types.FinishReason):
  aggregated_llm_response.finish_reason = finish_reason
else:
  finish_reason_str = str(finish_reason).lower()
  aggregated_llm_response.finish_reason = _FINISH_REASON_MAPPING.get(
      finish_reason_str, types.FinishReason.OTHER
  )
This logic for mapping finish_reason is duplicated from lines 1374-1380. To improve maintainability and reduce code repetition, consider refactoring this logic into a private helper function.
For example, you could create a function like _set_finish_reason(response: LlmResponse, finish_reason: Any) that encapsulates this mapping and assignment.
def _set_finish_reason(response: LlmResponse, finish_reason: Any):
  """Sets the finish reason on the LlmResponse, mapping from string if necessary."""
  if isinstance(finish_reason, types.FinishReason):
    response.finish_reason = finish_reason
  else:
    finish_reason_str = str(finish_reason).lower()
    response.finish_reason = _FINISH_REASON_MAPPING.get(
        finish_reason_str, types.FinishReason.OTHER
    )

This would make the code cleaner and easier to manage if the mapping logic needs to change in the future.
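With such a helper in place, each call site becomes a one-liner; the variable names below are the ones that appear in the diffs above.

_set_finish_reason(aggregated_llm_response_with_tool_call, finish_reason)  # streaming tool-call path
_set_finish_reason(aggregated_llm_response, finish_reason)  # streaming text-only path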
…tion
Extract finish_reason mapping logic into a reusable helper function to address code duplication feedback from the Gemini Code Assist review on PR google#3698.
Changes:
- Added _set_finish_reason(response, finish_reason) helper function
- Replaced three duplicate mapping blocks with a single helper call:
  * Non-streaming path (line ~880)
  * Streaming tool-call path (line ~1387)
  * Streaming text-only path (line ~1409)
- Preserved all existing comments and behavior
- Improved maintainability: single source of truth for the mapping logic
Addresses: https://github.com/google/adk-python/pull/3698#discussion_r18xxxxx
Hi @thesynapses, thank you for your work on this pull request. We appreciate the effort you've invested.
Fix: LiteLLM Streaming Content Duplication in Tool Call Responses
Summary
Fixes content duplication in streaming responses when using LiteLLM models with ADK's planning features (e.g., PlanReActPlanner). Planning and reasoning text was appearing twice: once during streaming as individual chunks, and again in the aggregated tool-call message. Resolves #3697.
Problem
When the model generates planning/reasoning text (e.g., <PLANNING>I need to search...</PLANNING>) followed by tool calls during streaming, the aggregated tool-call message was created with content=text (line 1352). This violates OpenAI/LiteLLM conventions, where tool-call-only messages should have content=None.
Solution
Changed line 1352 to set content=None for tool-call messages:
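A minimal sketch of what the change amounts to, using a plain OpenAI-style message dict rather than the exact ADK internals (build_tool_call_message is illustrative):

def build_tool_call_message(tool_calls: list[dict]) -> dict:
  # Aggregated assistant message that only carries tool calls.
  # Per the OpenAI/LiteLLM convention, a tool-call-only message sets content
  # to None; the planning text has already been streamed to the user and is
  # kept separately in thought_parts for conversation history.
  return {
      "role": "assistant",
      "content": None,  # previously content=text, duplicating the streamed planning text
      "tool_calls": tool_calls,
  }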
Why This Works
- thought_parts: reasoning is captured separately at line 1357 for conversation history
- content=None for tool-only messages matches the OpenAI/LiteLLM convention
Impact
- Planning text is streamed once to users and preserved in thought_parts
Testing
Tested with: Claude Sonnet 4.5 and GPT-5 via Azure OpenAI in a production multi-agent system with MCP tools.
Additional Context
The thought_parts parameter preserves reasoning separately from message content.