-
Notifications
You must be signed in to change notification settings - Fork 0
Open
Labels
enhancementNew feature or requestNew feature or request
Description
Summary
Benchmark FunctionGemma vs gpt-4o-mini for tool calling in AutoGenLLMParser.
Scope
Testing the parse_trade_request FunctionTool dispatch - NOT reasoning about whether trades are good. This is structured extraction into a fixed schema.
Current code: src/parsers/autogen_llm_parser.py:75-110
FunctionTool Schema
def _parse_trade_request(
request_type: Literal["trade", "status_query"],
ticker: str,
action: Literal["review", "buy", "sell"],
quantity: Optional[int] = None,
price: Optional[float] = None,
asset_type: Literal["stock", "option"] = "stock",
timing: Optional[Literal["now", "pullback", "dip", "breakout", "limit"]] = None,
) -> dict:Test Cases (Tool Dispatch Only)
test_cases = [
# Input, Expected output fields
("buy 50 AAPL", {"action": "buy", "ticker": "AAPL", "quantity": 50}),
("sell TSLA", {"action": "sell", "ticker": "TSLA"}),
("check SPY at 600", {"action": "review", "ticker": "SPY", "price": 600}),
("any open orders?", {"request_type": "status_query"}),
("show portfolio", {"request_type": "status_query"}),
("buy MSFT on pullback", {"action": "buy", "ticker": "MSFT", "timing": "pullback"}),
]Metrics to Capture
| Metric | gpt-4o-mini | FunctionGemma |
|---|---|---|
| Tool call success rate | ? | ? |
| Field extraction accuracy | ? | ? |
| Schema compliance | ? | ? |
| Avg latency (ms) | ? | ? |
Success Criteria
- ≥95% correct tool calls (right function, valid args)
- ≥90% field accuracy (correct values extracted)
- <100ms average latency
Deliverables
- Benchmark script
tests/benchmarks/llm_parser_benchmark.py - Results documented in
docs/08_research/ - Go/no-go recommendation
Dependencies
- spike: install Ollama + FunctionGemma local inference stack #533 (Ollama infrastructure)
- feat: add LLM backend abstraction layer (OpenAI/Ollama toggle) #534 (LLM backend abstraction)
Metadata
Metadata
Assignees
Labels
enhancementNew feature or requestNew feature or request