test: benchmark AutoGenLLMParser with FunctionGemma vs gpt-4o-mini #536

@iAmGiG

Description

Summary

Benchmark FunctionGemma vs gpt-4o-mini for tool calling in AutoGenLLMParser.

Scope

This benchmark exercises the parse_trade_request FunctionTool dispatch only, NOT reasoning about whether trades are good ideas. The task is structured extraction into a fixed schema.

Current code: src/parsers/autogen_llm_parser.py:75-110

FunctionTool Schema

from typing import Literal, Optional

def _parse_trade_request(
    request_type: Literal["trade", "status_query"],
    ticker: str,
    action: Literal["review", "buy", "sell"],
    quantity: Optional[int] = None,
    price: Optional[float] = None,
    asset_type: Literal["stock", "option"] = "stock",
    timing: Optional[Literal["now", "pullback", "dip", "breakout", "limit"]] = None,
) -> dict:
    ...
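
For context, a minimal sketch of how this function might be wrapped as an AutoGen tool, assuming the autogen_core.tools.FunctionTool API; the description string is illustrative, not the one used in autogen_llm_parser.py:

from autogen_core.tools import FunctionTool

# Wrap the schema function so the model can dispatch to it as a tool call.
parse_trade_tool = FunctionTool(
    _parse_trade_request,
    description="Extract a structured trade request or status query from user text.",
)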

Test Cases (Tool Dispatch Only)

test_cases = [
    # Input, Expected output fields
    ("buy 50 AAPL", {"action": "buy", "ticker": "AAPL", "quantity": 50}),
    ("sell TSLA", {"action": "sell", "ticker": "TSLA"}),
    ("check SPY at 600", {"action": "review", "ticker": "SPY", "price": 600}),
    ("any open orders?", {"request_type": "status_query"}),
    ("show portfolio", {"request_type": "status_query"}),
    ("buy MSFT on pullback", {"action": "buy", "ticker": "MSFT", "timing": "pullback"}),
]
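
One possible shape for the harness in tests/benchmarks/llm_parser_benchmark.py, assuming a per-model call_parser(text) wrapper that returns the tool-call arguments as a dict (or None when no tool call is produced); the wrapper name and wiring are placeholders:

import time
from statistics import mean
from typing import Callable, Optional

def run_benchmark(call_parser: Callable[[str], Optional[dict]], cases) -> dict:
    """Run the test cases against one model and collect the comparison metrics."""
    latencies, tool_calls_ok, fields_ok, fields_total = [], 0, 0, 0
    for text, expected in cases:
        start = time.perf_counter()
        args = call_parser(text)  # tool-call arguments, or None if no tool call happened
        latencies.append((time.perf_counter() - start) * 1000)
        fields_total += len(expected)
        if args is not None:
            tool_calls_ok += 1
            fields_ok += sum(args.get(field) == value for field, value in expected.items())
    return {
        "tool_call_success_rate": tool_calls_ok / len(cases),
        "field_accuracy": fields_ok / fields_total,
        "avg_latency_ms": mean(latencies),
    }

Running this once per model (one call_parser wrapper around gpt-4o-mini, one around FunctionGemma) fills in the table below.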

Metrics to Capture

| Metric | gpt-4o-mini | FunctionGemma |
| --- | --- | --- |
| Tool call success rate | ? | ? |
| Field extraction accuracy | ? | ? |
| Schema compliance | ? | ? |
| Avg latency (ms) | ? | ? |
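
Schema compliance can be scored separately from field accuracy by validating the returned arguments against the Literal/type constraints in the signature above; a sketch, with the allowed values copied from that schema:

ALLOWED_VALUES = {
    "request_type": {"trade", "status_query"},
    "action": {"review", "buy", "sell"},
    "asset_type": {"stock", "option"},
    "timing": {"now", "pullback", "dip", "breakout", "limit", None},
}

def is_schema_compliant(args: dict) -> bool:
    """True if every returned argument fits the _parse_trade_request schema."""
    for key, allowed in ALLOWED_VALUES.items():
        if key in args and args[key] not in allowed:
            return False
    if not isinstance(args.get("ticker", ""), str):
        return False
    if args.get("quantity") is not None and not isinstance(args["quantity"], int):
        return False
    if args.get("price") is not None and not isinstance(args["price"], (int, float)):
        return False
    return True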

Success Criteria

  • ≥95% correct tool calls (right function, valid args)
  • ≥90% field accuracy (correct values extracted)
  • <100ms average latency
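
The go/no-go call against these thresholds can then be mechanical; a sketch, assuming the metrics dict produced by the harness above:

def meets_criteria(metrics: dict) -> bool:
    """Apply the success criteria to one model's benchmark results."""
    return (
        metrics["tool_call_success_rate"] >= 0.95
        and metrics["field_accuracy"] >= 0.90
        and metrics["avg_latency_ms"] < 100
    )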

Deliverables

  • Benchmark script tests/benchmarks/llm_parser_benchmark.py
  • Results documented in docs/08_research/
  • Go/no-go recommendation

Dependencies

Labels

enhancement (New feature or request)
