test: benchmark AutoGenLLMParser with FunctionGemma vs gpt-4o-mini #536

@iAmGiG

Description

Summary

Benchmark FunctionGemma vs gpt-4o-mini for tool calling in AutoGenLLMParser.

Scope

This benchmark exercises the parse_trade_request FunctionTool dispatch only, NOT reasoning about whether trades are good ideas. The task is structured extraction into a fixed schema.

Current code: src/parsers/autogen_llm_parser.py:75-110

FunctionTool Schema

from typing import Literal, Optional

def _parse_trade_request(
    request_type: Literal["trade", "status_query"],
    ticker: str,
    action: Literal["review", "buy", "sell"],
    quantity: Optional[int] = None,
    price: Optional[float] = None,
    asset_type: Literal["stock", "option"] = "stock",
    timing: Optional[Literal["now", "pullback", "dip", "breakout", "limit"]] = None,
) -> dict:
    ...
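
For context, a minimal sketch of how this function might be wrapped as an AutoGen tool, assuming the autogen_core.tools.FunctionTool API; the description string is illustrative, not the one used in autogen_llm_parser.py:

from autogen_core.tools import FunctionTool

# Wrap the schema function so the model can dispatch to it as a tool call.
parse_trade_tool = FunctionTool(
    _parse_trade_request,
    description="Extract a structured trade request or status query from user text.",
)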

Test Cases (Tool Dispatch Only)

test_cases = [
    # Input, Expected output fields
    ("buy 50 AAPL", {"action": "buy", "ticker": "AAPL", "quantity": 50}),
    ("sell TSLA", {"action": "sell", "ticker": "TSLA"}),
    ("check SPY at 600", {"action": "review", "ticker": "SPY", "price": 600}),
    ("any open orders?", {"request_type": "status_query"}),
    ("show portfolio", {"request_type": "status_query"}),
    ("buy MSFT on pullback", {"action": "buy", "ticker": "MSFT", "timing": "pullback"}),
]
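
One possible shape for the harness in tests/benchmarks/llm_parser_benchmark.py, assuming a per-model call_parser(text) wrapper that returns the tool-call arguments as a dict (or None when no tool call is produced); the wrapper name and wiring are placeholders:

import time
from statistics import mean
from typing import Callable, Optional

def run_benchmark(call_parser: Callable[[str], Optional[dict]], cases) -> dict:
    """Run the test cases against one model and collect the comparison metrics."""
    latencies, tool_calls_ok, fields_ok, fields_total = [], 0, 0, 0
    for text, expected in cases:
        start = time.perf_counter()
        args = call_parser(text)  # tool-call arguments, or None if no tool call happened
        latencies.append((time.perf_counter() - start) * 1000)
        fields_total += len(expected)
        if args is not None:
            tool_calls_ok += 1
            fields_ok += sum(args.get(field) == value for field, value in expected.items())
    return {
        "tool_call_success_rate": tool_calls_ok / len(cases),
        "field_accuracy": fields_ok / fields_total,
        "avg_latency_ms": mean(latencies),
    }

Running this once per model (one call_parser wrapper around gpt-4o-mini, one around FunctionGemma) fills in the table below.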

Metrics to Capture

| Metric | gpt-4o-mini | FunctionGemma |
| --- | --- | --- |
| Tool call success rate | ? | ? |
| Field extraction accuracy | ? | ? |
| Schema compliance | ? | ? |
| Avg latency (ms) | ? | ? |
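
Schema compliance can be scored separately from field accuracy by validating the returned arguments against the Literal/type constraints in the signature above; a sketch, with the allowed values copied from that schema:

ALLOWED_VALUES = {
    "request_type": {"trade", "status_query"},
    "action": {"review", "buy", "sell"},
    "asset_type": {"stock", "option"},
    "timing": {"now", "pullback", "dip", "breakout", "limit", None},
}

def is_schema_compliant(args: dict) -> bool:
    """True if every returned argument fits the _parse_trade_request schema."""
    for key, allowed in ALLOWED_VALUES.items():
        if key in args and args[key] not in allowed:
            return False
    if not isinstance(args.get("ticker", ""), str):
        return False
    if args.get("quantity") is not None and not isinstance(args["quantity"], int):
        return False
    if args.get("price") is not None and not isinstance(args["price"], (int, float)):
        return False
    return True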

Success Criteria

  • ≥95% correct tool calls (right function, valid args)
  • ≥90% field accuracy (correct values extracted)
  • <100ms average latency
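
The go/no-go call against these thresholds can then be mechanical; a sketch, assuming the metrics dict produced by the harness above:

def meets_criteria(metrics: dict) -> bool:
    """Apply the success criteria to one model's benchmark results."""
    return (
        metrics["tool_call_success_rate"] >= 0.95
        and metrics["field_accuracy"] >= 0.90
        and metrics["avg_latency_ms"] < 100
    )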

Deliverables

  • Benchmark script tests/benchmarks/llm_parser_benchmark.py
  • Results documented in docs/08_research/
  • Go/no-go recommendation

Dependencies

Labels

enhancement (New feature or request)
