Hi maintainers,
While running the repository with various `image_model` providers (including proxy vendors), I found that `base/engine/async_llm.py` assumes a single, specific response schema. In practice, different LLM/VLM providers return images in varying formats, which currently leads to parsing errors or requires provider-specific hacks in downstream code.
The current implementation fails when encountering formats such as:
- Direct URLs: `{"url": "https://..."}`
- Base64 strings: `{"b64_json": "..."}`
- Data URIs: `{"image": "data:image/png;base64,..."}`
- Nested/custom structures: some SDKs wrap outputs, e.g. `{"output": {"images": [...]}}`, or return raw binary blobs.
I propose adding a normalization layer in `base/engine/async_llm.py` that detects the incoming format automatically and unifies these diverse responses into a consistent internal representation (e.g., always converting to a standard `PIL.Image` object or a consistent Base64 format).
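To make the idea concrete, here is a minimal sketch of what such a layer could look like. The function name `normalize_image_response`, the use of `httpx` for fetching URLs, and the exact key checks are illustrative assumptions based on the formats listed above, not the engine's actual API:

```python
import base64
import io
from typing import Any

import httpx  # assumption: any HTTP client the engine already uses would work
from PIL import Image


def normalize_image_response(payload: Any) -> Image.Image:
    """Best-effort conversion of a provider image payload into a PIL Image.

    Hypothetical helper covering the formats listed in this issue:
    direct URLs, base64 strings, data URIs, nested wrappers, raw bytes.
    """
    if isinstance(payload, dict):
        # Nested wrapper: {"output": {"images": [...]}} -> recurse on first image.
        if isinstance(payload.get("output"), dict):
            images = payload["output"].get("images") or []
            if images:
                return normalize_image_response(images[0])
        # Direct URL: {"url": "https://..."} -> fetch and decode.
        if "url" in payload:
            resp = httpx.get(payload["url"], timeout=30)
            resp.raise_for_status()
            return Image.open(io.BytesIO(resp.content))
        # Raw base64: {"b64_json": "..."}
        if "b64_json" in payload:
            return Image.open(io.BytesIO(base64.b64decode(payload["b64_json"])))
        # Data URI under an "image" key -> handled by the string branch below.
        if isinstance(payload.get("image"), str):
            return normalize_image_response(payload["image"])
    if isinstance(payload, str) and payload.startswith("data:"):
        # Data URI: strip the "data:image/...;base64," prefix, decode the rest.
        _, _, encoded = payload.partition("base64,")
        return Image.open(io.BytesIO(base64.b64decode(encoded)))
    if isinstance(payload, (bytes, bytearray)):
        # Raw binary blob.
        return Image.open(io.BytesIO(bytes(payload)))
    raise ValueError(f"Unrecognized image payload format: {type(payload)!r}")
```

Downstream code would then call this once on whatever the provider returns and always receive a `PIL.Image`, keeping provider quirks out of the rest of the engine.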
This enhancement would significantly improve the flexibility of the engine and make it easier to integrate with different inference backends.
Thanks!