
design: Add 0004-multimodal-i2t proposal #674

Open
sangminwoo wants to merge 5 commits into strands-agents:main from sangminwoo:main

Conversation

@sangminwoo sangminwoo commented Mar 17, 2026

Description

Add design doc for multimodal image-to-text evaluation support in strands-evals SDK.

Introduces MultimodalOutputEvaluator, which extends OutputEvaluator to enable MLLM-as-a-Judge evaluation of multimodal tasks, starting with image/document-to-text. The evaluator constructs multimodal prompts in the strands SDK ContentBlock format and supports both reference-free and reference-based evaluation. The rubric is selected automatically for each of four dimensions: Overall Quality (P0), Correctness (P0), Faithfulness (P1), and Instruction Following (P1).
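For illustration, the prompt construction described above might look roughly like the following. This is a minimal sketch, not the code from the design doc: `build_judge_prompt` and its parameters are hypothetical names, and the plain dicts only assume the `{"text": ...}` / `{"image": {"format": ..., "source": {"bytes": ...}}}` shape of a ContentBlock.

```python
from typing import Any, Optional


def build_judge_prompt(
    image_bytes: bytes,
    image_format: str,
    instruction: str,
    actual_output: str,
    rubric: str,
    expected_output: Optional[str] = None,
) -> list[dict[str, Any]]:
    """Assemble judge-prompt content blocks (hypothetical helper).

    The dict layout is an assumption about the ContentBlock shape,
    not something taken from the proposal itself.
    """
    blocks: list[dict[str, Any]] = [
        {"text": f"Rubric:\n{rubric}"},
        {"image": {"format": image_format, "source": {"bytes": image_bytes}}},
        {"text": f"Instruction:\n{instruction}"},
        {"text": f"Response to evaluate:\n{actual_output}"},
    ]
    # Reference-based evaluation appends the expected output;
    # reference-free evaluation leaves it out.
    if expected_output is not None:
        blocks.append({"text": f"Reference answer:\n{expected_output}"})
    return blocks
```

A list built this way could then be passed to the judge in place of a plain string prompt, which is the case the `list[ContentBlock]` invocation path is meant to cover.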

Key design decisions:

  • Extends OutputEvaluator with the same Agent.__call__ invocation pattern (accepts both str and list[ContentBlock])
  • Automatic reference-based rubric selection via _select_rubric() when expected_output is provided
  • InputT=MultimodalInput (TypedDict) carries {"media": ImageData/AnyMediaData, "instruction": str} (modality-generic naming for future extensibility)
  • ImageData supports file paths, base64, data URLs, HTTP URLs (auto-fetched via urllib.request), S3 URIs (auto-fetched via boto3), bytes, and PIL Images
  • Built-in rubric templates + convenience subclasses per dimension
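The input type and the rubric and media handling listed above could be sketched as follows. Everything here is an assumption made for illustration: the rubric texts, the `RUBRICS` table, and the helpers `select_rubric` and `classify_media_source` are hypothetical and are not taken from the design doc, which names only `_select_rubric()`, `MultimodalInput`, and `ImageData`.

```python
from typing import Any, Optional, TypedDict


class MultimodalInput(TypedDict):
    """Hypothetical mirror of the proposed input type."""
    media: Any          # stand-in for ImageData / other media data
    instruction: str


# Placeholder rubric texts keyed by (dimension, has_reference).
RUBRICS: dict[tuple[str, bool], str] = {
    ("correctness", False): "Judge whether the response is correct given the image.",
    ("correctness", True): "Judge whether the response agrees with the reference answer.",
}


def select_rubric(dimension: str, expected_output: Optional[str]) -> str:
    """Pick the reference-based rubric when an expected output is provided."""
    return RUBRICS[(dimension, expected_output is not None)]


def classify_media_source(media: Any) -> str:
    """Roughly classify a media value by the source kinds listed above."""
    if isinstance(media, (bytes, bytearray)):
        return "bytes"
    if not isinstance(media, str):
        return "object"  # e.g. a PIL Image instance
    if media.startswith("data:"):
        return "data_url"
    if media.startswith(("http://", "https://")):
        return "http_url"
    if media.startswith("s3://"):
        return "s3_uri"
    # File paths and bare base64 strings cannot be told apart by prefix
    # alone; a real implementation needs an extra check here.
    return "path_or_base64"
```

For example, `classify_media_source("s3://bucket/key.png")` returns `"s3_uri"`, which is the case where the proposal would fetch the object via boto3.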

Related Issues

Type of Change

  • New content

Checklist

  • I have read the CONTRIBUTING document
  • My changes follow the project's documentation style
  • I have tested the documentation locally using npm run dev
  • Links in the documentation are valid and working

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@sangminwoo sangminwoo marked this pull request as draft March 18, 2026 00:08
@sangminwoo sangminwoo marked this pull request as ready for review March 18, 2026 00:08
afarntrog previously approved these changes Apr 1, 2026
@sangminwoo (Author)

Hi @afarntrog, this PR is ready for final review. I've updated the design doc to reflect your comments: (1) added support for remote URIs/URLs and (2) broadened the multimodality class to accommodate future media types. I'd appreciate an approval when you get a chance so we can get this merged.

