
Conversation

@PeterWangyi
Collaborator

Feature:

  • Support LLM matching for MCQ and VQA items
  • Enable EASI-related benchmarks to use LLM matching by specifying judge_kwargs['model']
  • Support rule-based NA matching of English number words up to twenty

Refactor:

  • Refactor regex answer parsing and improve comments

@PeterWangyi PeterWangyi requested a review from yl-1993 December 3, 2025 12:48
@ttxskk ttxskk requested a review from Copilot December 3, 2025 13:16
Copilot finished reviewing on behalf of ttxskk December 3, 2025 13:18

Copilot AI left a comment


Pull request overview

This PR introduces LLM-based answer matching for MCQ and VQA tasks and enhances rule-based matching to support English number words (zero through twenty). The implementation provides a flexible scoring system that can automatically switch between rule-based and LLM-based evaluation based on configuration parameters.

Key Changes:

  • Added LLM-based extraction and grading with comprehensive prompt engineering for both MCQ and VQA tasks
  • Extended rule-based NA matching to recognize English number words (0-20) using num2words library
  • Refactored answer extraction logic with improved regex patterns and detailed inline documentation
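For illustration, the English word-to-number normalization could look like the sketch below. This is a hedged guess at what normalize_number_words in matching_func.py does: the PR reportedly builds its map via the num2words library, while this self-contained version hardcodes the same zero-to-twenty mapping so the snippet runs without extra dependencies.

```python
import re

# Hypothetical sketch of normalize_number_words (matching_func.py).
# The PR derives this map with num2words; it is hardcoded here for
# self-containedness.
_WORD_TO_NUM = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "ten": "10", "eleven": "11", "twelve": "12", "thirteen": "13",
    "fourteen": "14", "fifteen": "15", "sixteen": "16", "seventeen": "17",
    "eighteen": "18", "nineteen": "19", "twenty": "20",
}
# \b guards keep "one" from matching inside "someone", and let the engine
# backtrack from "seven" to "seventeen" when needed.
_WORD_RE = re.compile(r"\b(" + "|".join(_WORD_TO_NUM) + r")\b", re.IGNORECASE)

def normalize_number_words(text: str) -> str:
    """Replace standalone English number words (zero..twenty) with digits."""
    return _WORD_RE.sub(lambda m: _WORD_TO_NUM[m.group(1).lower()], text)
```

With something like this in place, a rule-based NA matcher can compare the normalized prediction against the numeric ground truth directly.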

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 11 comments.

File summary:

  • vsibench.py: Updated to use factory functions for dynamic scoring method selection
  • viewspatialbench.py: Integrated build_mcq_score_fn for flexible MCQ scoring
  • tools/utils.py: New utility module for choice extraction from various data formats
  • matching_func.py: Enhanced regex patterns, added normalize_number_words for EN word-to-number conversion, improved documentation
  • llm_extract.py: New module implementing LLM-based answer extraction with detailed grading prompt
  • cal_scores.py: Added LLM scoring functions, factory methods (build_mcq_score_fn, build_na_score_fn), and scoring strategy selection logic
  • starebench.py: Integrated build_mcq_score_fn for configurable scoring
  • spatialvizbench.py: Updated to use factory-based scoring selection
  • sparbench.py: Added support for both MCQ and NA LLM-based scoring
  • sitebench.py: Integrated build_mcq_score_fn; minor typo in a comment
  • omnispatialbench.py: Updated to use factory-based scoring selection
  • mmsibench.py: Integrated build_mcq_score_fn for flexible MCQ scoring
  • mindcubebench.py: Updated to use factory-based scoring selection
  • embspatialbench.py: Integrated build_mcq_score_fn for configurable scoring
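The factory-based scoring selection these files adopt could be sketched roughly as below. The real build_mcq_score_fn in cal_scores.py is not shown in this thread, so the signature, the rule-based fallback, and the judge_mode/judge_model attributes here are assumptions inferred from the surrounding discussion.

```python
def build_mcq_score_fn(judge_kwargs=None):
    """Hypothetical factory: return an LLM-based scorer when a judge model
    is configured via judge_kwargs['model'], otherwise a rule-based one."""
    judge_kwargs = judge_kwargs or {}
    judge_model = judge_kwargs.get("model")

    if judge_model:
        def score_fn(prediction, answer):
            # Placeholder: a real implementation would prompt judge_model
            # to extract and grade the predicted choice.
            raise NotImplementedError(f"LLM judging via {judge_model}")
        score_fn.judge_mode = "llm"
        score_fn.judge_model = judge_model
    else:
        def score_fn(prediction, answer):
            # Rule-based fallback: exact match on the extracted choice letter.
            return float(prediction.strip().upper() == answer.strip().upper())
        score_fn.judge_mode = "rule"
        score_fn.judge_model = None

    return score_fn
```

Downstream code can then read judge_mode and judge_model off the returned function when naming output files.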


continue
cleaned_lines.append(ln)
cleaned = "\n".join(cleaned_lines)


Copilot AI Dec 3, 2025


The type hint dict[str, str] on line 193 uses Python 3.9+ syntax. Consider using Dict[str, str] from typing for better backward compatibility, as the file already imports Dict from typing on line 9.
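To illustrate the compatibility point with a hypothetical helper (not code from the PR): `dict[str, str]` as a runtime annotation raises at import time on Python 3.8 and earlier, while `typing.Dict` works there as well.

```python
from typing import Dict

def parse_judge_kwargs(raw: Dict[str, str]) -> Dict[str, str]:
    # Backward-compatible spelling; on Python 3.9+ the builtin-generic
    # form `dict[str, str]` (PEP 585) would be equivalent.
    return {k.strip(): v.strip() for k, v in raw.items()}
```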

rf'[\(\[\{{(【]?\s*' # optional left bracket + spaces
rf'([{letters}])\s*' # one letter from `letters`
rf'[\)\]\}})】]?' # optional right bracket
rf'(?![A-Za-z])' # right side not a letter

Copilot AI Dec 3, 2025


[nitpick] Inconsistent use of raw string literals. Lines 86-91 and 94-99 use raw f-strings (rf'...') for regex patterns, which is appropriate. However, line 125 uses rf'(?<![A-Za-z])' with rf but the pattern doesn't actually need the f prefix since there are no format variables. For consistency, all pure regex patterns without format variables (like lines 86, 94, 125, 196, 200) should use r'...' instead of rf'...'.

Suggested change
rf'(?![A-Za-z])' # right side not a letter
r'(?![A-Za-z])' # right side not a letter


def evaluate(self, eval_file, **judge_kwargs):
from .utils.spatial_bench.cal_scores import compute_mcq_score, eval_mcq_core
from .utils.spatial_bench.cal_scores import eval_mcq_core, build_mcq_score_fn
Collaborator


eval_mcq_core or eval_mcq_score?

Collaborator Author


fixed

Comment on lines 105 to 108
if judge_mode == 'llm':
judge_tag = f"llm_{judge_model}" if judge_model else "llm_matching"
else:
judge_tag = "extract_matching"
Collaborator


4 lines to 3 lines

judge_tag = "extract_matching"
if judge_mode == 'llm':
    judge_tag = f"llm_{judge_model}" if judge_model else "llm_matching"      

Collaborator


BTW, I think we may unify the use of 'x' / "x".

Collaborator Author


fixed; the quotation mark issue will be refactored in the future.

Comment on lines +167 to +176
# Decide Excel filename according to actual judge type
judge_mode = getattr(score_fn, 'judge_mode', 'rule')
judge_model = getattr(score_fn, 'judge_model', None)

if judge_mode == 'llm':
judge_tag = f"llm_{judge_model}" if judge_model else "llm_matching"
else:
judge_tag = "extract_matching"

xlsx_path = f"{base_no_suffix}_{judge_tag}.xlsx"
Collaborator


Why does the logic for generating xlsx_path appear everywhere, i.e., in each benchmark as well as in the common function?

Collaborator Author


Makes sense; we can extract the general logic here.
Currently, non-MCQ benches need to be specified manually, while MCQ benches use the generic eval MCQ score function, which is why the logic appears in multiple places.
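One way to extract the shared logic (a sketch with hypothetical helper names, not code from the PR, reusing the attributes shown in the quoted diff):

```python
def build_judge_tag(score_fn) -> str:
    # Consolidates the duplicated tag-naming logic in one place.
    judge_mode = getattr(score_fn, "judge_mode", "rule")
    judge_model = getattr(score_fn, "judge_model", None)
    if judge_mode == "llm":
        return f"llm_{judge_model}" if judge_model else "llm_matching"
    return "extract_matching"

def build_xlsx_path(base_no_suffix: str, score_fn) -> str:
    # Decide the Excel filename according to the actual judge type.
    return f"{base_no_suffix}_{build_judge_tag(score_fn)}.xlsx"
```

Each benchmark (and the common function) could then call build_xlsx_path instead of repeating the branching.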

r'(?:\s*[\.。:：\)\]】、])?' # Optional trailing punctuation, e.g. A. / A: / A) etc.
r'.*?'
r'<\s*/\s*answer\s*>',
flags=re.IGNORECASE | re.DOTALL,
Collaborator


Really appreciate hard work in crafting such complex regex! In future PRs, we may add some unit tests to ensure all behaviors function as expected.
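A sketch of what such unit tests might look like, written against a simplified stand-in for the `<answer>...</answer>` pattern quoted above (the PR's actual pattern uses wider punctuation and letter classes, so names and details here are assumptions):

```python
import re
import unittest

# Simplified stand-in for the answer-tag pattern discussed in this thread.
ANSWER_RE = re.compile(
    r'<\s*answer\s*>\s*'
    r'([A-Z])'                 # the choice letter
    r'(?:\s*[\.:\)\]、])?'     # optional trailing punctuation, e.g. A. / A: / A)
    r'.*?'
    r'<\s*/\s*answer\s*>',
    flags=re.IGNORECASE | re.DOTALL,
)

def extract_choice(text):
    m = ANSWER_RE.search(text)
    return m.group(1).upper() if m else None

class TestAnswerExtraction(unittest.TestCase):
    def test_plain_letter(self):
        self.assertEqual(extract_choice("<answer>B</answer>"), "B")

    def test_punctuated_letter_and_rationale(self):
        self.assertEqual(extract_choice("< answer > c. because ... </answer>"), "C")

    def test_missing_tag(self):
        self.assertIsNone(extract_choice("final answer: B"))
```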
