
Conversation

@PeterWangyi
Collaborator

Feature:

  • Support LLM matching for MCQ and VQA items
  • Enable EASI-related benchmarks to use LLM matching by specifying judge_kwargs['model']
  • Support rule-based NA matching of English number words up to twenty

Refactor:

  • Refactor regex answer parsing and improve comments

@PeterWangyi PeterWangyi requested a review from yl-1993 December 3, 2025 12:48
@ttxskk ttxskk requested a review from Copilot December 3, 2025 13:16
Copilot finished reviewing on behalf of ttxskk December 3, 2025 13:18

Copilot AI left a comment


Pull request overview

This PR introduces LLM-based answer matching for MCQ and VQA tasks and enhances rule-based matching to support English number words (zero through twenty). The implementation provides a flexible scoring system that can automatically switch between rule-based and LLM-based evaluation based on configuration parameters.

Key Changes:

  • Added LLM-based extraction and grading with comprehensive prompt engineering for both MCQ and VQA tasks
  • Extended rule-based NA matching to recognize English number words (0-20) using num2words library
  • Refactored answer extraction logic with improved regex patterns and detailed inline documentation
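For illustration, the English word-to-number normalization could look like the sketch below. This is a hedged guess at what normalize_number_words in matching_func.py does: the PR reportedly builds its map via the num2words library, while this self-contained version hardcodes the same zero-to-twenty mapping so the snippet runs without extra dependencies.

```python
import re

# Hypothetical sketch of normalize_number_words (matching_func.py).
# The PR derives this map with num2words; it is hardcoded here for
# self-containedness.
_WORD_TO_NUM = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "ten": "10", "eleven": "11", "twelve": "12", "thirteen": "13",
    "fourteen": "14", "fifteen": "15", "sixteen": "16", "seventeen": "17",
    "eighteen": "18", "nineteen": "19", "twenty": "20",
}
# \b guards keep "one" from matching inside "someone", and let the engine
# backtrack from "seven" to "seventeen" when needed.
_WORD_RE = re.compile(r"\b(" + "|".join(_WORD_TO_NUM) + r")\b", re.IGNORECASE)

def normalize_number_words(text: str) -> str:
    """Replace standalone English number words (zero..twenty) with digits."""
    return _WORD_RE.sub(lambda m: _WORD_TO_NUM[m.group(1).lower()], text)
```

With something like this in place, a rule-based NA matcher can compare the normalized prediction against the numeric ground truth directly.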

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 11 comments.

File summary:

  • vsibench.py: Updated to use factory functions for dynamic scoring method selection
  • viewspatialbench.py: Integrated build_mcq_score_fn for flexible MCQ scoring
  • tools/utils.py: New utility module for choice extraction from various data formats
  • matching_func.py: Enhanced regex patterns, added normalize_number_words for EN word-to-number conversion, improved documentation
  • llm_extract.py: New module implementing LLM-based answer extraction with detailed grading prompt
  • cal_scores.py: Added LLM scoring functions, factory methods (build_mcq_score_fn, build_na_score_fn), and scoring strategy selection logic
  • starebench.py: Integrated build_mcq_score_fn for configurable scoring
  • spatialvizbench.py: Updated to use factory-based scoring selection
  • sparbench.py: Added support for both MCQ and NA LLM-based scoring
  • sitebench.py: Integrated build_mcq_score_fn; minor typo in a comment
  • omnispatialbench.py: Updated to use factory-based scoring selection
  • mmsibench.py: Integrated build_mcq_score_fn for flexible MCQ scoring
  • mindcubebench.py: Updated to use factory-based scoring selection
  • embspatialbench.py: Integrated build_mcq_score_fn for configurable scoring
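The factory-based scoring selection these files adopt could be sketched roughly as below. The real build_mcq_score_fn in cal_scores.py is not shown in this thread, so the signature, the rule-based fallback, and the judge_mode/judge_model attributes here are assumptions inferred from the surrounding discussion.

```python
def build_mcq_score_fn(judge_kwargs=None):
    """Hypothetical factory: return an LLM-based scorer when a judge model
    is configured via judge_kwargs['model'], otherwise a rule-based one."""
    judge_kwargs = judge_kwargs or {}
    judge_model = judge_kwargs.get("model")

    if judge_model:
        def score_fn(prediction, answer):
            # Placeholder: a real implementation would prompt judge_model
            # to extract and grade the predicted choice.
            raise NotImplementedError(f"LLM judging via {judge_model}")
        score_fn.judge_mode = "llm"
        score_fn.judge_model = judge_model
    else:
        def score_fn(prediction, answer):
            # Rule-based fallback: exact match on the extracted choice letter.
            return float(prediction.strip().upper() == answer.strip().upper())
        score_fn.judge_mode = "rule"
        score_fn.judge_model = None

    return score_fn
```

Downstream code can then read judge_mode and judge_model off the returned function when naming output files.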


continue
cleaned_lines.append(ln)
cleaned = "\n".join(cleaned_lines)


Copilot AI Dec 3, 2025


The type hint dict[str, str] on line 193 uses Python 3.9+ syntax. Consider using Dict[str, str] from typing for better backward compatibility, as the file already imports Dict from typing on line 9.
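To illustrate the compatibility point with a hypothetical helper (not code from the PR): `dict[str, str]` as a runtime annotation raises at import time on Python 3.8 and earlier, while `typing.Dict` works there as well.

```python
from typing import Dict

def parse_judge_kwargs(raw: Dict[str, str]) -> Dict[str, str]:
    # Backward-compatible spelling; on Python 3.9+ the builtin-generic
    # form `dict[str, str]` (PEP 585) would be equivalent.
    return {k.strip(): v.strip() for k, v in raw.items()}
```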

rf'[\(\[\{{(【]?\s*' # optional left bracket + spaces
rf'([{letters}])\s*' # one letter from `letters`
rf'[\)\]\}})】]?' # optional right bracket
rf'(?![A-Za-z])' # right side not a letter

Copilot AI Dec 3, 2025


[nitpick] Inconsistent use of raw string literals. Lines 86-91 and 94-99 use raw f-strings (rf'...') for regex patterns, which is appropriate. However, line 125 uses rf'(?<![A-Za-z])' with rf but the pattern doesn't actually need the f prefix since there are no format variables. For consistency, all pure regex patterns without format variables (like lines 86, 94, 125, 196, 200) should use r'...' instead of rf'...'.

Suggested change
rf'(?![A-Za-z])' # right side not a letter
r'(?![A-Za-z])' # right side not a letter


def evaluate(self, eval_file, **judge_kwargs):
from .utils.spatial_bench.cal_scores import compute_mcq_score, eval_mcq_core
from .utils.spatial_bench.cal_scores import eval_mcq_core, build_mcq_score_fn
Collaborator


eval_mcq_core or eval_mcq_score?

Collaborator Author


fixed

Comment on lines 105 to 108
if judge_mode == 'llm':
judge_tag = f"llm_{judge_model}" if judge_model else "llm_matching"
else:
judge_tag = "extract_matching"
Collaborator


4 lines to 3 lines

judge_tag = "extract_matching"
if judge_mode == 'llm':
    judge_tag = f"llm_{judge_model}" if judge_model else "llm_matching"      

Collaborator


BTW, I think we may unify the use of 'x' / "x".

Collaborator Author


fixed; the quotation mark issue will be refactored in the future.

Comment on lines +167 to +176
# Decide Excel filename according to actual judge type
judge_mode = getattr(score_fn, 'judge_mode', 'rule')
judge_model = getattr(score_fn, 'judge_model', None)

if judge_mode == 'llm':
judge_tag = f"llm_{judge_model}" if judge_model else "llm_matching"
else:
judge_tag = "extract_matching"

xlsx_path = f"{base_no_suffix}_{judge_tag}.xlsx"
Collaborator


Why does the logic for generating xlsx_path appear everywhere, i.e., in each benchmark as well as in the common function?

Collaborator Author


Makes sense; we can extract the general logic here.
Currently, non-MCQ benches need to be specified manually, while MCQ benches use the generic eval MCQ score function, which is why the logic appears in multiple places.
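One way to extract the shared logic (a sketch with hypothetical helper names, not code from the PR, reusing the attributes shown in the quoted diff):

```python
def build_judge_tag(score_fn) -> str:
    # Consolidates the duplicated tag-naming logic in one place.
    judge_mode = getattr(score_fn, "judge_mode", "rule")
    judge_model = getattr(score_fn, "judge_model", None)
    if judge_mode == "llm":
        return f"llm_{judge_model}" if judge_model else "llm_matching"
    return "extract_matching"

def build_xlsx_path(base_no_suffix: str, score_fn) -> str:
    # Decide the Excel filename according to the actual judge type.
    return f"{base_no_suffix}_{build_judge_tag(score_fn)}.xlsx"
```

Each benchmark (and the common function) could then call build_xlsx_path instead of repeating the branching.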

r'(?:\s*[\.。:：\)\]】、])?' # Optional trailing punctuation, e.g. A. / A: / A) etc.
r'.*?'
r'<\s*/\s*answer\s*>',
flags=re.IGNORECASE | re.DOTALL,
Collaborator


Really appreciate hard work in crafting such complex regex! In future PRs, we may add some unit tests to ensure all behaviors function as expected.
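A sketch of what such unit tests might look like, written against a simplified stand-in for the `<answer>...</answer>` pattern quoted above (the PR's actual pattern uses wider punctuation and letter classes, so names and details here are assumptions):

```python
import re
import unittest

# Simplified stand-in for the answer-tag pattern discussed in this thread.
ANSWER_RE = re.compile(
    r'<\s*answer\s*>\s*'
    r'([A-Z])'                 # the choice letter
    r'(?:\s*[\.:\)\]、])?'     # optional trailing punctuation, e.g. A. / A: / A)
    r'.*?'
    r'<\s*/\s*answer\s*>',
    flags=re.IGNORECASE | re.DOTALL,
)

def extract_choice(text):
    m = ANSWER_RE.search(text)
    return m.group(1).upper() if m else None

class TestAnswerExtraction(unittest.TestCase):
    def test_plain_letter(self):
        self.assertEqual(extract_choice("<answer>B</answer>"), "B")

    def test_punctuated_letter_and_rationale(self):
        self.assertEqual(extract_choice("< answer > c. because ... </answer>"), "C")

    def test_missing_tag(self):
        self.assertIsNone(extract_choice("final answer: B"))
```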
