[Feature] Support llm matching && upgrade rule-based matching #22
base: EASI
Conversation
Pull request overview
This PR introduces LLM-based answer matching for MCQ and VQA tasks and enhances rule-based matching to support English number words (zero through twenty). The implementation provides a flexible scoring system that can automatically switch between rule-based and LLM-based evaluation based on configuration parameters.
Key Changes:
- Added LLM-based extraction and grading with comprehensive prompt engineering for both MCQ and VQA tasks
- Extended rule-based NA matching to recognize English number words (0-20) using the `num2words` library (see the sketch after this list)
- Refactored answer extraction logic with improved regex patterns and detailed inline documentation
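
A minimal sketch of how the number-word normalization could look, assuming `normalize_number_words` builds its word list from `num2words`; the regex details and the `_WORD_TO_NUM` helper are illustrative, not the PR's exact code:

```python
import re
from num2words import num2words

# Map English number words to digit strings, e.g. "zero" -> "0", ..., "twenty" -> "20".
_WORD_TO_NUM = {num2words(i): str(i) for i in range(21)}

def normalize_number_words(text: str) -> str:
    """Replace standalone English number words (zero..twenty) with digits."""
    # Longest-first alternation so e.g. "seventeen" is not shadowed by "seven".
    pattern = re.compile(
        r"\b(" + "|".join(sorted(_WORD_TO_NUM, key=len, reverse=True)) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda m: _WORD_TO_NUM[m.group(0).lower()], text)

# normalize_number_words("there are seventeen chairs")  ->  "there are 17 chairs"
```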
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| `vsibench.py` | Updated to use factory functions for dynamic scoring-method selection |
| `viewspatialbench.py` | Integrated `build_mcq_score_fn` for flexible MCQ scoring |
| `tools/utils.py` | New utility module for choice extraction from various data formats |
| `matching_func.py` | Enhanced regex patterns, added `normalize_number_words` for English word-to-number conversion, improved documentation |
| `llm_extract.py` | New module implementing LLM-based answer extraction with a detailed grading prompt |
| `cal_scores.py` | Added LLM scoring functions, factory methods (`build_mcq_score_fn`, `build_na_score_fn`), and scoring-strategy selection logic |
| `starebench.py` | Integrated `build_mcq_score_fn` for configurable scoring |
| `spatialvizbench.py` | Updated to use factory-based scoring selection |
| `sparbench.py` | Added support for both MCQ and NA LLM-based scoring |
| `sitebench.py` | Integrated `build_mcq_score_fn`; has a minor typo in a comment |
| `omnispatialbench.py` | Updated to use factory-based scoring selection |
| `mmsibench.py` | Integrated `build_mcq_score_fn` for flexible MCQ scoring |
| `mindcubebench.py` | Updated to use factory-based scoring selection |
| `embspatialbench.py` | Integrated `build_mcq_score_fn` for configurable scoring |
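
Based on the descriptions above, the factory pattern in `cal_scores.py` presumably looks something like the following sketch; the signature and bodies are assumptions inferred from the table, not the actual implementation:

```python
from typing import Callable, Optional

def build_mcq_score_fn(judge_mode: str = "rule",
                       judge_model: Optional[str] = None) -> Callable[[str, str], float]:
    """Return an MCQ scoring callable, dispatching on judge_mode."""
    if judge_mode == "llm":
        def score_fn(prediction: str, answer: str) -> float:
            # Delegate extraction and grading to the LLM judge (llm_extract.py).
            ...
    else:
        def score_fn(prediction: str, answer: str) -> float:
            # Fall back to rule-based letter/number matching (matching_func.py).
            ...
    # Tag the callable so callers can derive output filenames from the judge type
    # (see the judge_mode / judge_model attributes read further down this review).
    score_fn.judge_mode = judge_mode
    score_fn.judge_model = judge_model
    return score_fn
```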
```python
continue
cleaned_lines.append(ln)
cleaned = "\n".join(cleaned_lines)
```
**Copilot AI** commented on Dec 3, 2025:
The type hint dict[str, str] on line 193 uses Python 3.9+ syntax. Consider using Dict[str, str] from typing for better backward compatibility, as the file already imports Dict from typing on line 9.
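
For illustration, the backward-compatible spelling the comment asks for (the function name here is made up, not from the PR):

```python
from typing import Dict

# typing.Dict works on Python 3.8, whereas the builtin generic
# dict[str, str] requires Python 3.9+ (PEP 585).
def strip_choice_values(choices: Dict[str, str]) -> Dict[str, str]:
    return {k: v.strip() for k, v in choices.items()}
```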
```python
rf'[\(\[\{{（【]?\s*'   # optional left bracket + spaces
rf'([{letters}])\s*'    # one letter from `letters`
rf'[\)\]\}}）】]?'      # optional right bracket
rf'(?![A-Za-z])'        # right side not a letter
```
**Copilot AI** commented on Dec 3, 2025:
[nitpick] Inconsistent use of raw string literals. Lines 86-91 and 94-99 use raw f-strings (rf'...') for regex patterns, which is appropriate. However, line 125 uses rf'(?<![A-Za-z])' with rf but the pattern doesn't actually need the f prefix since there are no format variables. For consistency, all pure regex patterns without format variables (like lines 86, 94, 125, 196, 200) should use r'...' instead of rf'...'.
```diff
- rf'(?![A-Za-z])'  # right side not a letter
+ r'(?![A-Za-z])'   # right side not a letter
```
**vlmeval/dataset/embspatialbench.py** (Outdated)

```diff
  def evaluate(self, eval_file, **judge_kwargs):
-     from .utils.spatial_bench.cal_scores import compute_mcq_score, eval_mcq_core
+     from .utils.spatial_bench.cal_scores import eval_mcq_core, build_mcq_score_fn
```
eval_mcq_core or eval_mcq_score?
fixed
**vlmeval/dataset/sitebench.py** (Outdated)
```python
if judge_mode == 'llm':
    judge_tag = f"llm_{judge_model}" if judge_model else "llm_matching"
else:
    judge_tag = "extract_matching"
```
4 lines can be reduced to 3:

```python
judge_tag = "extract_matching"
if judge_mode == 'llm':
    judge_tag = f"llm_{judge_model}" if judge_model else "llm_matching"
```
BTW, I think we may unify the use of 'x' / "x".
Fixed; the quotation-mark issue will be refactored in the future.
```python
# Decide Excel filename according to actual judge type
judge_mode = getattr(score_fn, 'judge_mode', 'rule')
judge_model = getattr(score_fn, 'judge_model', None)

if judge_mode == 'llm':
    judge_tag = f"llm_{judge_model}" if judge_model else "llm_matching"
else:
    judge_tag = "extract_matching"

xlsx_path = f"{base_no_suffix}_{judge_tag}.xlsx"
```
Why does the xlsx_path generation logic appear everywhere, i.e., in each benchmark and also in the common function?
Makes sense, we can extract the general logic here.
Currently, non-MCQ benches need to be specified manually, while MCQ benches go through the generic eval-MCQ-score function, which is why it appears to be everywhere.
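
One possible shape for that extraction, based directly on the snippet above (the helper name is hypothetical):

```python
def build_judge_tag(score_fn) -> str:
    """Derive the Excel filename tag from the scoring function's judge type."""
    judge_mode = getattr(score_fn, 'judge_mode', 'rule')
    judge_model = getattr(score_fn, 'judge_model', None)
    if judge_mode == 'llm':
        return f"llm_{judge_model}" if judge_model else "llm_matching"
    return "extract_matching"

# Each benchmark then reduces to a single line:
# xlsx_path = f"{base_no_suffix}_{build_judge_tag(score_fn)}.xlsx"
```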
```python
r'(?:\s*[\.．:：\)\]】、])?'  # Optional trailing punctuation, e.g. A. / A: / A) etc.
r'.*?'
r'<\s*/\s*answer\s*>',
flags=re.IGNORECASE | re.DOTALL,
```
Really appreciate the hard work in crafting such complex regexes! In future PRs, we may add some unit tests to ensure all behaviors function as expected.
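
For example, a parametrized test along these lines; the `ANSWER_RE` pattern and helper below are simplified stand-ins for the real pattern in `matching_func.py`, not the PR's code:

```python
import re
import pytest

ANSWER_RE = re.compile(
    r'<\s*answer\s*>\s*'
    r'([A-Z])'                 # the option letter
    r'(?:\s*[\.:\)\]】、])?'   # optional trailing punctuation: A. / A: / A)
    r'.*?'
    r'<\s*/\s*answer\s*>',
    flags=re.IGNORECASE | re.DOTALL,
)

def extract_answer_letter(text):
    m = ANSWER_RE.search(text)
    return m.group(1).upper() if m else None

@pytest.mark.parametrize("raw,expected", [
    ("<answer>A</answer>", "A"),
    ("<answer> b. some rationale </answer>", "B"),  # punctuation and prose tolerated
    ("<ANSWER>C)</ANSWER>", "C"),                   # case-insensitive tags
    ("no answer tags at all", None),                # graceful failure
])
def test_extract_answer_letter(raw, expected):
    assert extract_answer_letter(raw) == expected
```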