Onboarding SimpleQA #184
Conversation
```python
from .transform import DFTransformBase


@dataclass
class SimpleQA_MetadataExplode(DFTransformBase):
```
nit: Would it be good to document the expected format of the metadata column? Just so it's easier for someone who looks at this file to know what it should look like, without having to look at other parts of the code.
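For example, a minimal docstring sketch (the example keys shown are an assumption, not the actual format):

```python
from dataclasses import dataclass

from .transform import DFTransformBase


@dataclass
class SimpleQA_MetadataExplode(DFTransformBase):
    """Explode the metadata column into one column per key.

    The metadata column is expected to hold one dict per row (or the dict's
    string repr, which transform() parses via ast.literal_eval), e.g.
    {"topic": "Science", "answer_type": "Date"}.  # example keys are an assumption
    """

    metadata_column: str
```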
```python
    metadata_column: str

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        df[self.metadata_column] = df[self.metadata_column].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
```
If `x` is an invalid Python literal, `ast.literal_eval()` will raise an exception. Do we want to handle this gracefully here, or is it okay to let the program crash in this case?
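If we do want graceful handling, one possible sketch (the helper name is hypothetical):

```python
import ast


def _safe_literal_eval(x):
    """Hypothetical helper: return None instead of raising on bad literals."""
    if not isinstance(x, str):
        return x
    try:
        return ast.literal_eval(x)
    except (ValueError, SyntaxError):
        # Invalid Python literal; treat as missing metadata rather than crash.
        return None
```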
```python
        return df

    def explode_metadata(self, df):
        # TODO this would break if the first row does not have all the metrics, e.g. due to invalid inference results
```
If the first row has a None, there will be an exception since None does not have a keys() attribute. But it seems like this case should not happen; flagging just in case.
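If we did want to guard against it, a small sketch (helper name hypothetical):

```python
def _first_metadata_keys(metadata_values):
    """Hypothetical helper: find the first row that is actually a dict,
    instead of assuming the first row has all the metric keys."""
    for value in metadata_values:
        if isinstance(value, dict):
            return list(value.keys())
    return []
```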
```python
def __init__(self, is_correct_column_name, is_incorrect_column_name, is_not_attempted_column_name, output_dir, group_by=None, **kwargs):
    """
    args:
        - is_correct_column_name (str): The name of the column containing the correct values.
```
Is this column a boolean indicating whether the response was correct, or a count of correct responses? It may be good to clarify this in the docstring.
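For instance, assuming it is a boolean flag, the docstring entry could read:

```python
"""
args:
    - is_correct_column_name (str): Name of the column holding a boolean
      flag that is True when the response was graded CORRECT.
      (Assumes a boolean; reword if it is actually a count.)
"""
```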
```python
def process_row(self, row):
    grading_response = row["model_output"]
    if grading_response is None or str(grading_response) == "nan":
```
nit: An alternative to `str(grading_response) == "nan"` could be `pd.isna(grading_response)`. I think this would usually be safer, but up to you.
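For reference, a quick comparison of the two checks:

```python
import pandas as pd

print(pd.isna(None))          # True
print(pd.isna(float("nan")))  # True
print(pd.isna("nan"))         # False: a literal "nan" string is real data,
                              # whereas str(x) == "nan" would flag it as missing
```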
```
@@ -0,0 +1,78 @@
Your job is to look at a question, a gold target, and a predicted answer, and then assign a grade of either ["CORRECT", "INCORRECT", "NOT_ATTEMPTED"].
```
nit: This instruction seems slightly misaligned with the final one:

```
Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
C: NOT_ATTEMPTED

Just return the letters "A", "B", or "C", with no text around it.
```
```
- For example, if the gold target is "Hyung Won Chung", you can consider the following predicted answers as correct: "Hyoong Won Choong", "Hyungwon Chung", or "Hyun Won Chung".

Here is a new example. Simply reply with either CORRECT, INCORRECT, NOT ATTEMPTED. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
```
Same as above: the model should output A, B, or C, but this instruction tells the model to output CORRECT, INCORRECT, NOT ATTEMPTED.
| "path": "lighteval/SimpleQA", | ||
| "split": "test", | ||
| "transform": SequenceTransform([ | ||
| SamplerTransform(sample_count=100, random_seed=42), |
Should we remove the sampling before submitting?
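i.e., something along these lines (sketch; the surrounding transforms are elided):

```python
"transform": SequenceTransform([
    # SamplerTransform(sample_count=100, random_seed=42),  # debug-only sampling; drop before merging
    ...
]),
```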
| "SimpleQA_Metric_grade": "SimpleQA_Metric_grade_onerun", | ||
| } | ||
| ), | ||
| AddColumn("SimpleQA_Metric_grade"), |
Just for my understanding, why are these two transforms needed? Couldn't we just use `SimpleQA_Metric_grade`?
```python
),
AddColumn("SimpleQA_Metric_grade"),
MajorityVoteTransform(model_output_col="SimpleQA_Metric_grade_onerun", model_label_column="SimpleQA_Metric_is_correct"),
RunPythonTransform("df = df.rename_axis(index={'data_point_id': 'data_point_id_idx'})"),
```
Also just so I can understand: Why is it needed to rename the index?
```python
self.data_processing_comp.data_reader_config.init_args["path"] = "google/simpleqa-verified"
self.data_processing_comp.data_reader_config.init_args["split"] = "eval"
num_transforms = len(self.data_processing_comp.data_reader_config.init_args["transform"].transforms)
self.data_processing_comp.data_reader_config.init_args["transform"].transforms = self.data_processing_comp.data_reader_config.init_args["transform"].transforms[0:num_transforms-1]  # remove last transform which is SimpleQA_MetadataExplode
```
Can we add a comment explaining why the explode is not necessary in this case? Maybe the verified dataset does not have the metadata column which is being exploded?
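e.g., something like this (the stated reason is only a guess and would need confirming):

```python
# SimpleQA-Verified does not include the metadata column, so the final
# SimpleQA_MetadataExplode transform is not needed here.
# (Assumed reason; please confirm before adding the comment.)
```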
```python
if grading_response is None or str(grading_response) == "nan":
    grade_letter = "C"  # Default to "NOT_ATTEMPTED" if there is no grading response
else:
    match = re.search(r"(A|B|C)", grading_response)
```
If the grading model outputs something other than only one of A, B, or C, then this regex may result in false positives. For example:

```python
import re
re.search(r"(A|B|C)", "A grading response that has some of the letters B, C")
```

would match the article 'A' at the beginning of the sentence. Maybe a more strict regex would do?

```python
match = re.search(r"^(A|B|C)$", grading_response)
```
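A quick sanity check of the stricter pattern:

```python
import re

strict = re.compile(r"^(A|B|C)$")
print(bool(strict.search("A grading response that has some of the letters B, C")))  # False
print(bool(strict.search("B")))  # True
# re.fullmatch(r"A|B|C", text) would be an equivalent alternative.
```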
```python
if self.group_by == 'data_repeat_id':
    self._aggregate(data)
else:
    original_group_by = self.group_by
```
Do we need this `original_group_by` variable? Could we just use `self.group_by` instead?
Maybe we can pull out some of the processing functions into helpers just to make each function smaller?
Example:
```python
import re

import pandas as pd

from eureka_ml_insights.metrics.metrics_base import CompositeMetric
from eureka_ml_insights.metrics.reports import NumericalAggregator

GRADE_REGEX = re.compile(r"^(A|B|C)$")
GRADE_DEFAULT = "C"


def _parse_grade(response):
    """Extract grade A/B/C from the model response."""
    if response is None or pd.isna(response):
        return GRADE_DEFAULT
    match = GRADE_REGEX.search(str(response))
    return match.group(0) if match else GRADE_DEFAULT


class SimpleQA_Metric(CompositeMetric):
    """
    Composite metric for evaluating SimpleQA responses.
    """

    def __evaluate__(self, row):
        return self._process_row(row)

    def _process_row(self, row):
        # Kept as a method so that self._process_row resolves correctly.
        grade = _parse_grade(row["model_output"])
        return {
            "grade": grade,
            "is_correct": grade == "A",
            "is_incorrect": grade == "B",
            "is_not_attempted": grade == "C",
        }


class SQA_CGAAggregator(NumericalAggregator):
    """
    Computes accuracy = correct / attempted,
    where attempted = correct + incorrect.
    """

    def __init__(self, correct_col, incorrect_col, not_attempted_col, output_dir, group_by=None, **kwargs):
        super().__init__(
            [correct_col, incorrect_col, not_attempted_col],
            output_dir,
            group_by=group_by,
            **kwargs,
        )
        self.correct_col = correct_col
        self.incorrect_col = incorrect_col
        self.not_attempted_col = not_attempted_col

    def _compute_accuracy(self, df):
        attempted = df[self.correct_col].sum() + df[self.incorrect_col].sum()
        if attempted == 0:
            return 0.0
        return df[self.correct_col].sum() / attempted

    def _aggregate(self, data):
        self.aggregated_result = {"accuracy_given_attempted": self._compute_accuracy(data)}

    def _aggregate_grouped(self, data):
        results = {name: self._compute_accuracy(group) for name, group in data.groupby(self.group_by)}
        self.aggregated_result = {"accuracy_given_attempted": results}


class SQA_CGAAvgPass1Aggregator(SQA_CGAAggregator):
    """
    Computes accuracy per repeat (if present), then averages across repeats.
    """

    def _aggregate(self, data):
        # Without repeats, fall back to the plain aggregation.
        if "data_repeat_id" not in data.columns:
            return super()._aggregate(data)
        # Temporarily override grouping, restoring it afterwards.
        original_group_by = self.group_by
        self.group_by = "data_repeat_id"
        try:
            super()._aggregate_grouped(data)
        finally:
            self.group_by = original_group_by
        group_results = list(self.aggregated_result["accuracy_given_attempted"].values())
        mean_acc = sum(group_results) / len(group_results) if group_results else 0.0
        self.aggregated_result = {"accuracy_given_attempted": mean_acc}

    def _aggregate_grouped(self, data):
        # If grouping by repeat, use the special logic above.
        if self.group_by == "data_repeat_id":
            return self._aggregate(data)
        # Otherwise, compute accuracy within each external group.
        results = {}
        for name, group in data.groupby(self.group_by):
            super()._aggregate(group)
            results[name] = self.aggregated_result["accuracy_given_attempted"]
        self.aggregated_result = {"accuracy_given_attempted": results}
```
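A quick usage check of the parsing helper above:

```python
print(_parse_grade("B"))    # "B"
print(_parse_grade(None))   # "C" (default when there is no grading response)
print(_parse_grade("A grading response mentioning B"))  # "C": no strict match
```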
This PR onboards SimpleQA based on https://github.com/openai/simple-evals/blob/main/simpleqa_eval.py