Adding in Hellinger and pMSE metrics #38

emersodb · 2025-09-17T13:32:55Z

PR Type

Feature

Short Description

Clickup Ticket(s): https://app.clickup.com/t/868fge8mt

Adds two additional quality metrics from SynthEval to the library: the mean Hellinger distance and the mean propensity MSE (pMSE). These are the penultimate quality metrics to be added; only an F1 measure remains, and the next ticket will cover it to close off the quality measures.

NOTE: The SynthEval implementation of the Hellinger distance for numerical columns was flawed, so I brought the implementation in-house and fixed the issue.

Tests Added

Several tests have been added to check that the measures work as expected and produce the correct values across a range of situations.

emersodb · 2025-09-17T13:33:25Z

pyproject.toml

    "D104", # Ignore package level docstrings requirement
    "D205", # 1 blank line required between summary line and description
    "D212", # Multi-line docstring summary should start at the first line
+    "D301", # r-strings for docstrings with backslashes


This ignore needs to be brought in if we're going to have LaTeX in our docstrings.
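For context, here is a small sketch (not taken from this PR, function names are hypothetical) of the two ways to write a LaTeX command in a docstring. D301 asks for the raw-string form whenever a docstring contains a backslash. The escaped form used in this PR's docstrings produces the same text, which is why the rule can be ignored.

```python
def escaped_docstring() -> None:
    """Factor: $\\frac{1}{2}$."""


def raw_docstring() -> None:
    r"""Factor: $\frac{1}{2}$."""


# Both docstrings hold the same text: a single backslash before "frac".
# Without the doubling (or the r-prefix), "\f" would be read as a form feed.
assert escaped_docstring.__doc__ == raw_docstring.__doc__
print(raw_docstring.__doc__)
```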

bzamanlooy

I just have a few questions that I added in the review :)

src/midst_toolkit/evaluation/quality/mean_hellinger_distance.py

src/midst_toolkit/evaluation/quality/mean_propensity_mse.py

fatemetkl

LGTM! Thorough tests and very easy-to-follow documentation.
I’ve just added a few minor comments.

fatemetkl · 2025-09-29T18:57:46Z

src/midst_toolkit/evaluation/quality/mean_hellinger_distance.py

+    """
+    Compute the empirical Hellinger distance between two discrete probability distributions. Hellinger distance for
+    discrete probability distributions $p$ and $q$ is expressed as
+    $$\\frac{1}{2} \\cdot \\Vert \\sqrt{p} - \\sqrt{q} \\Vert_2$$.


Minor: the leading factor should be $$\frac{1}{\sqrt{2}}$$ rather than $$\frac{1}{2}$$ in this docstring.
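As a quick sanity check of the $\frac{1}{\sqrt{2}}$ normalization, here is a minimal standalone sketch using only the standard library. It is not the implementation in this PR: the function name `hellinger_distance_sketch` and the list-based inputs are assumptions made for illustration. With this factor the distance falls in $[0, 1]$, reaching 1 for distributions with disjoint support.

```python
import math
from collections.abc import Sequence


def hellinger_distance_sketch(p: Sequence[float], q: Sequence[float]) -> float:
    """Hellinger distance between two discrete distributions over the same bins."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of bins.")
    # Squared Euclidean norm of sqrt(p) - sqrt(q).
    squared_norm = sum((math.sqrt(p_i) - math.sqrt(q_i)) ** 2 for p_i, q_i in zip(p, q))
    # Dividing by sqrt(2) keeps the result in [0, 1].
    return math.sqrt(squared_norm) / math.sqrt(2)


identical = hellinger_distance_sketch([0.5, 0.5], [0.5, 0.5])
disjoint = hellinger_distance_sketch([1.0, 0.0], [0.0, 1.0])
print(identical, disjoint)
```

With a factor of $\frac{1}{2}$ instead, the disjoint case would top out at $\frac{\sqrt{2}}{2} \approx 0.707$ rather than 1.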

fatemetkl · 2025-09-29T22:07:05Z

src/midst_toolkit/evaluation/quality/mean_propensity_mse.py

+        ``do_preprocess`` is True, the default Syntheval pipeline is used, which does NOT one-hot encode the
+        categoricals.


Maybe we can say "which performs 'OrdinalEncoding' for categorical columns and 'MinMaxScaling' for numerical columns" to clarify what exactly happens.
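To make the distinction concrete, here is a rough sketch of what ordinal encoding and min-max scaling do in general. It is a generic illustration, not SynthEval's actual pipeline. The helper names `ordinal_encode` and `min_max_scale` are hypothetical, and the choice to order categories alphabetically is an assumption.

```python
def ordinal_encode(values: list[str]) -> list[int]:
    """Map each category to one integer code (codes follow sorted category names)."""
    codes = {category: code for code, category in enumerate(sorted(set(values)))}
    return [codes[value] for value in values]


def min_max_scale(values: list[float]) -> list[float]:
    """Rescale values linearly to [0, 1]; a constant column maps to all zeros."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


# A categorical column stays a single integer column instead of three one-hot columns.
encoded = ordinal_encode(["red", "blue", "red", "green"])
scaled = min_max_scale([10.0, 15.0, 20.0])
print(encoded, scaled)
```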

fatemetkl · 2025-09-29T22:27:00Z

src/midst_toolkit/evaluation/quality/mean_hellinger_distance.py

+            The mean of the individual Hellinger distances between each of the corresponding columns of the real and
+            synthetic dataframes. This mean is keyed by 'mean_hellinger_distance' and is reported along with the
+            "standard error" associated with that mean keyed under 'hellinger_standard_error'.
+        """


Probably not for this PR, and it might already be covered, but at some point it would be nice to check that the categorical and numerical columns actually exist in the provided real and synthetic data.
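Something along these lines might work as a starting point. This is only a sketch: the function name `validate_columns` and its arguments are hypothetical, and it takes plain lists of column names (for example `list(df.columns)`) so that it runs without pandas.

```python
def validate_columns(
    real_columns: list[str],
    synthetic_columns: list[str],
    categorical_columns: list[str],
    numerical_columns: list[str],
) -> None:
    """Raise a ValueError if any requested column is absent from either dataset."""
    requested = [*categorical_columns, *numerical_columns]
    for name, available in (("real", real_columns), ("synthetic", synthetic_columns)):
        available_set = set(available)
        missing = [column for column in requested if column not in available_set]
        if missing:
            raise ValueError(f"Columns missing from {name} data: {missing}")


# Passes silently when every requested column exists in both datasets.
validate_columns(["age", "city"], ["age", "city"], ["city"], ["age"])
```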

fatemetkl · 2025-09-29T22:29:38Z

src/midst_toolkit/evaluation/quality/mean_hellinger_distance.py

+                distance = hellinger_distance(real_discrete_pdf, synthetic_discrete_pdf)
+                hellinger_distances.append(distance)
+
+        mean_hellinger_distance = np.mean(hellinger_distances).item()


If there are no categorical columns and include_numerical_columns is False, it will return nan. So, we might want to have a warning message.
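One possible shape for that guard, sketched with the standard library only. The helper name `mean_or_nan` and the warning text are assumptions, not code from this PR.

```python
import math
import warnings


def mean_or_nan(distances: list[float]) -> float:
    """Return the mean of the distances, or NaN with a warning when there are none."""
    if not distances:
        warnings.warn(
            "No columns were evaluated (no categorical columns and numerical "
            "columns excluded); returning NaN.",
            UserWarning,
            stacklevel=2,
        )
        return math.nan
    return sum(distances) / len(distances)


print(mean_or_nan([0.1, 0.3]))
```

Raising a `ValueError` instead would be a stricter alternative if an empty column set should never be accepted.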

First checkin of hellinger and pmse implementations

adae16e

emersodb requested review from amrit110, lotif, fatemetkl, sarakodeiri, masi-sh and bzamanlooy September 17, 2025 13:32

emersodb commented Sep 17, 2025

emersodb marked this pull request as ready for review September 17, 2025 13:35

emersodb added 4 commits September 17, 2025 09:43

Fix typing issue

a790991

Merge branch 'main' into dbe/add_hellinger_pmse

1705d50

New mypy flow and fixes to typing issues that were discovered

59ea7f4

Merge branch 'dbe/fixing_mypy' into dbe/add_hellinger_pmse

660729f

emersodb changed the base branch from main to dbe/fixing_mypy September 24, 2025 15:22

Base automatically changed from dbe/fixing_mypy to main September 24, 2025 15:57

emersodb added 3 commits September 24, 2025 11:57

Merge branch 'main' into dbe/add_hellinger_pmse

5bd360c

Merge branch 'main' into dbe/add_hellinger_pmse

8328547

Merge branch 'main' into dbe/add_hellinger_pmse

2de4692

bzamanlooy reviewed Sep 26, 2025

emersodb added 3 commits September 26, 2025 16:03

Merge branch 'main' into dbe/add_hellinger_pmse

8918c35

Merge branch 'main' into dbe/add_hellinger_pmse

b2c5de5

Addressing some PR comments

3defbf2

fatemetkl approved these changes Sep 29, 2025
