emersodb
Collaborator

PR Type

Feature

Short Description

Clickup Ticket(s): https://app.clickup.com/t/868fge8mt

Adding in two additional quality metrics from SynthEval to the library. These are the penultimate quality metrics to be added (just need an F1 measure in the next ticket to close off the quality measures).

NOTE: The SynthEval implementation of the Hellinger distance for numerical columns was flawed. So I brought the implementation in-house and fixed the issue.

Tests Added

A number of tests have been created to make sure the measures work as expected and produce the correct values across a number of situations.

"D104", # Ignore package level docstrings requirement
"D205", # 1 blank line required between summary line and description
"D212", # Multi-line docstring summary should start at the first line
"D301", # r-strings for docstrings with backslashes
Collaborator Author

This needs to be brought in if we're going to have LaTeX in our docstrings.

@emersodb emersodb marked this pull request as ready for review September 17, 2025 13:35
@emersodb emersodb changed the base branch from main to dbe/fixing_mypy September 24, 2025 15:22
Base automatically changed from dbe/fixing_mypy to main September 24, 2025 15:57
Collaborator

@bzamanlooy bzamanlooy left a comment

I just have a few questions that I added in the review :)

Collaborator

@fatemetkl fatemetkl left a comment

LGTM! Thorough tests and very easy-to-follow documentation.
I’ve just added a few minor comments.

"""
Compute the empirical Hellinger distance between two discrete probability distributions. Hellinger distance for
discrete probability distributions $p$ and $q$ is expressed as
$$\\frac{1}{2} \\cdot \\Vert \\sqrt{p} - \\sqrt{q} \\Vert_2$$.
Collaborator

Minor: the leading factor should be $$\frac{1}{\sqrt{2}}$$ rather than $$\frac{1}{2}$$ in this docstring.
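
For reference, a minimal sketch of the discrete Hellinger distance with the $\frac{1}{\sqrt{2}}$ normalization. The function name `hellinger_distance_sketch` is hypothetical and this is not the PR's implementation. It assumes both inputs are already normalized probability vectors over the same bins.

```python
import numpy as np


def hellinger_distance_sketch(p, q) -> float:
    """Hellinger distance between two discrete distributions (illustrative only)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if p.shape != q.shape:
        raise ValueError("p and q must be defined over the same bins")
    # (1 / sqrt(2)) * || sqrt(p) - sqrt(q) ||_2
    return float(np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2.0))
```

With this factor the distance is bounded in [0, 1]: identical distributions give 0 and distributions with disjoint support give 1, whereas a factor of $\frac{1}{2}$ would cap the value at $\frac{1}{\sqrt{2}}$.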

Comment on lines +64 to +65
``do_preprocess`` is True, the default Syntheval pipeline is used, which does NOT one-hot encode the
categoricals.
Collaborator

Maybe we can say "which performs `OrdinalEncoding` for categorical columns and `MinMaxScaling` for numerical columns" to clarify what exactly happens.
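
To make the two terms concrete, here is a hand-rolled sketch of what ordinal encoding and min-max scaling do to a single column. The helpers `ordinal_encode` and `min_max_scale` are hypothetical stand-ins written with NumPy only. They are not the SynthEval pipeline, and details such as how that pipeline fits the encoders are not assumed here.

```python
import numpy as np


def ordinal_encode(values):
    """Map each category to an integer code (categories sorted, codes start at 0)."""
    categories, codes = np.unique(np.asarray(values), return_inverse=True)
    return codes, categories


def min_max_scale(values):
    """Rescale numeric values linearly so the minimum maps to 0 and the maximum to 1."""
    x = np.asarray(values, dtype=float)
    low, high = x.min(), x.max()
    if high == low:
        # Assumption: a constant column is mapped to all zeros.
        return np.zeros_like(x)
    return (x - low) / (high - low)
```

Unlike one-hot encoding, the ordinal encoding keeps each categorical column as a single integer-valued column.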

The mean of the individual Hellinger distances between each of the corresponding columns of the real and
synthetic dataframes. This mean is keyed by 'mean_hellinger_distance' and is reported along with the
"standard error" associated with that mean keyed under 'hellinger_standard_error'.
"""
Collaborator

Probably not for this PR, and it might already be covered, but at some point it would be nice to check that the categorical and numerical columns actually exist in the provided real and synthetic data.
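
A rough sketch of what such a check could look like. The function names and the choice to raise `ValueError` are assumptions made for illustration, not something taken from the library. The column collections would typically come from `real_data.columns` and `synthetic_data.columns`.

```python
from __future__ import annotations

from collections.abc import Iterable


def find_missing_columns(expected: Iterable[str], available: Iterable[str]) -> list[str]:
    """Return the expected column names that are absent from the available ones."""
    available_set = set(available)
    return [column for column in expected if column not in available_set]


def validate_columns(
    categorical_columns: Iterable[str],
    numerical_columns: Iterable[str],
    real_columns: Iterable[str],
    synthetic_columns: Iterable[str],
) -> None:
    """Raise if any declared column is missing from the real or synthetic data."""
    expected = [*categorical_columns, *numerical_columns]
    for name, columns in (("real", real_columns), ("synthetic", synthetic_columns)):
        missing = find_missing_columns(expected, columns)
        if missing:
            raise ValueError(f"Columns missing from {name} data: {missing}")
```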

distance = hellinger_distance(real_discrete_pdf, synthetic_discrete_pdf)
hellinger_distances.append(distance)

mean_hellinger_distance = np.mean(hellinger_distances).item()
Collaborator

If there are no categorical columns and `include_numerical_columns` is False, `hellinger_distances` will be empty and the mean will be nan, so we might want to emit a warning message.
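
One possible shape for that guard, as a sketch only. The helper name `mean_hellinger_or_nan` and the warning text are hypothetical. It assumes the per-column distances are collected in a list, as in the snippet above.

```python
import warnings

import numpy as np


def mean_hellinger_or_nan(hellinger_distances: list) -> float:
    """Average per-column distances, warning explicitly when there is nothing to average."""
    if len(hellinger_distances) == 0:
        warnings.warn(
            "No columns were compared, so the mean Hellinger distance is undefined (nan).",
            UserWarning,
            stacklevel=2,
        )
        return float("nan")
    return np.mean(hellinger_distances).item()
```

Checking for the empty list first also avoids NumPy's own less specific warning about taking the mean of an empty array.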

3 participants