
wmedian: the first weight is ignored when ProbabilityWeights is used #933

Open

perrette opened this issue Jun 13, 2024 · 2 comments

@perrette
Hi, I clearly lack subtlety about the various definitions of weighted quantiles (and I passed quickly over the above discussion as a result), but I thought I'd share another, more obvious example of what the current implementation is doing:

The frequency weights seem to do what I'd expect:

julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.FrequencyWeights([1000, 1, 1, 1]))
1.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.FrequencyWeights([1, 1000, 1, 1]))
2.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.FrequencyWeights([1, 1, 1000, 1]))
3.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.FrequencyWeights([1, 1, 1, 1000]))
4.0

unlike the ProbabilityWeights:

julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1000, 1, 1, 1]/1003))
2.500000000000056
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1, 1000, 1, 1]/1003))
1.501
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1, 1, 1000, 1]/1003))
2.5
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1, 1, 1, 1000]/1003))
3.499

which seems to:

  • purely and simply ignore the first weight (isn't that a bug? It's hard to believe it comes down to a matter of definition, unless it's a definition for continuous distributions applied bluntly to discrete arrays -- I ask the author(s) to please excuse me, I mean no offense)
  • take the mean of whatever two values the (weighted) median falls between, so those values themselves can never be returned exactly
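For comparison, here is a minimal inverse-CDF weighted median in Python (a sketch of the behavior I would expect, not StatsBase's implementation) that reproduces the FrequencyWeights results above: the result is the first sorted value whose cumulative weight reaches half the total.

```python
import numpy as np

def inv_cdf_median(values, weights):
    # Sort by value and carry the weights along.
    order = np.argsort(values)
    v = np.asarray(values)[order]
    w = np.asarray(weights, dtype=float)[order]
    # Inverse-CDF (Type-1-like) rule: first value whose cumulative
    # weight reaches half of the total weight.
    cum = np.cumsum(w)
    return v[np.searchsorted(cum, 0.5 * cum[-1])]

print(inv_cdf_median([1, 2, 3, 4], [1000, 1, 1, 1]))  # -> 1
print(inv_cdf_median([1, 2, 3, 4], [1, 1, 1000, 1]))  # -> 3
```

Under this rule the first weight matters just as much as the others, and the result varies continuously (in the sense of staying on actual data points) as the weights concentrate on one element.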

Worse, it is numerically inaccurate:

julia> eps = 1e-50;
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1-3*eps, eps, eps, eps]))
4.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, 1-3*eps, eps, eps]))
1.5
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, eps, 1-3*eps, eps]))
2.5
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, eps, eps, 1-3*eps]))
3.5

whereas setting eps to exactly zero yields the same (and, for me, expected) result as FrequencyWeights:

julia> eps = 0.;
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1-3*eps, eps, eps, eps]))
1.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, 1-3*eps, eps, eps]))
2.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, eps, 1-3*eps, eps]))
3.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, eps, eps, 1-3*eps]))
4.0

IMO the above shows surprising results that go beyond the difference between various definitions: in particular, the first weight is ignored, and there is a discontinuity in the limit where the weights become concentrated on one element (granted, discontinuities are part of mathematics, but they often don't help when analyzing real data).

Anyway, for me the workaround will be to multiply my weights by a large number, convert to integers, and use FrequencyWeights instead of ProbabilityWeights.
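In Python terms, that workaround amounts to something like the following sketch (the `scale` parameter and `freq_median` name are my own, for illustration): approximate the probability weights by integer frequencies, expand, and take a plain median.

```python
import numpy as np

def freq_median(values, prob_weights, scale=10**6):
    # Approximate probability weights by integer frequencies; the
    # rounding error is bounded by 0.5 / scale per weight.
    counts = np.rint(np.asarray(prob_weights) * scale).astype(int)
    # Expand each value according to its frequency, then take the
    # ordinary median of the expanded sample.
    expanded = np.repeat(values, counts)
    return np.median(expanded)

print(freq_median([1, 2, 3, 4], np.array([1000, 1, 1, 1]) / 1003))  # -> 1.0
```

This trades memory (the expanded array) for the FrequencyWeights behavior; it is a pragmatic approximation, not a fix of the underlying definition.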

Originally posted by @perrette in #435 (comment)

@perrette perrette changed the title The first weight is ignored when ProbabilityWeights is used wmedian: the first weight is ignored when ProbabilityWeights is used Jun 13, 2024
@matthieugomez
Contributor

matthieugomez commented Jul 16, 2024
I am the author of the function and I agree with all these issues: they all come down to the arbitrariness of the first weight. Additionally, the part of the weighted quantile function that removes zero weights is probably wrong, since true zeros and values merely close to zero will give discontinuous results.

That being said, I cannot think of a good fix. I just can't see how the quantile definition used by Julia can be naturally extended to weighted vectors without implying weird edge cases. If you can find a more natural way, feel free to open a PR. Multiplying all weights by a large number so that they become integers would not really solve your issues either, since replacing a 0 weight by eps() would still lead to very different results.

@perrette
Author

perrette commented Jul 25, 2024

Hi @matthieugomez, thanks for replying. I am not familiar with Julia's quantile function, but in case it can advance this discussion in any way, here is the function I use in Python:

import numpy as np

def weighted_quantiles(values, weights, quantiles, interpolate=False):
    """
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.ones(4), 0.5)
    2 
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.ones(4), 0.5, interpolate=True)
    2.5
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.array([1000, 1, 1, 1]), 0.5)
    1
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.array([1, 1000, 1, 1]), 0.5)
    2
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.array([1, 1, 1000, 1]), 0.5)
    3
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.array([1, 1, 1, 1000]), 0.5)    
    4
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.array([1000, 1, 1, 1]), 0.5, interpolate=True)
    1.002997002997003
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.array([1, 1000, 1, 1]), 0.5, interpolate=True)
    2.000999000999001
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.array([1, 1, 1000, 1]), 0.5, interpolate=True)
    2.999000999000999
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.array([1, 1, 1, 1000]), 0.5, interpolate=True)    
    3.9970029970029968
    """
    # Sort values and carry the weights along
    i = values.argsort()
    sorted_weights = weights[i]
    sorted_values = values[i]
    Sn = sorted_weights.cumsum()

    if interpolate:
        # Plotting positions: midpoints of the cumulative weights,
        # normalized by the total weight, then linear interpolation
        Pn = (Sn - sorted_weights / 2) / Sn[-1]
        return np.interp(quantiles, Pn, sorted_values)
    else:
        # Inverse-CDF rule: first value whose cumulative weight reaches
        # the requested fraction of the total weight
        return sorted_values[np.searchsorted(Sn, quantiles * Sn[-1])]

After a discussion here

The interpolate=True version serves my practical purpose well (I also have an equivalent, badly implemented Julia function that I do not dare share here).
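For what it's worth, the interpolation rule above is continuous in the limit of vanishing weights, unlike the behavior reported at the top of this issue. Here is a standalone check (my own condensed re-statement of the same Pn formula, fixed at quantile 0.5):

```python
import numpy as np

def interp_median(values, weights):
    # Same rule as weighted_quantiles(..., interpolate=True):
    # plotting positions Pn = (cumsum(w) - w/2) / sum(w),
    # then linear interpolation at probability 0.5.
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    w = np.asarray(weights, dtype=float)[order]
    cum = np.cumsum(w)
    pn = (cum - w / 2) / cum[-1]
    return np.interp(0.5, pn, v)

eps = 1e-50
print(interp_median([1, 2, 3, 4], [1 - 3 * eps, eps, eps, eps]))  # -> 1.0
print(interp_median([1, 2, 3, 4], [0.0, 1.0, 0.0, 0.0]))          # -> 2.0
```

With near-zero weights the result agrees with the exact-zero case, so there is no jump at the limit.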
