
Add normalization of mutual stats to meritrank-service "MUTUAL_SCORES" #10

Open
ichorid opened this issue Oct 3, 2024 · 2 comments · May be fixed by #24
ichorid commented Oct 3, 2024

CMD_MUTUAL_SCORES => {

This command is used in Tentura to show mutual rankings of the ego and peers, e.g., if the ego is Alice, what is the score of Bob from the standpoint of Alice (normal direct MR), and what is the score of Alice from the standpoint of Bob (reverse MR). The idea is that the user is able to evaluate the level of "reciprocity" with their peers. In Tentura, we use a special "dumbbell" representation to show it visually:
(screenshot: the "dumbbell" mutual-score widget in Tentura)
The widget indicates the "intensity" of interaction by the color of the "dumbbell", and the "weights" of personal preferences by the sizes of the circles 🟢-🟢.

The normalization problem

The problem is that MR scores, although bounded to [0, 1], span a very wide dynamic range, which makes them hard to compare due to their very different magnitudes. For instance, Bob's score from Alice's standpoint could be 0.01, while Alice's score from Bob's standpoint could be 0.00001, so the corresponding "dumbbells" are always rendered at the same, minimal size.
The mutual scores can be pictured as a scatter plot with all the points concentrated near zero and heavily skewed away from the diagonal:
(diagram: scatter plot of mutual scores, points clustered near the origin and skewed off the diagonal)

So, we need to come up with a way to normalize these scores into the [0.0, 1.0] range such that the differences remain visible when rendered with the "dumbbell" method.
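To make the magnitude problem concrete, here is a minimal, purely hypothetical baseline (not a proposal for the service's actual scheme): per-ego min-max normalization in log space. Plain linear min-max would collapse scores that differ by orders of magnitude; working in log space keeps them separated:

```python
import math

def log_normalize(scores, eps=1e-9):
    """Map positive MR scores spanning many orders of magnitude into [0, 1].

    Linear min-max would collapse 0.00001 vs 0.01 to nearly the same value;
    taking log10 first preserves their relative separation.
    """
    logs = {peer: math.log10(max(s, eps)) for peer, s in scores.items()}
    lo, hi = min(logs.values()), max(logs.values())
    if hi == lo:  # all scores equal: nothing to spread out
        return {peer: 1.0 for peer in scores}
    return {peer: (v - lo) / (hi - lo) for peer, v in logs.items()}

# The two example scores from above now land at opposite ends of [0, 1]:
mutual = {"bob_from_alice": 0.01, "alice_from_bob": 0.00001}
print(log_normalize(mutual))  # → {'bob_from_alice': 1.0, 'alice_from_bob': 0.0}
```

Note this is still per-ego relative: it spreads out one ego's scores but says nothing about reciprocity across egos, which is part of why a grading scheme like the one discussed below may be preferable.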

@vberezhnev self-assigned this Oct 4, 2024

ichorid commented Oct 11, 2024

One possible approach, as suggested by @alexandrim0, is to apply k-means: take a fixed number of score results (e.g., the first N), group them into a fixed number of groups (k), and assign "grades" to nodes based on their group number. Something like:

get_scores(N=10, k=3) — "get the first N=10 scores and group them into k=3 grades"
The result would be:

  • "alice": 1
  • "bob": 1
  • "carol": 2
  • "sybil": 3
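The grouping above can be sketched as follows. Since the data is 1-dimensional, k-means clusters are always contiguous runs of the sorted scores, so for the small N fetched per call the optimal split can simply be brute-forced. All names here (`grade_scores`, the sample scores) are hypothetical, not the service's actual API:

```python
from itertools import combinations

def best_1d_clusters(values, k):
    """Optimal 1-D k-means: clusters are contiguous runs of the sorted
    values, so brute-force the k-1 split points (fine for small N)."""
    xs = sorted(values)

    def sse(chunk):  # within-cluster sum of squared errors
        m = sum(chunk) / len(chunk)
        return sum((x - m) ** 2 for x in chunk)

    best = None
    for splits in combinations(range(1, len(xs)), k - 1):
        bounds = [0, *splits, len(xs)]
        chunks = [xs[a:b] for a, b in zip(bounds, bounds[1:])]
        cost = sum(sse(c) for c in chunks)
        if best is None or cost < best[0]:
            best = (cost, chunks)
    return best[1]  # chunks in ascending score order

def grade_scores(scores, k=3):
    """Grade 1 = best cluster, ..., grade k = worst."""
    chunks = best_1d_clusters(scores.values(), k)
    grade_of = {}
    for gi, chunk in enumerate(reversed(chunks), start=1):
        for v in chunk:
            grade_of[v] = gi
    return {peer: grade_of[s] for peer, s in scores.items()}

scores = {"alice": 0.9, "bob": 0.88, "carol": 0.3, "sybil": 0.001}
print(grade_scores(scores, k=3))
# → {'alice': 1, 'bob': 1, 'carol': 2, 'sybil': 3}
```

Note that Alice and Bob land in the same grade despite different raw scores: the absolute magnitudes are discarded, which is exactly what makes the grades comparable across egos.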


ichorid commented Nov 4, 2024

Recalculating the k-means grouping over all scores from an ego's standpoint on every request is too expensive: we need to optimize/approximate the process. One way to do it is as follows:

  1. Run the full k-means grouping only rarely;
  2. Cache the "group bounds" after each full grouping (e.g., |---g1--|-g2---|----g3|); this is possible because our data is one-dimensional;
  3. On object retrieval, compare the object's score to the cached bounds and assign it to the corresponding group.

Of course, this algorithm is an approximation, and it is only valid under the hypothesis that the distribution of an ego's scores changes slowly. (I suggest talking to ChatGPT about devising proper formulae for calculating the group boundaries, and for estimating when to do a full recalculation.)
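Steps 2–3 could look roughly like this (all names hypothetical; the bounds would be produced by the last full k-means run, e.g. as midpoints between adjacent cluster edges):

```python
import bisect

class GradeCache:
    """Cached 1-D group bounds from the last full k-means run.

    `bounds` are ascending score thresholds: a score below bounds[0] falls
    into the worst group, above bounds[-1] into the best. Valid only while
    the ego's score distribution drifts slowly.
    """

    def __init__(self, bounds):
        self.bounds = sorted(bounds)      # e.g. [0.15, 0.6] for k=3
        self.k = len(self.bounds) + 1

    def grade(self, score):
        # bisect counts how many bounds the score exceeds; invert so that
        # grade 1 is the best group and grade k the worst.
        return self.k - bisect.bisect_right(self.bounds, score)

cache = GradeCache(bounds=[0.15, 0.6])    # refreshed by the rare full run
print(cache.grade(0.9))    # → 1 (best)
print(cache.grade(0.3))    # → 2
print(cache.grade(0.001))  # → 3 (worst)
```

Each lookup is then O(log k) against the cached bounds instead of a full regrouping, at the cost of grades going stale until the next full recalculation.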
