SKG on elasticsearch #196

Open · lschneidpro opened this issue Sep 9, 2024 · 4 comments

@lschneidpro

Hi everyone,

I'm currently reading the book but using Elasticsearch instead of Solr. I attempted to reimplement the Semantic Knowledge Graph (SKG) on ES, and developed a custom scoring script for Elasticsearch's significant text aggregation, inspired by the original Solr code found here. So far, I've been able to achieve the same scores as those in the health dataset example. I haven't tested the other cases from the book yet, but I wanted to share my implementation to see if it aligns with the authors' intentions.

script = """
double sigmoid(double x, double offset, double scale) {
    return (x+offset) / (scale + Math.abs(x+offset));
}

double bgProb = params._superset_freq*1.0/params._superset_size;
double num = (params._subset_freq - params._subset_size * bgProb);
double denom = Math.sqrt(params._subset_size * bgProb * (1 - bgProb));
denom = (denom == 0) ? 1e-10 : denom;
double z = num / denom;
double result = 0.2*sigmoid(z, -80, 50)
                + 0.2*sigmoid(z, -30, 30)
                + 0.2*sigmoid(z, 0, 30)
                + 0.2*sigmoid(z, 30, 30)
                + 0.2*sigmoid(z, 80, 50);
return Math.round(result * 1e5)/1e5;
"""

script_heuristic = {
    "script": {
        "lang": "painless",
        "source": script,
    }
}

query_string = "advil"
query = {"match": {"body": query_string}}
aggs = {
    "keywords": {
        "significant_text": {
            "field": "body",
            "min_doc_count": 2,
            "script_heuristic": script_heuristic,
        }
    }
}
resp = client.search(index=alias, query=query, aggs=aggs, size=0)

resulting in:

{'took': 92,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 15, 'relation': 'eq'},
  'max_score': None,
  'hits': []},
 'aggregations': {'keywords': {'doc_count': 15,
   'bg_count': 12892,
   'buckets': [{'key': 'advil',
     'doc_count': 15,
     'score': 0.70986,
     'bg_count': 15},
    {'key': 'motrin', 'doc_count': 9, 'score': 0.59897, 'bg_count': 10},
    {'key': 'aleve', 'doc_count': 4, 'score': 0.4662, 'bg_count': 4},
    {'key': 'ibuprofen', 'doc_count': 13, 'score': 0.38264, 'bg_count': 75},
    {'key': 'alleve', 'doc_count': 2, 'score': 0.36649, 'bg_count': 2},
    {'key': 'tylenol', 'doc_count': 6, 'score': 0.33048, 'bg_count': 23},
    {'key': 'naproxen', 'doc_count': 6, 'score': 0.31226, 'bg_count': 26}]}}}

I appreciate any feedback—thanks!

@lschneidpro (Author)

I’ve been testing the various cases from Chapter 5, and for the vibranium results the scores start to diverge slightly from the book’s, but they remain quite similar overall (example below).

query_string = "vibranium"
query = {"match": {"body": query_string}}
aggs = {
    "keywords": {
        "significant_text": {
            "field": "body",
            "min_doc_count": 2,
            "script_heuristic": script_heuristic,
        }
    }
}
alias="stackexchange"
resp = client.search(index=alias, query=query, aggs=aggs, size=0)
resp.body
{'took': 8,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 281, 'relation': 'eq'},
  'max_score': None,
  'hits': []},
 'aggregations': {'keywords': {'doc_count': 281,
   'bg_count': 1950545,
   'buckets': [{'key': 'vibranium',
     'doc_count': 280,
     'score': 0.95473,
     'bg_count': 843},
    {'key': 'wakandan', 'doc_count': 35, 'score': 0.87018, 'bg_count': 122},
    {'key': 'wakanda', 'doc_count': 48, 'score': 0.85652, 'bg_count': 284},
    {'key': "panther's", 'doc_count': 14, 'score': 0.85428, 'bg_count': 25},
    {'key': 'klaue', 'doc_count': 12, 'score': 0.85196, 'bg_count': 19},
    {'key': 'maclain', 'doc_count': 11, 'score': 0.84754, 'bg_count': 17},
    {'key': 'adamantium', 'doc_count': 93, 'score': 0.847, 'bg_count': 1221},
    {'key': 'klaw', 'doc_count': 15, 'score': 0.82973, 'bg_count': 40},
    {'key': 'panther', 'doc_count': 36, 'score': 0.82165, 'bg_count': 254},
    {'key': 'alloy', 'doc_count': 53, 'score': 0.81535, 'bg_count': 592}]}}}

As for the Star Wars content-based recommendation, I wasn’t entirely sure how to approach it. I ended up passing the individual tokens of the document to the aggregation’s include filter, which isn’t ideal since multi-word terms like ‘Princess Leia’ have to be split. Still, I’m seeing similar results (example below). Let me know your thoughts.

parsed_document = ["this", "doc", "contains", "the", "words", "luke", 
            "magneto", "cyclops", "darth", "vader", "princess","leia", 
            "wolverine", "apple", "banana", "galaxy", "force", 
            "blaster", "and", "chloe"]

query_string = "star wars"
query = {
    "match": {
        "body": {
            "query": query_string,
            "operator": "AND",
        }
    }
}
aggs = {
    "keywords": {
        "significant_text": {
            "field": "body",
            "script_heuristic": script_heuristic,
            "include": parsed_document,
        }
    }
}
alias="stackexchange"
resp = client.search(index=alias, query=query, aggs=aggs, size=0)
resp.body
{'took': 446,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 6829, 'relation': 'eq'},
  'max_score': None,
  'hits': []},
 'aggregations': {'keywords': {'doc_count': 6829,
   'bg_count': 1950545,
   'buckets': [{'key': 'luke',
     'doc_count': 1157,
     'score': 0.77982,
     'bg_count': 15452},
    {'key': 'force', 'doc_count': 1960, 'score': 0.76475, 'bg_count': 47672},
    {'key': 'darth', 'doc_count': 923, 'score': 0.73712, 'bg_count': 13985},
    {'key': 'vader', 'doc_count': 936, 'score': 0.72128, 'bg_count': 15980},
    {'key': 'leia', 'doc_count': 533, 'score': 0.70443, 'bg_count': 6048},
    {'key': 'galaxy', 'doc_count': 858, 'score': 0.64305, 'bg_count': 20692},
    {'key': 'blaster', 'doc_count': 211, 'score': 0.51115, 'bg_count': 2572},
    {'key': 'princess', 'doc_count': 225, 'score': 0.38521, 'bg_count': 6076},
    {'key': 'this', 'doc_count': 4136, 'score': 0.19193, 'bg_count': 927850},
    {'key': 'the',
     'doc_count': 6735,
     'score': 0.17519,
     'bg_count': 1801230}]}}}

Based on the Solr documentation and the code, it appears that in the Star Wars case relatedness is computed per token by comparing a foreground set defined by the query and context filters against the background collection, if I understand correctly. Let me know if it’s worth exploring the Elasticsearch API further to implement the exact method for calculating relatedness.
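
For reference, here’s a plain-Python version of the same scoring, which is handy for sanity-checking the Painless script outside Elasticsearch. The mapping of the aggregation counts onto the script parameters is my own reading of the significant_text output, so treat it as a sketch:

import math

def sigmoid(x, offset, scale):
    # same smoothed step function as in the Painless script
    return (x + offset) / (scale + abs(x + offset))

def relatedness(subset_freq, subset_size, superset_freq, superset_size):
    # z-score of the foreground term count vs. its background expectation,
    # blended through five sigmoids, exactly as in the script above
    bg_prob = superset_freq / superset_size
    num = subset_freq - subset_size * bg_prob
    denom = math.sqrt(subset_size * bg_prob * (1 - bg_prob)) or 1e-10
    z = num / denom
    weights = [(-80, 50), (-30, 30), (0, 30), (30, 30), (80, 50)]
    return round(sum(0.2 * sigmoid(z, o, s) for o, s in weights), 5)

# counts taken from the "advil" bucket in the first response above:
# bucket doc_count / bg_count plus the aggregation-level doc_count / bg_count
print(relatedness(subset_freq=15, subset_size=15,
                  superset_freq=15, superset_size=12892))  # ~0.71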

@treygrainger (Owner) commented Sep 14, 2024

That's really cool @lschneidpro! I probably won't have time to review this for the next month (the book is being released and I'll be traveling to speak at a bunch of conferences), but I'll definitely add it to my list of things to review once I free up.

Out of curiosity, does this (or could it conceivably) handle the multi-level traversals (like the query disambiguation examples in chapter 7)?

I'd definitely be interested in getting code for this working for Elasticsearch and OpenSearch users. If you can get the multi-level aggregations working, and the results stay consistent between Solr and Elasticsearch/OpenSearch, I think a lot of people would be interested.

@lschneidpro (Author)

Hi @treygrainger,

Thanks for your feedback!

I'm about to go on vacation, so no worries. Currently, the implementation doesn't support multi-level traversals. To fully understand the functionality, I'll need to dive deeper into the SKG academic paper and the Solr code.

So far, I've been using Elasticsearch's Significant Terms and Significant Text aggregations. These compute foreground and background statistics based on the query, and I use a custom script (your SKG code) to derive a custom score. By the way, in my tests the Solr implementation runs faster than the Elasticsearch options.

I’m not an Elasticsearch expert, so I’m unsure how to implement SKG fully without developing a dedicated plugin. I can reach out to Elasticsearch support, or perhaps you or someone on your team with more Elasticsearch expertise could provide some guidance.

Here are the options I see moving forward:

  • Handle multi-level traversal on the client side using a similar approach.
  • Use another Elasticsearch API, such as Term Vectors, or something else I might be overlooking.
  • Develop a custom plugin, though this would require significant effort.

As for query disambiguation, I think sub-aggregations could work. The first aggregation level would target categories, while the second level would apply the classic significant text aggregation within each category bucket. I'll experiment with this when I'm back.
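
Roughly what I have in mind for the disambiguation case (untested, and the "category" field name is just a placeholder):

# Untested sketch: a terms aggregation over a hypothetical "category" keyword field,
# with the same significant_text + script_heuristic applied inside each category bucket.
aggs = {
    "categories": {
        "terms": {"field": "category", "size": 10},  # placeholder field name
        "aggs": {
            "keywords": {
                "significant_text": {
                    "field": "body",
                    "min_doc_count": 2,
                    "script_heuristic": script_heuristic,
                }
            }
        },
    }
}
resp = client.search(index=alias, query=query, aggs=aggs, size=0)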

Best,

@lschneidpro (Author)

@treygrainger any updates? Thanks
