🎉 Indicator semantic search #3628

Merged
Marigold merged 13 commits into master from indicator-search
Dec 27, 2024

@Marigold Marigold commented Nov 26, 2024

Prototype for indicator semantic search. It started as a proof of concept that EmbeddingsModel can handle all indicators in memory, so that we don't have to resort to more complex infrastructure (such as faiss for in-memory similarity search, or a vector database like milvus). This turned out to be fine: a similarity search takes less than 1s, which is acceptable.
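For reference, the in-memory approach boils down to a brute-force cosine similarity scan over all embedding vectors. Below is a minimal sketch in plain Python. The function names `normalize` and `top_k` and the toy vectors are illustrative assumptions, not the actual EmbeddingsModel code.

```python
import math
from typing import List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        # A zero vector has no direction; keep it as zeros (similarity 0 to everything).
        return [0.0 for _ in vec]
    return [x / norm for x in vec]


def top_k(
    query: Sequence[float],
    embeddings: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (index, cosine similarity) pairs for the k vectors most similar to the query."""
    q = normalize(query)
    scored = []
    for idx, emb in enumerate(embeddings):
        e = normalize(emb)
        scored.append((idx, sum(a * b for a, b in zip(q, e))))
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings"; real ones have hundreds of dimensions.
    indicator_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(top_k([1.0, 0.1], indicator_embeddings, k=2))
```

With a large number of indicators, the same scan would normally be done as a single vectorized matrix-vector product rather than a Python loop, but the logic is the same.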

Whether semantic search over indicators is that useful is questionable. It seems that the lowest-hanging fruit lies in improving the existing search in Admin. This is why I've decided to park it for now and haven't spent too much time making it pretty and really accurate. Still, it's a good showcase of what semantic search has to offer and is worth merging, in my opinion.

Embeddings are created from each indicator's title and description (concatenated). It works decently, but there are some issues (see problematic queries at the end). There are a couple of knobs that could make it prettier (well, the UI is still crap... feedback would be much appreciated!). I plan to create a demo and let others know about it in January.

How to review

Running this locally requires calculating embeddings for all indicators, which is time-consuming. I recommend checking out the app on the staging server and trying a few queries.

Problematic queries

  • Query "beer" returns "Coffee" as the most similar indicator. That's because the text for beer indicators is often long, especially once the title is concatenated with the description, which contains tons of irrelevant information. This could possibly be fixed by "summarizing" title & description into a "fixed" length representation.
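One crude way to approximate the "fixed length" idea is to cap how much of the description goes into the text that gets embedded. The sketch below is only an illustration of that: the function name `build_embedding_text`, the word limit of 30, and the sample description are all hypothetical, and plain truncation is a stand-in for real summarization, not what this PR implements.

```python
from typing import Optional


def build_embedding_text(
    title: str,
    description: Optional[str],
    max_description_words: int = 30,  # hypothetical limit, would need tuning
) -> str:
    """Concatenate title and description, keeping only the first words of the description."""
    title = title.strip()
    words = (description or "").split()
    # Truncate long descriptions so they cannot drown out the title.
    short_description = " ".join(words[:max_description_words])
    if not short_description:
        return title
    return f"{title}. {short_description}"


if __name__ == "__main__":
    # Made-up description, standing in for a long indicator description.
    long_description = "Average yearly consumption per person " + "with many extra details " * 20
    print(build_embedding_text("Beer consumption", long_description, max_description_words=8))
```

Whether this actually fixes the "beer" query would have to be checked by recomputing the embeddings and rerunning the search.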

owidbot commented Nov 26, 2024

Quick links (staging server):

Site Dev Site Preview Admin Wizard Docs

Login: ssh owid@staging-site-indicator-search

chart-diff: ✅ No charts for review.
data-diff: ❌ Found differences
= Dataset garden/who/2024-09-09/flu_test
  = Table flu_test
    ~ Dim country
-       - Removed values: 9 / 72627 (0.01%)
                date   country
          2024-12-09    Canada
          2024-11-25      Iran
          2024-12-02 Mauritius
          2024-12-09     Nepal
          2024-11-25    Zambia
    ~ Dim date
-       - Removed values: 9 / 72627 (0.01%)
            country       date
             Canada 2024-12-09
               Iran 2024-11-25
          Mauritius 2024-12-02
              Nepal 2024-12-09
             Zambia 2024-11-25
    ~ Column denomcombined (changed data)
-       - Removed values: 9 / 72627 (0.01%)
            country       date  denomcombined
             Canada 2024-12-09          31336
               Iran 2024-11-25           2734
          Mauritius 2024-12-02             80
              Nepal 2024-12-09             87
             Zambia 2024-11-25             20
        ~ Changed values: 34 / 72627 (0.05%)
          country       date  denomcombined -  denomcombined +
          Bahrain 2024-12-02              113              112
          Nigeria 2024-01-29               19               18
          Nigeria 2024-04-15                6                9
          Nigeria 2024-05-06               16               18
          Nigeria 2024-09-30               22               24
    ~ Column pcnt_poscombined (changed data)
-       - Removed values: 9 / 72627 (0.01%)
            country       date  pcnt_poscombined
             Canada 2024-12-09          3.724151
               Iran 2024-11-25         21.360643
          Mauritius 2024-12-02              6.25
              Nepal 2024-12-09          6.896552
             Zambia 2024-11-25              10.0
        ~ Changed values: 34 / 72627 (0.05%)
          country       date  pcnt_poscombined -  pcnt_poscombined +
          Bahrain 2024-12-02           19.469027           19.642857
          Nigeria 2024-01-29            5.263158            5.555555
          Nigeria 2024-04-15                50.0           33.333332
          Nigeria 2024-05-06                12.5           11.111111
          Nigeria 2024-09-30            9.090909            8.333333
= Dataset garden/who/latest/monkeypox
  = Table monkeypox
    ~ Dim country
-       - Removed values: 3927 / 97705 (4.02%)
                date  country
          2024-11-18   Canada
          2024-11-23   Canada
          2024-11-28   Europe
          2023-05-29  Lebanon
          2024-11-27 Zimbabwe
    ~ Dim date
-       - Removed values: 3927 / 97705 (4.02%)
           country       date
            Canada 2024-11-18
            Canada 2024-11-23
            Europe 2024-11-28
           Lebanon 2023-05-29
          Zimbabwe 2024-11-27
    ~ Column annotation (changed data)
-       - Removed values: 3927 / 97705 (4.02%)
           country       date annotation
            Canada 2024-11-18           
            Canada 2024-11-23           
            Europe 2024-11-28           
           Lebanon 2023-05-29           
          Zimbabwe 2024-11-27
    ~ Column iso_code (changed data)
-       - Removed values: 3927 / 97705 (4.02%)
           country       date iso_code
            Canada 2024-11-18      CAN
            Canada 2024-11-23      CAN
            Europe 2024-11-28 OWID_EUR
           Lebanon 2023-05-29      LBN
          Zimbabwe 2024-11-27      ZWE
        ~ Changed values: 3 / 97705 (0.00%)
                           country       date iso_code - iso_code +
                           Burundi 2024-11-13        BDI        NaN
          Central African Republic 2024-11-13        CAF        NaN
                           Liberia 2024-11-13        LBR        NaN
    ~ Column new_cases (changed data)
-       - Removed values: 3927 / 97705 (4.02%)
           country       date  new_cases
            Canada 2024-11-18          0
            Canada 2024-11-23          0
            Europe 2024-11-28          0
           Lebanon 2023-05-29          0
          Zimbabwe 2024-11-27          0
        ~ Changed values: 396 / 97705 (0.41%)
          country       date  new_cases -  new_cases +
           Brazil 2023-09-26           11           12
           Brazil 2024-08-27           82           81
            World 2022-06-07          235          238
            World 2022-11-08          794          795
            World 2022-11-15          668          670
    ~ Column new_cases_per_million (changed data)
-       - Removed values: 3927 / 97705 (4.02%)
           country       date  new_cases_per_million
            Canada 2024-11-18                    0.0
            Canada 2024-11-23                    0.0
            Europe 2024-11-28                    0.0
           Lebanon 2023-05-29                    0.0
          Zimbabwe 2024-11-27                    0.0
        ~ Changed values: 342 / 97705 (0.35%)
                           country       date  new_cases_per_million -  new_cases_per_million +
                            Africa 2024-03-24                    0.046                    0.057
                            Canada 2024-07-23                     0.46                    0.383
          Central African Republic 2024-11-13                      0.0                     <NA>
                     United States 2024-09-17                    0.129                     0.12
                     United States 2024-09-24                    0.167                    0.137
    ~ Column new_cases_smoothed (changed data)
-       - Removed values: 3927 / 97705 (4.02%)
           country       date  new_cases_smoothed
            Canada 2024-11-18                0.43
            Canada 2024-11-23                 0.0
            Europe 2024-11-28                 0.0
           Lebanon 2023-05-29                 0.0
          Zimbabwe 2024-11-27                 0.0
        ~ Changed values: 2425 / 97705 (2.48%)
                country       date  new_cases_smoothed -  new_cases_smoothed +
                 Canada 2024-10-23                  1.57                   0.0
          North America 2022-11-04             63.139999                  63.0
                  World 2022-06-21            262.429993                 263.0
                  World 2022-07-06            535.710022            537.140015
                  World 2022-09-13            511.570007            510.709991
    ~ Column new_cases_smoothed_per_million (changed data)
-       - Removed values: 3927 / 97705 (4.02%)
           country       date  new_cases_smoothed_per_million
            Canada 2024-11-18                           0.011
            Canada 2024-11-23                             0.0
            Europe 2024-11-28                             0.0
           Lebanon 2023-05-29                             0.0
          Zimbabwe 2024-11-27                             0.0
        ~ Changed values: 1715 / 97705 (1.76%)
                           country       date  new_cases_smoothed_per_million -  new_cases_smoothed_per_million +
                            Africa 2024-01-20                             0.008                             0.007
                            Canada 2022-07-12                             0.486                              0.49
          Central African Republic 2024-08-03                             0.217                             0.241
                             Ghana 2024-10-29                             0.004                               0.0
                     North America 2024-09-25                             0.018                             0.012
    ~ Column new_deaths (changed data)
-       - Removed values: 3927 / 97705 (4.02%)
           country       date  new_deaths
            Canada 2024-11-18           0
            Canada 2024-11-23           0
            Europe 2024-11-28           0
           Lebanon 2023-05-29           0
          Zimbabwe 2024-11-27           0
        ~ Changed values: 15 / 97705 (0.02%)
                           country       date  new_deaths -  new_deaths +
          Central African Republic 2024-11-13             0          <NA>
                           Liberia 2024-11-13             0          <NA>
                     South America 2022-07-19             1             0
                     South America 2022-09-20             2             1
                             World 2024-10-22             1             0
    ~ Column new_deaths_per_million (changed data)
-       - Removed values: 3927 / 97705 (4.02%)
           country       date  new_deaths_per_million
            Canada 2024-11-18                     0.0
            Canada 2024-11-23                     0.0
            Europe 2024-11-28                     0.0
           Lebanon 2023-05-29                     0.0
          Zimbabwe 2024-11-27                     0.0
        ~ Changed values: 15 / 97705 (0.02%)
                           country       date  new_deaths_per_million -  new_deaths_per_million +
          Central African Republic 2024-11-13                       0.0                      <NA>
                           Liberia 2024-11-13                       0.0                      <NA>
                     South America 2022-07-19                   0.00229                       0.0
                     South America 2022-09-20                   0.00457                   0.00229
                             World 2024-10-22                   0.00012                       0.0
    ~ Column new_deaths_smoothed (changed data)
-       - Removed values: 3927 / 97705 (4.02%)
           country       date  new_deaths_smoothed
            Canada 2024-11-18                  0.0
            Canada 2024-11-23                  0.0
            Europe 2024-11-28                  0.0
           Lebanon 2023-05-29                  0.0
          Zimbabwe 2024-11-27                  0.0
        ~ Changed values: 87 / 97705 (0.09%)
                country       date  new_deaths_smoothed -  new_deaths_smoothed +
                Ecuador 2022-07-19                   0.14                    0.0
                Ecuador 2022-09-24                   0.14                    0.0
          South America 2022-07-22                   0.14                    0.0
          South America 2023-06-28                   0.14                    0.0
                  World 2022-09-20                   1.57                   1.43
    ~ Column new_deaths_smoothed_per_million (changed data)
-       - Removed values: 3927 / 97705 (4.02%)
           country       date  new_deaths_smoothed_per_million
            Canada 2024-11-18                              0.0
            Canada 2024-11-23                              0.0
            Europe 2024-11-28                              0.0
           Lebanon 2023-05-29                              0.0
          Zimbabwe 2024-11-27                              0.0
        ~ Changed values: 87 / 97705 (0.09%)
                country       date  new_deaths_smoothed_per_million -  new_deaths_smoothed_per_million +
                Ecuador 2022-07-19                            0.00777                                0.0
                Ecuador 2022-09-24                            0.00776                                0.0
          South America 2022-07-22                            0.00032                                0.0
          South America 2023-06-28                            0.00032                                0.0
                  World 2022-09-20                             0.0002                            0.00018
    ~ Column total_cases (changed data)
-       - Removed values: 3927 / 97705 (4.02%)
           country       date  total_cases
            Canada 2024-11-18         1836
            Canada 2024-11-23         1836
            Europe 2024-11-28        28064
           Lebanon 2023-05-29           27
          Zimbabwe 2024-11-27            2
        ~ Changed values: 6786 / 97705 (6.95%)
                country       date  total_cases -  total_cases +
                 Brazil 2023-01-03          10726          10707
                 Canada 2024-05-12           1533           1607
          North America 2023-04-11          37001          37075
          South America 2022-06-15             64             63
                  World 2023-04-30          87526          87582
    ~ Column total_cases_per_million (changed data)
-       - Removed values: 3927 / 97705 (4.02%)
           country       date  total_cases_per_million
            Canada 2024-11-18                   46.799
            Canada 2024-11-23                46.793999
            Europe 2024-11-28                37.787998
           Lebanon 2023-05-29                    5.031
          Zimbabwe 2024-11-27                    0.117
        ~ Changed values: 6611 / 97705 (6.77%)
                country       date  total_cases_per_million -  total_cases_per_million +
              Guatemala 2023-06-08                      22.35                  22.405001
          North America 2022-10-27                  56.175999                     56.298
          North America 2023-07-05                  61.807999                      61.93
          United States 2023-01-21                  89.085999                     89.083
          United States 2024-01-13                  93.624001                  93.621002
    ~ Column total_deaths (changed data)
-       - Removed values: 3927 / 97705 (4.02%)
           country       date  total_deaths
            Canada 2024-11-18             0
            Canada 2024-11-23             0
            Europe 2024-11-28             9
           Lebanon 2023-05-29             0
          Zimbabwe 2024-11-27             0
        ~ Changed values: 2536 / 97705 (2.60%)
                country       date  total_deaths -  total_deaths +
                Ecuador 2024-10-22               3               0
          South America 2022-10-08              23              21
                  World 2023-03-01             159             157
                  World 2024-01-31             189             186
                  World 2024-05-05             200             197
    ~ Column total_deaths_per_million (changed data)
-       - Removed values: 3927 / 97705 (4.02%)
           country       date  total_deaths_per_million
            Canada 2024-11-18                       0.0
            Canada 2024-11-23                       0.0
            Europe 2024-11-28                   0.01212
           Lebanon 2023-05-29                       0.0
          Zimbabwe 2024-11-27                       0.0
        ~ Changed values: 2536 / 97705 (2.60%)
                country       date  total_deaths_per_million -  total_deaths_per_million +
                Ecuador 2024-10-22                     0.16273                         0.0
          South America 2022-10-08                     0.05256                     0.04799
                  World 2023-03-01                     0.01982                     0.01957
                  World 2024-01-31                     0.02337                       0.023
                  World 2024-05-05                     0.02467                      0.0243


Legend: +New  ~Modified  -Removed  =Identical  Details
Hint: Run this locally with etl diff REMOTE data/ --include yourdataset --verbose --snippet

Automatically updated datasets matching weekly_wildfires|excess_mortality|covid|fluid|flunet|country_profile|garden/ihme_gbd/2019/gbd_risk are not included

Edited: 2024-12-19 11:23:46 UTC
Execution time: 17.41 seconds

@Marigold Marigold force-pushed the indicator-search branch 3 times, most recently from 7a36a37 to c242521 Compare December 19, 2024 05:56
@Marigold Marigold marked this pull request as ready for review December 19, 2024 05:56
@Marigold Marigold changed the title 🎉 Indicator search 🎉 Indicator semantic search Dec 19, 2024
@Marigold Marigold requested a review from lucasrodes December 19, 2024 06:14
@lucasrodes

I had to merge with master & resolve a couple of complaints from the unit tests.

Other than that, I tried running the Indicator search page locally, and it is taking a while (~1 hour) to generate embeddings:

Batches:   3%|█                                   | 168/5620 [01:46<1:05:10,  1.39it/s]

(from terminal)

Is this expected?

@Marigold

Yes, it's expected. It takes a while to calculate embeddings for all indicators (a couple hundred thousand). It's better to test it on staging unless you want to develop it further.

@lucasrodes lucasrodes left a comment

lgtm! let's try it and see how it behaves

@Marigold Marigold merged commit 72f08ba into master Dec 27, 2024
8 checks passed
@Marigold Marigold deleted the indicator-search branch December 27, 2024 08:43
antea04 pushed a commit that referenced this pull request Feb 5, 2025
* ✨ Speed up initial loading of insight search