🎉 Indicator semantic search #3628
Quick links (staging server):
Login:
chart-diff: ✅ No charts for review.
data-diff: ❌ Found differences
= Dataset garden/who/2024-09-09/flu_test
= Table flu_test
~ Dim country
- - Removed values: 9 / 72627 (0.01%)
date country
2024-12-09 Canada
2024-11-25 Iran
2024-12-02 Mauritius
2024-12-09 Nepal
2024-11-25 Zambia
~ Dim date
- - Removed values: 9 / 72627 (0.01%)
country date
Canada 2024-12-09
Iran 2024-11-25
Mauritius 2024-12-02
Nepal 2024-12-09
Zambia 2024-11-25
~ Column denomcombined (changed data)
- - Removed values: 9 / 72627 (0.01%)
country date denomcombined
Canada 2024-12-09 31336
Iran 2024-11-25 2734
Mauritius 2024-12-02 80
Nepal 2024-12-09 87
Zambia 2024-11-25 20
~ Changed values: 34 / 72627 (0.05%)
country date denomcombined - denomcombined +
Bahrain 2024-12-02 113 112
Nigeria 2024-01-29 19 18
Nigeria 2024-04-15 6 9
Nigeria 2024-05-06 16 18
Nigeria 2024-09-30 22 24
~ Column pcnt_poscombined (changed data)
- - Removed values: 9 / 72627 (0.01%)
country date pcnt_poscombined
Canada 2024-12-09 3.724151
Iran 2024-11-25 21.360643
Mauritius 2024-12-02 6.25
Nepal 2024-12-09 6.896552
Zambia 2024-11-25 10.0
~ Changed values: 34 / 72627 (0.05%)
country date pcnt_poscombined - pcnt_poscombined +
Bahrain 2024-12-02 19.469027 19.642857
Nigeria 2024-01-29 5.263158 5.555555
Nigeria 2024-04-15 50.0 33.333332
Nigeria 2024-05-06 12.5 11.111111
Nigeria 2024-09-30 9.090909 8.333333
= Dataset garden/who/latest/monkeypox
= Table monkeypox
~ Dim country
- - Removed values: 3927 / 97705 (4.02%)
date country
2024-11-18 Canada
2024-11-23 Canada
2024-11-28 Europe
2023-05-29 Lebanon
2024-11-27 Zimbabwe
~ Dim date
- - Removed values: 3927 / 97705 (4.02%)
country date
Canada 2024-11-18
Canada 2024-11-23
Europe 2024-11-28
Lebanon 2023-05-29
Zimbabwe 2024-11-27
~ Column annotation (changed data)
- - Removed values: 3927 / 97705 (4.02%)
country date annotation
Canada 2024-11-18
Canada 2024-11-23
Europe 2024-11-28
Lebanon 2023-05-29
Zimbabwe 2024-11-27
~ Column iso_code (changed data)
- - Removed values: 3927 / 97705 (4.02%)
country date iso_code
Canada 2024-11-18 CAN
Canada 2024-11-23 CAN
Europe 2024-11-28 OWID_EUR
Lebanon 2023-05-29 LBN
Zimbabwe 2024-11-27 ZWE
~ Changed values: 3 / 97705 (0.00%)
country date iso_code - iso_code +
Burundi 2024-11-13 BDI NaN
Central African Republic 2024-11-13 CAF NaN
Liberia 2024-11-13 LBR NaN
~ Column new_cases (changed data)
- - Removed values: 3927 / 97705 (4.02%)
country date new_cases
Canada 2024-11-18 0
Canada 2024-11-23 0
Europe 2024-11-28 0
Lebanon 2023-05-29 0
Zimbabwe 2024-11-27 0
~ Changed values: 396 / 97705 (0.41%)
country date new_cases - new_cases +
Brazil 2023-09-26 11 12
Brazil 2024-08-27 82 81
World 2022-06-07 235 238
World 2022-11-08 794 795
World 2022-11-15 668 670
~ Column new_cases_per_million (changed data)
- - Removed values: 3927 / 97705 (4.02%)
country date new_cases_per_million
Canada 2024-11-18 0.0
Canada 2024-11-23 0.0
Europe 2024-11-28 0.0
Lebanon 2023-05-29 0.0
Zimbabwe 2024-11-27 0.0
~ Changed values: 342 / 97705 (0.35%)
country date new_cases_per_million - new_cases_per_million +
Africa 2024-03-24 0.046 0.057
Canada 2024-07-23 0.46 0.383
Central African Republic 2024-11-13 0.0 <NA>
United States 2024-09-17 0.129 0.12
United States 2024-09-24 0.167 0.137
~ Column new_cases_smoothed (changed data)
- - Removed values: 3927 / 97705 (4.02%)
country date new_cases_smoothed
Canada 2024-11-18 0.43
Canada 2024-11-23 0.0
Europe 2024-11-28 0.0
Lebanon 2023-05-29 0.0
Zimbabwe 2024-11-27 0.0
~ Changed values: 2425 / 97705 (2.48%)
country date new_cases_smoothed - new_cases_smoothed +
Canada 2024-10-23 1.57 0.0
North America 2022-11-04 63.139999 63.0
World 2022-06-21 262.429993 263.0
World 2022-07-06 535.710022 537.140015
World 2022-09-13 511.570007 510.709991
~ Column new_cases_smoothed_per_million (changed data)
- - Removed values: 3927 / 97705 (4.02%)
country date new_cases_smoothed_per_million
Canada 2024-11-18 0.011
Canada 2024-11-23 0.0
Europe 2024-11-28 0.0
Lebanon 2023-05-29 0.0
Zimbabwe 2024-11-27 0.0
~ Changed values: 1715 / 97705 (1.76%)
country date new_cases_smoothed_per_million - new_cases_smoothed_per_million +
Africa 2024-01-20 0.008 0.007
Canada 2022-07-12 0.486 0.49
Central African Republic 2024-08-03 0.217 0.241
Ghana 2024-10-29 0.004 0.0
North America 2024-09-25 0.018 0.012
~ Column new_deaths (changed data)
- - Removed values: 3927 / 97705 (4.02%)
country date new_deaths
Canada 2024-11-18 0
Canada 2024-11-23 0
Europe 2024-11-28 0
Lebanon 2023-05-29 0
Zimbabwe 2024-11-27 0
~ Changed values: 15 / 97705 (0.02%)
country date new_deaths - new_deaths +
Central African Republic 2024-11-13 0 <NA>
Liberia 2024-11-13 0 <NA>
South America 2022-07-19 1 0
South America 2022-09-20 2 1
World 2024-10-22 1 0
~ Column new_deaths_per_million (changed data)
- - Removed values: 3927 / 97705 (4.02%)
country date new_deaths_per_million
Canada 2024-11-18 0.0
Canada 2024-11-23 0.0
Europe 2024-11-28 0.0
Lebanon 2023-05-29 0.0
Zimbabwe 2024-11-27 0.0
~ Changed values: 15 / 97705 (0.02%)
country date new_deaths_per_million - new_deaths_per_million +
Central African Republic 2024-11-13 0.0 <NA>
Liberia 2024-11-13 0.0 <NA>
South America 2022-07-19 0.00229 0.0
South America 2022-09-20 0.00457 0.00229
World 2024-10-22 0.00012 0.0
~ Column new_deaths_smoothed (changed data)
- - Removed values: 3927 / 97705 (4.02%)
country date new_deaths_smoothed
Canada 2024-11-18 0.0
Canada 2024-11-23 0.0
Europe 2024-11-28 0.0
Lebanon 2023-05-29 0.0
Zimbabwe 2024-11-27 0.0
~ Changed values: 87 / 97705 (0.09%)
country date new_deaths_smoothed - new_deaths_smoothed +
Ecuador 2022-07-19 0.14 0.0
Ecuador 2022-09-24 0.14 0.0
South America 2022-07-22 0.14 0.0
South America 2023-06-28 0.14 0.0
World 2022-09-20 1.57 1.43
~ Column new_deaths_smoothed_per_million (changed data)
- - Removed values: 3927 / 97705 (4.02%)
country date new_deaths_smoothed_per_million
Canada 2024-11-18 0.0
Canada 2024-11-23 0.0
Europe 2024-11-28 0.0
Lebanon 2023-05-29 0.0
Zimbabwe 2024-11-27 0.0
~ Changed values: 87 / 97705 (0.09%)
country date new_deaths_smoothed_per_million - new_deaths_smoothed_per_million +
Ecuador 2022-07-19 0.00777 0.0
Ecuador 2022-09-24 0.00776 0.0
South America 2022-07-22 0.00032 0.0
South America 2023-06-28 0.00032 0.0
World 2022-09-20 0.0002 0.00018
~ Column total_cases (changed data)
- - Removed values: 3927 / 97705 (4.02%)
country date total_cases
Canada 2024-11-18 1836
Canada 2024-11-23 1836
Europe 2024-11-28 28064
Lebanon 2023-05-29 27
Zimbabwe 2024-11-27 2
~ Changed values: 6786 / 97705 (6.95%)
country date total_cases - total_cases +
Brazil 2023-01-03 10726 10707
Canada 2024-05-12 1533 1607
North America 2023-04-11 37001 37075
South America 2022-06-15 64 63
World 2023-04-30 87526 87582
~ Column total_cases_per_million (changed data)
- - Removed values: 3927 / 97705 (4.02%)
country date total_cases_per_million
Canada 2024-11-18 46.799
Canada 2024-11-23 46.793999
Europe 2024-11-28 37.787998
Lebanon 2023-05-29 5.031
Zimbabwe 2024-11-27 0.117
~ Changed values: 6611 / 97705 (6.77%)
country date total_cases_per_million - total_cases_per_million +
Guatemala 2023-06-08 22.35 22.405001
North America 2022-10-27 56.175999 56.298
North America 2023-07-05 61.807999 61.93
United States 2023-01-21 89.085999 89.083
United States 2024-01-13 93.624001 93.621002
~ Column total_deaths (changed data)
- - Removed values: 3927 / 97705 (4.02%)
country date total_deaths
Canada 2024-11-18 0
Canada 2024-11-23 0
Europe 2024-11-28 9
Lebanon 2023-05-29 0
Zimbabwe 2024-11-27 0
~ Changed values: 2536 / 97705 (2.60%)
country date total_deaths - total_deaths +
Ecuador 2024-10-22 3 0
South America 2022-10-08 23 21
World 2023-03-01 159 157
World 2024-01-31 189 186
World 2024-05-05 200 197
~ Column total_deaths_per_million (changed data)
- - Removed values: 3927 / 97705 (4.02%)
country date total_deaths_per_million
Canada 2024-11-18 0.0
Canada 2024-11-23 0.0
Europe 2024-11-28 0.01212
Lebanon 2023-05-29 0.0
Zimbabwe 2024-11-27 0.0
~ Changed values: 2536 / 97705 (2.60%)
country date total_deaths_per_million - total_deaths_per_million +
Ecuador 2024-10-22 0.16273 0.0
South America 2022-10-08 0.05256 0.04799
World 2023-03-01 0.01982 0.01957
World 2024-01-31 0.02337 0.023
World 2024-05-05 0.02467 0.0243
Legend: +New ~Modified -Removed =Identical
Hint: Run this locally with etl diff REMOTE data/ --include yourdataset --verbose --snippet
Automatically updated datasets matching weekly_wildfires|excess_mortality|covid|fluid|flunet|country_profile|garden/ihme_gbd/2019/gbd_risk are not included.
Edited: 2024-12-19 11:23:46 UTC
I had to merge with master and resolve a couple of complaints from the unit tests. Other than that, I tried running the
(from terminal) Is this expected?
Yes, it's expected. It takes a while to calculate embeddings for all indicators (a couple hundred thousand). It's better to test it on staging unless you want to develop it further.
lgtm! let's try it and see how it behaves
* ✨ Speed up initial loading of insight search
Prototype for indicator semantic search. This was originally motivated as a proof of concept that EmbeddingsModel can handle all indicators in memory, so we don't have to resort to more complex infrastructure (like faiss for in-memory similarity search or a vector database like milvus). This turned out to be fine, since similarity search takes an acceptable <1s.
Whether semantic search over indicators is that useful is questionable. It seems that the lowest-hanging fruit lies in improving the existing search in Admin. This is why I've decided to park it for now and didn't spend too much time making it pretty and really accurate. Still, it's a good showcase of what semantic search has to offer and is worth merging, in my opinion.
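For readers unfamiliar with the "everything in memory" approach mentioned above, here is a minimal sketch of brute-force cosine similarity search. It is not the code in this PR and does not use EmbeddingsModel; the function names, indicator ids, and 3-dimensional vectors are all hypothetical, and real embeddings would come from an embedding model.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query, index, k=3):
    """Return the k (score, key) pairs most similar to the query.

    `index` maps a hypothetical indicator id to its pre-normalized embedding.
    Every entry is scanned; there is no approximate index or vector database.
    """
    q = normalize(query)
    scored = ((sum(a * b for a, b in zip(q, vec)), key) for key, vec in index.items())
    return heapq.nlargest(k, scored)


# Toy 3-dimensional embeddings (assumption: made up for illustration only).
index = {
    "gdp_per_capita": normalize([0.9, 0.1, 0.0]),
    "life_expectancy": normalize([0.1, 0.9, 0.2]),
    "child_mortality": normalize([0.0, 0.8, 0.6]),
}

results = top_k([0.0, 1.0, 0.3], index, k=2)
for score, key in results:
    print(f"{key}: {score:.3f}")
```

A full scan like this costs one dot product per indicator per query, which is why it can stay fast enough without an approximate index as long as all vectors fit in memory.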
Embeddings are created from indicator title and description (concatenated). It works decently, but there are some issues (see problematic queries at the end). There are a couple of knobs to make it prettier (well, the UI is still crap... feedback would be much appreciated!). I plan to create a demo and let others know about it in January.
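As an illustration of "title and description (concatenated)", the text passed to the embedding model could be built roughly like this. This is a sketch under assumptions, not the PR's actual code: the Indicator record, the embedding_text helper, and the single-space separator are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Indicator:
    # Hypothetical minimal record; the real indicator model has many more fields.
    title: str
    description: Optional[str] = None


def embedding_text(indicator: Indicator, sep: str = " ") -> str:
    """Concatenate title and description into the string that gets embedded."""
    parts = [indicator.title.strip(), (indicator.description or "").strip()]
    # Skip empty parts so a missing description does not leave a trailing separator.
    return sep.join(p for p in parts if p)


indicators = [
    Indicator("Life expectancy", "Period life expectancy at birth."),
    Indicator("GDP per capita"),  # no description: only the title is embedded
]
texts = [embedding_text(ind) for ind in indicators]
```

Each string in texts would then be encoded once by the embedding model and the resulting vectors kept for search.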
How to review
Running this locally requires calculating embeddings for all indicators, which can be time-consuming. I recommend checking out the app on the staging server and trying a few queries.
Problematic queries