Use weighted sampling for Asia builds #1106
Not ready to be merged but opening for initial review and feedback to shape implementation in nextstrain/augur#1454.
We ingest country case counts from OWID in forecasts-ncov and upload them to
5d2c691 to 6b6cd2b
Case counts no longer mean anything and haven't really meant anything since early 2022. We can see this today in the parsed case counts file.

Similarly, South Korea has 135,331 cases listed in your TSV for (2024,20). This is 265 cases per 100k and so a 176-fold difference in reporting rate relative to Afghanistan. If you do subsampling based on reported cases, it will strongly bias included samples toward (wealthier) countries with better surveillance.

For ongoing ncov builds, we should be weighting on population size. I wouldn't worry about admin division population sizes and would only weight on country-level population sizes. I'd start by rolling back 6b6cd2b.

Ah! I see this was already discussed here #1106 (comment). Can you update the PR with what I should be working from for review?
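The per-capita arithmetic in this comment can be checked with a short sketch. The population figure below is an assumption (roughly 51.1 million, chosen because it is consistent with the quoted 265 per 100k); it is not taken from the TSV. The Afghanistan rate is only the value implied by the quoted 176-fold difference, since the underlying counts are not shown here.

```python
def cases_per_100k(cases: int, population: int) -> float:
    """Return reported cases per 100,000 residents."""
    return cases / population * 100_000


# Assumption: ~51.1 million is an illustrative population for South Korea,
# picked to be consistent with the "265 per 100k" figure quoted above.
south_korea_rate = cases_per_100k(135_331, 51_100_000)

# The quoted 176-fold gap implies this reported rate for Afghanistan.
implied_afghanistan_rate = south_korea_rate / 176

print(f"South Korea: {south_korea_rate:.0f} per 100k")
print(f"Afghanistan (implied): {implied_afghanistan_rate:.1f} per 100k")
```

A gap of this size between reporting rates is what would skew case-weighted subsampling toward countries with stronger surveillance.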
Following from this, I worked from 1afb6d7 and tried:
from a freshly updated
and if I drop
https://www.cia.gov/the-world-factbook/field/population/country-comparison/#:~:text=DOWNLOAD-,DATA,-Rank looks potentially cleaner.

Another suggestion here: my preference for this sort of thing is to version the data. It's much easier to think about and it's not something that needs to be updated day-to-day. I'd just run a script to prepare
6b6cd2b to 256a1ff
dedef9d to c3a78e9
Cleaned up the commits and updated trial build links in the PR description. Everything is still draft with a few lingering FIXMEs in the changes. Next week I plan to shift back to nextstrain/augur#1454 and tweak the implementation logic before coming back here.
ae7b62e to 40d2994
40d2994 to 4e5ee67
This looks pretty good to me, @victorlin! Regarding your question:
What can be done to check that the sampling results are appropriate?
The main pattern I'm looking for is that the pie chart sizes for Asian countries in the map view for the GISAID builds are proportional to the actual country populations. The all-time GISAID map view looks consistent with the population sizes. For example, Indonesia has more samples than Japan and Malaysia but fewer than China and India. The Philippines and Vietnam have similar numbers of samples.
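One way to make this check less visual is to compare each country's share of samples with its share of population. The sketch below is only an illustration of that idea: the function names are hypothetical, and the countries and numbers are placeholders rather than real counts from any build.

```python
def share(counts: dict[str, float]) -> dict[str, float]:
    """Convert raw counts into fractions that sum to 1."""
    total = sum(counts.values())
    return {name: value / total for name, value in counts.items()}


def sampling_ratio(
    samples: dict[str, float], populations: dict[str, float]
) -> dict[str, float]:
    """Ratio of sample share to population share for each country.

    A ratio near 1 means the country is sampled in proportion to its
    population; above 1 is over-sampled, below 1 is under-sampled.
    """
    sample_share = share(samples)
    population_share = share(populations)
    return {
        country: sample_share.get(country, 0.0) / population_share[country]
        for country in populations
    }


# Placeholder data, not real counts.
populations = {"A": 100, "B": 50, "C": 50}
samples = {"A": 10, "B": 20, "C": 10}

for country, ratio in sampling_ratio(samples, populations).items():
    print(country, ratio)
```

In this toy input, country A is under-sampled and country B is over-sampled relative to population, which is the kind of mismatch the map view would show as pie charts of the wrong relative size.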
Regarding the workflow implementation, I really like how simple you've made this new option to group by weights. My only minor comment would be that additional documentation could help developers and users better understand how to generate these weights and use them in the workflow. For developers, a small section in the ncov README could be enough. For users, an entry in the config reference guide would be most useful.
Altogether, this is excellent, though! Thank you for the hard work you've put into pushing this along over many many months!
I really like where this is going! However, I ran into an error when trying to run this locally. In this filter step:
If I then run with adding
The missing weights file looks like:
I can see that Myanmar is missing from
I think the behavior of erroring out is maybe not the best, but I can discuss this over in nextstrain/augur#1454.
@trvrb thanks for taking it for a spin, sorry that you ran into another error 😞
I think it's appropriate to discuss here since your scenario is a good example. The erroring behavior is what I intended. The weights file should be comprehensive; otherwise, sequences get dropped due to lack of weights. For this PR, that means The proper fix here is to update the country name mapping in get_population_weights with
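The comprehensiveness requirement described above can be sketched as a pre-flight check. This is an illustration only, not Augur's implementation or its error message; the function names and the toy country lists are hypothetical.

```python
def find_unweighted(metadata_countries, weighted_countries) -> list[str]:
    """Return countries that appear in the metadata but have no weight."""
    return sorted(set(metadata_countries) - set(weighted_countries))


def require_complete_weights(metadata_countries, weighted_countries) -> None:
    """Fail loudly instead of silently dropping unweighted sequences."""
    missing = find_unweighted(metadata_countries, weighted_countries)
    if missing:
        raise ValueError("No weight found for: " + ", ".join(missing))


# Toy inputs: one country in the metadata has no entry in the weights.
metadata_countries = ["India", "Japan", "Myanmar", "India"]
weighted_countries = {"India", "Japan"}

print(find_unweighted(metadata_countries, weighted_countries))
```

Raising on the first incomplete weights file surfaces naming mismatches early, at the cost of blocking a build until the mapping is fixed.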
I think differently. Example scenario: right now we know
I re-ran rebuild-gisaid.yml and it failed with the same error. I suspect that I ran the last trial build before pushing error handling updates so it silently succeeded by dropping all Myanmar sequences. The error message has been improved:
This confirms that Myanmar is the only country missing from
2f02c59 to 6c49c89
Thanks for the feedback @victorlin. I have a couple thoughts:
where I don't want to have to semi-manually make sure to include every possible country as 1, just to be able to specify that I want sampling more intensively from the USA than other countries. In many situations you'll have a category like However, I do totally see why you'd want to require an explicit
My suggestion would be to just drop
This also slightly slims down the very long list of So, scenarios would be
@trvrb thanks for the example use case and suggestion – both were insightful. I've implemented in the Augur PR as nextstrain/augur@a00d3b5. For the example here, the scenarios now look like:
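To illustrate the use case discussed in this thread (sampling one country more intensively without listing every other country with a weight of 1), here is a minimal sketch of a weight lookup with a fallback. It is not the weights-file format or the logic from the Augur PR; the `default` key and the function name are hypothetical and exist only for this example.

```python
def lookup_weight(
    weights: dict[str, float], country: str, default_key: str = "default"
) -> float:
    """Return the weight for a country, falling back to a default entry.

    The "default" key is a hypothetical convention for this sketch.
    If neither the country nor a default is present, raise an error so
    that missing weights are never ignored silently.
    """
    if country in weights:
        return weights[country]
    if default_key in weights:
        return weights[default_key]
    raise KeyError(f"No weight for {country!r} and no default weight given")


# Upweight one country; everything else shares the fallback weight.
weights = {"USA": 5.0, "default": 1.0}

print(lookup_weight(weights, "USA"))
print(lookup_weight(weights, "Canada"))
```

Keeping the error path when no default is supplied preserves the explicit behavior argued for earlier in the thread, while the fallback covers the convenience case.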
Really nice work here. Thanks for humoring me with the requested changes. This looks good to me from the ncov perspective. I'd suggest a trial build just to make sure nothing is broken, and then maybe a second PR to extend weighted sampling to other regions?
Rebuilding the docker image to run trial builds now. Once that's done, I'll update the links in the PR description. I'll also take a look and note if anything seems off compared to the previous run.
I've created an issue to track rollout of weighted sampling in this workflow: #1141
To be used for weighted sampling in a future commit.

I considered the following data sources:
- World Bank Data <https://data.worldbank.org/indicator/SP.POP.TOTL>
- IMF <https://www.imf.org/external/datamapper/NGDP_RPCH@WEO/OEMDC/ADVEC/WEOWORLD>
- CIA The World Factbook <https://www.cia.gov/the-world-factbook/field/population/country-comparison/>
- United Nations World Population Prospects <https://population.un.org/wpp/Download/Standard/CSV/>

The UN data seemed to be the most comprehensive and easy to use.
This replaces the Asia/China/India split with population-based weighted sampling (possible in Augur version 25.3.0). This requires changing the geographical grouping resolution from division to country, but I assume it was only grouped by division in an attempt to have varying group sizes per country, and that population-based weighting is an acceptable replacement.
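As a rough picture of what population-based weighting means for group sizes, the sketch below splits a target number of sequences across countries in proportion to their weights. It is not Augur's algorithm: a deterministic largest-remainder rounding is swapped in here for simplicity, the function name is hypothetical, and the weights are placeholders rather than real population figures.

```python
def allocate(total: int, weights: dict[str, float]) -> dict[str, int]:
    """Split `total` across groups in proportion to their weights.

    Uses largest-remainder rounding so the sizes always sum to `total`.
    """
    weight_sum = sum(weights.values())
    exact = {group: total * w / weight_sum for group, w in weights.items()}
    # Start from the integer part of each exact share.
    sizes = {group: int(value) for group, value in exact.items()}
    leftover = total - sum(sizes.values())
    # Hand out the remaining slots to the largest fractional parts.
    by_remainder = sorted(
        exact, key=lambda group: exact[group] - sizes[group], reverse=True
    )
    for group in by_remainder[:leftover]:
        sizes[group] += 1
    return sizes


# Placeholder weights, not real population figures.
print(allocate(10, {"A": 3, "B": 1, "C": 1}))
```

With country-level weights, a single grouping column is enough to give larger countries larger groups, which is the role the division-level grouping appears to have played before.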
aac80c2 to dc5433f
tracked by #1141
Description of proposed changes
This replaces the Asia/China/India split with population-based weighted sampling (possible with nextstrain/augur#1454).
Previews
Korea
strains ncov-ingest#469. These links are for the last successful build.
Analysis
I think a good comparison is gisaid/asia/all-time before and after weighted sampling. Some notes from that comparison:
Questions for reviewers
Notable comment threads
config.subsampling.<sampling_scheme>.<sample>.group_by_weights
Checklist
Release checklist
If this pull request introduces new features, complete the following steps:
Update docs/src/reference/change_log.md in this pull request to document these changes by the date they were added.