
Use weighted sampling for Asia builds #1106

Merged (3 commits into master, Aug 22, 2024)
Conversation

@victorlin (Member) commented May 3, 2024

tracked by #1141

Description of proposed changes

This replaces the Asia/China/India split with population-based weighted sampling (possible with nextstrain/augur#1454).
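
As a rough illustration of the mechanism (the file contents and numbers below are placeholders, not the data in this PR; the real weights are generated by scripts/developer_scripts/get_population_weights into defaults/population_weights.tsv), the weights file is a TSV with country and weight columns, where each weight is proportional to the country's population, and augur filter consumes it via --group-by-weights:

import csv

# Hypothetical sketch only: write a country -> weight TSV where weights are
# proportional to population size. Populations here are rough placeholders.
populations = {
    "China": 1_410_000_000,
    "India": 1_410_000_000,
    "Indonesia": 276_000_000,
    "Maldives": 520_000,
}

with open("population_weights.tsv", "w", newline="") as tsv:
    writer = csv.writer(tsv, delimiter="\t")
    writer.writerow(["country", "weight"])
    for country, population in populations.items():
        # `augur filter --group-by country --group-by-weights population_weights.tsv`
        # allocates the subsample across countries in proportion to these weights.
        writer.writerow([country, population])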

Previews

Analysis

I think a good comparison is gisaid/asia/all-time before and after weighted sampling. Some notes from that comparison:

  1. With weighted sampling, more sequences are included for large countries such as Indonesia, which did not get the same treatment as China and India prior to this PR.
  2. With weighted sampling, there is a high chance that no sequences are included for small countries such as Maldives.
  3. The sampling difference results in some subtle but maybe significant differences in the frequencies panel.

Questions for reviewers

  1. What can be done to check that the sampling results are appropriate?
  2. ✅ Is it a good idea to weight on population size? I was thinking per-country case counts over time might be better, but not sure where to get that data. (population size yes, case counts no)

Notable comment threads

Checklist

Release checklist

If this pull request introduces new features, complete the following steps:

  • Update docs/src/reference/change_log.md in this pull request to document these changes by the date they were added.

@victorlin self-assigned this May 3, 2024
@victorlin (Member Author)

Not ready to be merged but opening for initial review and feedback to shape implementation in nextstrain/augur#1454.

@victorlin marked this pull request as ready for review May 8, 2024 17:42
@genehack previously approved these changes May 8, 2024
@jameshadfield marked this pull request as draft May 8, 2024 22:08
@joverlee521 (Contributor)

I was thinking per-country case counts over time might be better, but not sure where to get that data.

We ingest country case counts from OWID in forecasts-ncov and upload them to s3://nextstrain-data/files/workflows/forecasts-ncov/cases/global.tsv.gz.
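
For anyone wanting to poke at those numbers, a minimal sketch of loading that file follows (it assumes the usual public https mirror of the nextstrain-data bucket; the URL and the column layout are assumptions, not confirmed here):

import pandas as pd

# Assumed public mirror of s3://nextstrain-data/files/workflows/forecasts-ncov/cases/global.tsv.gz
url = "https://data.nextstrain.org/files/workflows/forecasts-ncov/cases/global.tsv.gz"

# pandas infers gzip compression from the .gz extension and can read directly from a URL.
cases = pd.read_csv(url, sep="\t")
print(cases.head())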

@victorlin force-pushed the victorlin/weighted-sampling branch 2 times, most recently from 5d2c691 to 6b6cd2b on May 10, 2024 18:45
@victorlin changed the title from "Use population-based weighted sampling for Asia builds" to "Use weighted sampling for Asia builds" May 10, 2024
@trvrb (Member) commented Jun 6, 2024

Is it a good idea to weight on population size? I was thinking per-country case counts over time might be better, but not sure where to get that data.

Case counts no longer mean anything and haven't really meant anything since early 2022. We can see this today in the parsed case counts file country_week_weights.tsv, where Afghanistan, with a population of 41M, reported 638 cases in the week of (2024, 20), while Brunei, with a population of 500k, reported 175 cases in the same time period. This is 1.5 cases per 100k in Afghanistan and 35 cases per 100k in Brunei. This 23-fold difference is going to be almost entirely due to differences in reporting rates.

Similarly, South Korea has 135,331 cases listed in your TSV for (2024,20). This is 265 cases per 100k and so a 176-fold difference in reporting rate relative to Afghanistan.

If you do subsampling based on reported cases it will strongly bias included samples to (wealthier) countries with better surveillance.
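
For concreteness, the per-100k rates quoted above follow from a quick back-of-the-envelope calculation (populations are the approximate figures mentioned in this comment; South Korea's ~51M is a round number added here for illustration):

cases = {"Afghanistan": 638, "Brunei": 175, "South Korea": 135_331}
population = {"Afghanistan": 41_000_000, "Brunei": 500_000, "South Korea": 51_000_000}

for country in cases:
    per_100k = cases[country] / population[country] * 100_000
    print(f"{country}: {per_100k:.1f} cases per 100k reported in week (2024, 20)")

# Roughly 1.5-1.6, 35, and 265 cases per 100k respectively; the spread reflects
# reporting-rate differences rather than true incidence.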

For ongoing ncov builds, we should be weighting on population size. I wouldn't worry about admin division population size and would just worry about country-level population sizes in terms of weighting.

I'd start by rolling back 6b6cd2b. Ah! I see this was already discussed here: #1106 (comment). Can you update the PR with what I should be working from for review?

@trvrb (Member) commented Jun 6, 2024

Following from this, I worked from 1afb6d7 and tried:

python3 scripts/get_population_sizes.py --output data/country_population_weights.tsv

from a freshly updated nextstrain shell command and got the following error

Traceback (most recent call last):
  File "/nextstrain/build/scripts/get_population_sizes.py", line 67, in <module>
    export_imf(args.output)
  File "/nextstrain/build/scripts/get_population_sizes.py", line 13, in export_imf
    workbook = xlrd.open_workbook(file_contents=excel_contents, ignore_workbook_corruption=True)
TypeError: open_workbook() got an unexpected keyword argument 'ignore_workbook_corruption'

and if I drop ignore_workbook_corruption=True on line 13 of the script, I get an error about a corrupted workbook:

xlrd.compdoc.CompDocError: Workbook corruption: seen[2] == 4

https://www.cia.gov/the-world-factbook/field/population/country-comparison/#:~:text=DOWNLOAD-,DATA,-Rank looks potentially cleaner.

Another suggestion here: my preference for this sort of thing is to version the data. It's much easier to think about and it's not something that needs to be updated day-to-day. I'd just run a script to prepare country_population_weights.tsv once and not have it be part of the workflow. I'd version this to defaults/country_population_weights.tsv. This particular file is very similar to defaults/mutational_fitness_distance_map. In that case I had used scripts/developer_scripts/parse_mutational_fitness_tsv_into_distance_map, with the developer_scripts/ placement to signal that this is not part of the regular workflow.

@victorlin force-pushed the victorlin/weighted-sampling branch from 6b6cd2b to 256a1ff on June 6, 2024 22:46
@victorlin (Member Author)

I've dropped 6b6cd2b and updated f038890 to use The World Factbook as the source of population data. I've also moved the script and data files into version control (great suggestion).

I'm going to run this through the trial builds to check if the updated data source contains the right country names.

@victorlin force-pushed the victorlin/weighted-sampling branch from dedef9d to c3a78e9 on June 7, 2024 23:56
@victorlin (Member Author)

Cleaned up the commits and updated trial build links in the PR description. Everything is still draft with a few lingering FIXMEs in the changes. Next week I plan to shift back to nextstrain/augur#1454 and tweak the implementation logic before coming back here.

@victorlin force-pushed the victorlin/weighted-sampling branch 2 times, most recently from ae7b62e to 40d2994 on July 2, 2024 00:19
@victorlin force-pushed the victorlin/weighted-sampling branch from 40d2994 to 4e5ee67 on July 17, 2024 19:48
@victorlin marked this pull request as ready for review July 17, 2024 19:50
@huddlej (Contributor) left a comment

This looks pretty good to me, @victorlin! Regarding your question:

What can be done to check that the sampling results are appropriate?

The main pattern I'm looking for is that the pie chart sizes for Asian countries in the map view for the GISAID builds are proportional to the actual country populations. The all-time GISAID map view looks consistent with the population sizes. For example, Indonesia has more samples than Japan and Malaysia but fewer than China and India. The Philippines and Vietnam have similar numbers of samples.

Regarding the workflow implementation, I really like how simple you've made this new option to group by weights. My only minor comment would be that additional documentation could help developers and users better understand how to generate these weights and use them in the workflow. For developers, a small section in the ncov README could be enough. For users, an entry in the config reference guide would be most useful.

Altogether, this is excellent, though! Thank you for the hard work you've put into pushing this along over many many months!

Resolved review threads (outdated): scripts/developer_scripts/get_population_weights, defaults/population_weights.tsv
@trvrb (Member) commented Aug 2, 2024

I really like where this is going! However, I ran into an error when trying to run this locally. In this filter step:

augur filter --metadata nextstrain-ncov-private/100k/metadata.tsv.xz \
  --include defaults/include.txt \
  --exclude defaults/exclude.txt \
  --max-date 6M \
  --exclude-where 'region!=Asia' \
  --group-by country year month \
  --group-by-weights defaults/population_weights.tsv \
  --subsample-max-sequences 700 \
  --output-strains results/asia_6m/sample-asia_early.txt
WARNING: Asked to provide at most 700 sequences, but there are 1601 groups.
Sampling with weights defined by defaults/population_weights.tsv.
NOTE: Skipping 101242 groups due to lack of entries in metadata.
NOTE: Weights were not provided for the columns 'year', 'month'. Using equal weights across values in those columns.
ERROR: 27 groups appear in the metadata but are missing from the weights file. Re-run with --output-group-by-missing-weights to continue.

If I then re-run with --output-group-by-missing-weights missing-weights.txt added, it completes successfully with this standard output:

WARNING: Asked to provide at most 700 sequences, but there are 1601 groups.
Sampling with weights defined by defaults/population_weights.tsv.
NOTE: Skipping 101242 groups due to lack of entries in metadata.
NOTE: Weights were not provided for the columns 'month', 'year'. Using equal weights across values in those columns.
WARNING: 27 groups appear in the metadata but are missing from the weights file. Sequences from these groups will be dropped.
All missing groups added to 'missing-weights.txt'.
91656 strains were dropped during filtering
	69341 were dropped because of 'region!=Asia'
	4791 were dropped because they were later than 2024.09 or missing a date
	1 was added back because it was in defaults/include.txt
	17525 were dropped because of subsampling criteria
695 strains passed all filters

The missing weights file looks like:

country	year	month	weight
Myanmar	2020	(2020, 4)	
Myanmar	2020	(2020, 8)	
Myanmar	2021	(2021, 1)	

I can see that Myanmar is missing from population_weights.tsv. Two things:

  1. I think that ncov needs to have --output-group-by-missing-weights
  2. Double check entries in population_weights.tsv to make sure all countries in our metadata are present.

I think the behavior of erroring out is maybe not the best, but I can discuss this over in nextstrain/augur#1454.

@victorlin (Member Author)

@trvrb thanks for taking it for a spin, sorry that you ran into another error 😞

the behavior of erroring out is maybe not the best, but I can discuss this over in nextstrain/augur#1454.

I think it's appropriate to discuss here since your scenario is a good example.

The erroring behavior is what I intended. The weights file should be comprehensive, otherwise sequences get dropped due to lack of weights.

For this PR, that means population_weights.tsv should have weights for everything under region=Asia. I assumed running rebuild-gisaid.yml would be sufficient to test this, but that didn't include sequences from Myanmar (not sure why, I'll investigate separately).

The proper fix here is to update the country name mapping in get_population_weights with "Burma": "Myanmar" (your 2nd point).
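
A hedged sketch of what that kind of name-mapping fix could look like (the dictionary and function names here are hypothetical, not the actual diff; the real change belongs in scripts/developer_scripts/get_population_weights):

# Hypothetical illustration: map source-data country names to the names used
# in Nextstrain metadata before writing the weights file.
COUNTRY_NAME_FIXES = {
    "Burma": "Myanmar",
    # other source-name -> metadata-name corrections would go here
}

def normalize_country(name):
    """Return the metadata-style country name for a source-data country name."""
    return COUNTRY_NAME_FIXES.get(name, name)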

I think that ncov needs to have --output-group-by-missing-weights

I think differently. --output-group-by-missing-weights is meant more as an immediate workaround rather than something to use by default. If the ncov workflow uses this option by default, it's less likely that we'd catch any discrepancies since WARNINGs in output are not something we go back and check regularly.

Example scenario: right now we know population_weights.tsv is comprehensive, but sometime in the future new sequences are added under a country name that differs from what's in population_weights.tsv. This would still produce "successful" automated builds and one would only notice that those new sequences are being dropped if the Snakemake logs are inspected.

@victorlin (Member Author)

I assumed running rebuild-gisaid.yml would be sufficient to test this, but that didn't include sequences from Myanmar (not sure why, I'll investigate separately).

I re-ran rebuild-gisaid.yml and it failed with the same error. I suspect that I ran the last trial build before pushing error handling updates so it silently succeeded by dropping all Myanmar sequences.

The error message has been improved:

ERROR: The input metadata contains these values under the following columns that are not covered by 'defaults/population_weights.tsv':
- 'country': ['Myanmar']
Re-run with --output-group-by-missing-weights to continue.

This confirms that Myanmar is the only country missing from population_weights.tsv. I'll push a fix to add the Myanmar weight, re-run the trial build, and update the preview links in the PR description.

@victorlin force-pushed the victorlin/weighted-sampling branch 3 times, most recently from 2f02c59 to 6c49c89 on August 7, 2024 18:23
@trvrb (Member) commented Aug 8, 2024

Thanks for the feedback @victorlin. I have a couple thoughts:

  1. In a decent fraction of scenarios, I believe I'm going to want a "default" value for population_weights.tsv. As a simple example, I might want to specify --group-by country and --group-by-weights weights.tsv with entries of
country weight
USA     100
default 1

where I don't want to have to semi-manually make sure to include every possible country as 1, just to be able to specify that I want sampling more intensively from the USA than other countries. In many situations you'll have a category like host where you won't be sure what exactly will come through and you can't enumerate ahead of time. If we were to add weighted sampling here https://next.nextstrain.org/avian-flu/h5n1-cattle-outbreak/genome we'd run into random new hosts continually breaking the build. I'd want a simple default to avoid this erroring.

However, I do totally see why you'd want to require an explicit default value rather than having an implicit default weight for entries in metadata that aren't present in weights.tsv.

  2. It feels a little strange to tie "don't error out on missing values" to --output-group-by-missing-weights. They seem semantically distinct to me. Given the option of not erroring out by including --output-group-by-missing-weights missing-weights.txt, I would want to always include this option so that the build wouldn't error out (rather than using it as a one-off strategy to find and fix missing values).

My suggestion would be to just drop --output-group-by-missing-weights as an option. If there are missing values they should always be printed. I really like:

ERROR: The input metadata contains these values under the following columns that are not covered by 'defaults/population_weights.tsv':
- 'country': ['Myanmar']

This also slightly slims down the very long list of augur filter command line arguments; there are already a bunch of --output-* options. It also saves the user from having to re-run the command with an additional argument.

So, the scenarios would be (see the sketch after this list):

  • Missing values in weights.tsv with no default set – Error out, print to standard out the missing values that need to be specified
  • Missing values in weights.tsv with default set – Proceed, but still print to standard out the missing values
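
A rough sketch of those two scenarios in code (this is only an illustration of the requested behavior, not Augur's implementation):

def resolve_weights(groups, weights):
    """groups: group values seen in metadata; weights: value -> weight mapping,
    optionally containing a 'default' entry."""
    missing = [g for g in groups if g not in weights]
    if missing and "default" not in weights:
        # No default set: error out and report the values that need weights.
        raise ValueError(f"Missing weights for {missing}; add them or add a 'default' entry.")
    if missing:
        # Default set: proceed, but still report the missing values.
        print(f"WARNING: using default weight {weights['default']} for {missing}")
    return {g: weights.get(g, weights.get("default")) for g in groups}

# With a default, Myanmar proceeds with weight 1 and a warning:
print(resolve_weights(["USA", "Myanmar"], {"USA": 100, "default": 1}))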

@victorlin (Member Author)

@trvrb thanks for the example use case and suggestion – both were insightful. I've implemented this in the Augur PR as nextstrain/augur@a00d3b5.

For the example here, the scenarios now look like:

  • ERROR: The input metadata contains these values under the following columns that are not covered by 'defaults/population_weights.tsv':
    - 'country': ['Myanmar']
    To fix this, either:
    (1) specify weights explicitly - add entries to 'defaults/population_weights.tsv' for the values above, or
    (2) specify a default weight - add an entry to 'defaults/population_weights.tsv' with the value 'default' for all columns
    
  • WARNING: The input metadata contains these values under the following columns that are not directly covered by 'defaults/population_weights.tsv':
    - 'country': ['Myanmar']
    The default weight of 1 will be used for all groups defined by those values.
    ...
    

@trvrb (Member) left a comment

Really nice work here. Thanks for humoring me with the requested changes. This looks good to me from the ncov perspective. I'd suggest a trial build just to make sure nothing is broken. And then maybe a second PR to extend weighted sampling to other regions?

@victorlin mentioned this pull request Aug 14, 2024
@victorlin (Member Author)

I'd suggest a trial build just to make sure nothing is broken.

Rebuilding the docker image to run trial builds now. Once that's done, I'll update the links in the PR description. I'll also take a look and note if anything seems off compared to the previous run.

And then maybe doing a second PR to extend weighted sampling to other regions?

I've created an issue to track rollout of weighted sampling in this workflow: #1141

To be used for weighted sampling in a future commit.

I considered the following data sources:

- World Bank Data <https://data.worldbank.org/indicator/SP.POP.TOTL>
- IMF <https://www.imf.org/external/datamapper/NGDP_RPCH@WEO/OEMDC/ADVEC/WEOWORLD>
- CIA The World Factbook <https://www.cia.gov/the-world-factbook/field/population/country-comparison/>
- United Nations World Population Prospects <https://population.un.org/wpp/Download/Standard/CSV/>

The UN data seemed to be the most comprehensive and easy to use.

This replaces the Asia/China/India split with population-based weighted sampling (possible in Augur version 25.3.0).

This requires changing the geographical grouping resolution from division to country, but I assume it was only grouped by division in an attempt to have varying group sizes per country, and that population-based weighting is an acceptable replacement.
@victorlin merged commit e69c37e into master on Aug 22, 2024 (6 of 7 checks passed)
@victorlin deleted the victorlin/weighted-sampling branch August 22, 2024 18:36
@victorlin mentioned this pull request Aug 26, 2024