Update subsampling #1074

trvrb · 2023-06-15T00:32:44Z

Description of proposed changes

In the current "global" analyses, treating China and India each as just another country in Asia was resulting in much smaller per-capita sampling rates relative to most other countries. For example, in the current gisaid/global/6m tree we have 66 viruses from Guatemala (population 17M), 62 viruses from Costa Rica (population 5M), 18 viruses from India (population 1400M) and 21 viruses from China (population 1400M). This is a ~1000-fold difference in per-capita sampling intensity.
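As a quick check of the ~1000-fold figure, here is a minimal sketch that recomputes per-capita sampling from the counts and populations quoted above. The dictionary layout and the helper name `per_million` are illustrative only and are not part of the workflow.

```python
# Counts and populations (in millions) quoted above for the gisaid/global/6m tree.
samples = {
    "Guatemala": (66, 17),
    "Costa Rica": (62, 5),
    "India": (18, 1400),
    "China": (21, 1400),
}


def per_million(count, population_millions):
    """Sampled viruses per million residents (illustrative helper)."""
    return count / population_millions


rates = {country: per_million(*values) for country, values in samples.items()}
fold = max(rates.values()) / min(rates.values())

# Print countries from most to least intensively sampled.
for country, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{country}: {rate:.3f} per million")
print(f"highest / lowest: {fold:.0f}-fold")
```

The largest gap is between Costa Rica and India, which is where the roughly 1000-fold difference comes from.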

This PR partially addresses this issue by splitting out China and India into their own buckets when subsampling in the global build targets. This results in buckets of North America (580M), South America (420M), Europe (750M), Africa (1.2B), Oceania (44M), India (1.4B), China (1.4B) and Asia minus India and China (1.8B). Additionally, this commit makes a small correction to reduce Oceania's target count to 20% of that of the other regions, down from the previous 33%.

Within the builds that focus on region=Asia there is currently less intensive per-capita sampling in China and India relative to other countries in Asia. For example, the current gisaid/asia/6m tree has 144 viruses from China (population 1.4B), 96 viruses from India (population 1.4B), 118 viruses from Thailand (population 70M) and 53 viruses from Laos (population 7M). This is a ~100-fold difference in sampling intensity between Laos and India.

This PR splits Asia-focused builds to have 4 geographic buckets rather than the previous 2, arriving at China, India, Asia (minus China and India) and global context. This won't fully address differential per-capita sampling intensity in Asia, but is a simple addition that should go a long way.
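To make the four-bucket split concrete, here is a minimal sketch of how a metadata record could be assigned to a bucket. The function name, bucket labels, and example strain names are hypothetical; the workflow itself expresses these buckets as subsampling entries in builds.yaml rather than as Python code.

```python
from collections import Counter


def asia_build_bucket(region: str, country: str) -> str:
    """Return a hypothetical bucket label for one metadata record."""
    if country == "China":
        return "china"
    if country == "India":
        return "india"
    if region == "Asia":
        return "asia_other"      # Asia minus China and India
    return "global_context"      # everything outside Asia


# Made-up records, for illustration only.
records = [
    {"strain": "example/1", "region": "Asia", "country": "China"},
    {"strain": "example/2", "region": "Asia", "country": "India"},
    {"strain": "example/3", "region": "Asia", "country": "Thailand"},
    {"strain": "example/4", "region": "Asia", "country": "Laos"},
    {"strain": "example/5", "region": "North America", "country": "Guatemala"},
]

bucket_counts = Counter(asia_build_bucket(r["region"], r["country"]) for r in records)
print(dict(bucket_counts))
```

Each bucket can then be given its own sequence target, so that China and India no longer compete with smaller countries for the same Asia allocation.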

Testing

Trial runs for this PR exist at:

You can see that the global map looks improved:

where the 6m build has 412 viruses from China and 417 viruses from India out of a total of 3197.

If we compare regions here, we have:

North America: 516 in 580M = 0.9 per million
South America: 534 in 420M = 1.3 per million
Europe: 499 in 750M = 0.7 per million
Africa: 377 in 1.2B = 0.3 per million
Oceania: 116 in 44M = 2.6 per million
Asia minus China and India: 326 in 1.8B = 0.2 per million
China: 412 in 1.4B = 0.3 per million
India: 417 in 1.4B = 0.3 per million

The builds.yaml asks for more viruses from Asia (minus China and India) than from China or India, so the lower count there should reflect a lack of available sequences.
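The per-million figures in the regional comparison above can be reproduced with a short script. This is only a sketch for checking the arithmetic; the variable names are illustrative, and the counts and populations are the ones listed above.

```python
# (sampled viruses in the 6m trial build, population in millions)
trial = {
    "North America": (516, 580),
    "South America": (534, 420),
    "Europe": (499, 750),
    "Africa": (377, 1200),
    "Oceania": (116, 44),
    "Asia minus China and India": (326, 1800),
    "China": (412, 1400),
    "India": (417, 1400),
}

total = sum(count for count, _ in trial.values())
rates = {region: round(count / pop, 1) for region, (count, pop) in trial.items()}

for region, rate in rates.items():
    print(f"{region}: {rate} per million")
print(f"total viruses: {total}")
```

The regional counts add up to the 3197 total quoted for the 6m build.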

The map from the asia build also looks improved:

with 897 viruses from China, 688 viruses from India and 1076 viruses from elsewhere in Asia.

Release checklist

If this pull request introduces new features, complete the following steps:

Update docs/src/reference/change_log.md in this pull request to document these changes by the date they were added.

trvrb · 2023-06-15T00:36:29Z

@corneliusroemer: Let me know your opinion here. I don't think I'd like to make this any more complicated than what the PR has done to break out China and India. However, if you think it's better to further balance region target numbers from the basically even targets across regions, I could certainly do so.

This would likely slightly downweight North America and South America, further downweight Oceania and upweight Africa, Asia, China and India.

trvrb · 2023-06-15T19:15:31Z

Actually... I decided to go ahead and tweak the regional target max_sequences to try to bring the per-capita sample counts in the nextstrain_global subsampling schemes to be more similar across regions. I've kicked off a trial build that should populate https://nextstrain.org/staging/ncov/gisaid/trial/update-subsampling-v2/global/6m shortly.
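For intuition about which direction such tweaks push each region, here is a minimal sketch of a strictly population-proportional allocation. This is an assumption for illustration, not the method used in this PR, where the max_sequences targets were adjusted by hand. The total budget of 4000 and the function name `proportional_targets` are hypothetical; the populations are the ones quoted in the PR description.

```python
# Bucket populations in millions, as quoted in the PR description.
populations = {
    "North America": 580,
    "South America": 420,
    "Europe": 750,
    "Africa": 1200,
    "Oceania": 44,
    "Asia minus China and India": 1800,
    "China": 1400,
    "India": 1400,
}


def proportional_targets(populations, budget):
    """Split a total sequence budget in proportion to population (illustrative)."""
    total_population = sum(populations.values())
    return {
        region: round(budget * pop / total_population)
        for region, pop in populations.items()
    }


# Hypothetical total budget; rounding means the targets may not sum to it exactly.
targets = proportional_targets(populations, budget=4000)
for region, target in targets.items():
    print(f"{region}: {target}")
```

Realized counts can still fall short of any target where too few sequences are available, so targets like these describe an upper bound rather than the final composition of a build.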

trvrb · 2023-06-16T00:58:00Z

Here's the global map from https://nextstrain.org/staging/ncov/gisaid/trial/update-subsampling-v2/global/6m:

If we compare regions here, we have:

North America: 433 in 580M = 0.7 per million
South America: 404 in 420M = 1.0 per million
Europe: 491 in 750M = 0.7 per million
Africa: 461 in 1.2B = 0.4 per million
Oceania: 56 in 44M = 1.3 per million
Asia minus China and India: 677 in 1.8B = 0.4 per million
China: 723 in 1.4B = 0.5 per million
India: 588 in 1.4B = 0.4 per million

I think I'm really pretty happy with this at this point, but of course let me know if you think otherwise.
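One way to summarize the improvement is the ratio between the most and least intensively sampled regions in each trial. The sketch below computes that ratio from the two regional comparisons in this thread; the `spread` helper and the `v1`/`v2` names are illustrative, not part of the workflow.

```python
# Bucket populations in millions, as quoted in the PR description.
populations = {
    "North America": 580, "South America": 420, "Europe": 750, "Africa": 1200,
    "Oceania": 44, "Asia minus China and India": 1800, "China": 1400, "India": 1400,
}

# Regional counts from the first trial and from the update-subsampling-v2 trial.
v1 = {
    "North America": 516, "South America": 534, "Europe": 499, "Africa": 377,
    "Oceania": 116, "Asia minus China and India": 326, "China": 412, "India": 417,
}
v2 = {
    "North America": 433, "South America": 404, "Europe": 491, "Africa": 461,
    "Oceania": 56, "Asia minus China and India": 677, "China": 723, "India": 588,
}


def spread(counts):
    """Ratio of the highest to the lowest per-million sampling rate."""
    rates = [counts[region] / populations[region] for region in populations]
    return max(rates) / min(rates)


spread_v1 = spread(v1)
spread_v2 = spread(v2)
print(f"first trial: {spread_v1:.1f}-fold, v2 trial: {spread_v2:.1f}-fold")
```

By this measure the gap between regions narrows from roughly 15-fold in the first trial to under 4-fold in the v2 trial.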

corneliusroemer · 2023-06-16T01:01:30Z

Ah this looks very good @trvrb. There's still scope for improvement (Brazil/Russia/Indonesia/Malaysia too small), but it's much closer to ideal than anything we've had in the past.

corneliusroemer

Great result! It's a lot of code; that must have been a lot of effort.
I hope I can offer a much simpler population-weighted sampling script in the near future. I've been thinking about it a lot but haven't started coding.

trvrb · 2023-06-16T01:08:44Z

Thanks @corneliusroemer! (Code was basically copy paste after tuning one example.) I noticed Brazil and Indonesia as well. We also have small countries with lots of admin 1 divisions that end up sampled more heavily than they should be (like Costa Rica with 33 samples in 5M for 6.6 per million). I agree to stop here with this specific PR. I would really like to consider more systematic population weighting in the future via augur subsample or the like. A bonus would be if it could be more performant than repeatedly running augur filter as we do currently.

trvrb added 2 commits June 14, 2023 15:55

trvrb requested a review from corneliusroemer June 15, 2023 00:32

Tweak global subsampling targets

0da5ed3

This commit slightly updates targets for the `nextstrain_global` subsampling schemes in an attempt to bring realized per-capita sample counts more in line with population size.

corneliusroemer approved these changes Jun 16, 2023

View reviewed changes

Update change log

bbc9b0a

trvrb merged commit 7b1ed00 into master Jun 17, 2023

trvrb deleted the update-subsampling branch June 17, 2023 20:00

victorlin mentioned this pull request Sep 19, 2023

Allow weighted subsampling nextstrain/augur#1318

Closed

5 tasks

victorlin mentioned this pull request Aug 14, 2024

Use weighted sampling #1141

Closed

4 tasks
