Update subsampling #1074

Merged
merged 4 commits into from
Jun 17, 2023

trvrb
Member

@trvrb trvrb commented Jun 15, 2023

Description of proposed changes

In the current "global" analyses, treating China and India each as just another country in Asia was resulting in much smaller per-capita sampling rates relative to most other countries. For example, in the current gisaid/global/6m tree we have 66 viruses from Guatemala (population 17M), 62 viruses from Costa Rica (population 5M), 18 viruses from India (population 1400M) and 21 viruses from China (population 1400M). This is a ~1000-fold difference in per-capita sampling intensity.
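The ~1000-fold figure can be checked directly from the counts and populations quoted above. A minimal sketch in Python; the dictionary and function names are illustrative only and are not part of the ncov workflow:

```python
# Virus counts and populations (in millions) as quoted in the PR description.
samples = {
    "Guatemala": (66, 17),
    "Costa Rica": (62, 5),
    "India": (18, 1400),
    "China": (21, 1400),
}


def per_million(count, population_millions):
    """Return the number of sequences per million people."""
    return count / population_millions


rates = {country: per_million(n, pop) for country, (n, pop) in samples.items()}

# Ratio between the most and least intensively sampled country in this set.
fold = max(rates.values()) / min(rates.values())

for country, rate in rates.items():
    print(f"{country}: {rate:.3f} per million")
print(f"Largest fold difference: {fold:.0f}x")
```

With these four countries the largest gap is between Costa Rica and India, which comes out a little under 1000-fold.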

This PR partially addresses this issue by splitting out China and India into their own buckets when subsampling in the global build targets. This results in buckets of North America (580M), South America (420M), Europe (750M), Africa (1.2B), Oceania (44M), India (1.4B), China (1.4B) and Asia minus India and China (1.8B). Additionally, this commit makes a small correction that reduces Oceania's target count to 20% of that of the other regions, down from the previous 33%.

Within the builds that focus on region=Asia there is currently less intensive per-capita sampling in China and India relative to other countries in Asia. For example, the current gisaid/asia/6m tree has 144 viruses from China (population 1.4B), 96 viruses from India (population 1.4B), 118 viruses from Thailand (population 70M) and 53 viruses from Laos (population 7M). This is a ~100-fold difference in sampling intensity between Laos and India.

This PR splits Asia-focused builds to have 4 geographic buckets rather than the previous 2, arriving at China, India, Asia (minus China and India) and global context. This won't fully address differential per-capita sampling intensity in Asia, but is a simple addition that should go a long way.
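The bucket layout described above, for both the global and the Asia-focused builds, can be pictured as a simple mapping from a sequence's region and country to a bucket name. The sketch below is purely illustrative: `assign_bucket` and `focal_region` are hypothetical names, and the PR itself configures the buckets in builds.yaml rather than in Python.

```python
def assign_bucket(region, country, focal_region=None):
    """Map (region, country) metadata to a subsampling bucket name.

    Hypothetical helper for illustration only. With focal_region=None it
    mimics the global layout (8 buckets); with focal_region="Asia" it
    mimics the Asia-focused layout (4 buckets).
    """
    # China and India always get their own bucket.
    if country in {"China", "India"}:
        return country
    # The rest of Asia is pooled together.
    if region == "Asia":
        return "Asia minus China and India"
    # In a region-focused build, everything outside the focal region
    # becomes global context.
    if focal_region is not None and region != focal_region:
        return "global context"
    return region


records = [
    ("Asia", "China"),
    ("Asia", "India"),
    ("Asia", "Thailand"),
    ("Europe", "Germany"),
    ("Oceania", "Australia"),
]

global_buckets = [assign_bucket(r, c) for r, c in records]
asia_buckets = [assign_bucket(r, c, focal_region="Asia") for r, c in records]

print(global_buckets)
print(asia_buckets)
```

In the global layout Germany and Australia stay in their own regional buckets, while in the Asia-focused layout both fall into the single global context bucket.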

Testing

Trial runs for this PR exist at:

You can see that the global map looks improved:

[Screenshot of the global map, 2023-06-14]

where the 6m build has 412 viruses from China and 417 viruses from India out of a total of 3197.

If we compare regions here, we have:

  • North America: 516 in 580M = 0.9 per million
  • South America: 534 in 420M = 1.3 per million
  • Europe: 499 in 750M = 0.7 per million
  • Africa: 377 in 1.2B = 0.3 per million
  • Oceania: 116 in 44M = 2.6 per million
  • Asia minus China and India: 326 in 1.8B = 0.2 per million
  • China: 412 in 1.4B = 0.3 per million
  • India: 417 in 1.4B = 0.3 per million

The builds.yaml asks for more viruses from Asia (minus China and India) than from China or India, so the lower count there should reflect a lack of available sequences.

The map from the asia build also looks improved:

[Screenshot of the Asia build map, 2023-06-14]

with 897 viruses from China, 688 viruses from India and 1076 viruses from elsewhere in Asia.

Release checklist

If this pull request introduces new features, complete the following steps:

  • Update docs/src/reference/change_log.md in this pull request to document these changes by the date they were added.

In the current "global" analyses, treating China and India each as just another country in Asia was resulting in much smaller per-capita sampling rates. For example, in the current gisaid/global/6m tree we have 66 viruses from Guatemala (population 17M), 62 viruses from Costa Rica (population 5M), 18 viruses from India (population 1400M) and 21 viruses from China (population 1400M). This is a ~1000-fold difference in per-capita sampling intensity.

This commit partially addresses this issue by splitting out China and India into their own buckets when subsampling. This results in buckets of North America (580M), South America (420M), Europe (750M), Africa (1.2B), Oceania (44M), India (1.4B), China (1.4B) and Asia minus India and China (1.8B).

Additionally, this commit makes a small correction to reduce Oceania to 20% region count relative to other regions from previous 33%.

Within the builds that focus on region=Asia there is currently less intensive per-capita sampling in China and India relative to other countries in Asia. For example, the current gisaid/asia/6m tree has 144 viruses from China (population 1.4B), 96 viruses from India (population 1.4B), 118 viruses from Thailand (population 70M) and 53 viruses from Laos (population 7M). This is a ~100-fold difference in sampling intensity between Laos and India.

This commit splits Asia-focused builds to have 4 geographic buckets rather than the previous 2, arriving at China, India, Asia (minus China and India) and global context.

This won't fully address differential per-capita sampling intensity in Asia, but is a simple addition that should go a long way.
@trvrb
Member Author

trvrb commented Jun 15, 2023

@corneliusroemer: Let me know your opinion here. I don't think I'd like to make this any more complicated than what the PR has done to break out China and India. However, if you think it's better to further balance region target numbers from the basically even targets across regions, I could certainly do so.

This would likely slightly downweight North America and South America, further downweight Oceania and upweight Africa, Asia, China and India.
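To see why rebalancing would push in these directions, one can compare the realized counts from the first trial 6m build against a strictly population-proportional split of the same total. This is only a sketch under that assumption: strict proportionality is not what the PR implements, realized counts stand in here for the builds.yaml targets, and `proportional_targets` is a hypothetical helper.

```python
# Bucket populations in millions, as listed in the PR description.
populations = {
    "North America": 580,
    "South America": 420,
    "Europe": 750,
    "Africa": 1200,
    "Oceania": 44,
    "Asia minus China and India": 1800,
    "China": 1400,
    "India": 1400,
}

# Realized counts from the first trial 6m build (total 3197).
realized = {
    "North America": 516,
    "South America": 534,
    "Europe": 499,
    "Africa": 377,
    "Oceania": 116,
    "Asia minus China and India": 326,
    "China": 412,
    "India": 417,
}


def proportional_targets(populations, total):
    """Split `total` sequences across buckets in proportion to population."""
    world = sum(populations.values())
    return {name: round(total * pop / world) for name, pop in populations.items()}


targets = proportional_targets(populations, sum(realized.values()))

for name in populations:
    direction = "up" if targets[name] > realized[name] else "down"
    print(f"{name}: realized {realized[name]}, proportional {targets[name]} ({direction})")
```

Under this assumption North America, South America and Oceania move down while Africa, Asia, China and India move up, which matches the directions described in the comment.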

This commit slightly updates targets for the `nextstrain_global` subsampling schemes in an attempt to bring realized per-capita sample counts more in line with population size.
@trvrb
Member Author

trvrb commented Jun 15, 2023

Actually... I decided to go ahead and tweak the regional target max_sequences to try to bring the per-capita sample counts in the nextstrain_global subsampling schemes to be more similar across regions. I've kicked off a trial build that should populate https://nextstrain.org/staging/ncov/gisaid/trial/update-subsampling-v2/global/6m shortly.

@trvrb
Member Author

trvrb commented Jun 16, 2023

Here's the global map from https://nextstrain.org/staging/ncov/gisaid/trial/update-subsampling-v2/global/6m:

[Screenshot of the global map, 2023-06-15]

If we compare regions here, we have:

  • North America: 433 in 580M = 0.7 per million
  • South America: 404 in 420M = 1.0 per million
  • Europe: 491 in 750M = 0.7 per million
  • Africa: 461 in 1.2B = 0.4 per million
  • Oceania: 56 in 44M = 1.3 per million
  • Asia minus China and India: 677 in 1.8B = 0.4 per million
  • China: 723 in 1.4B = 0.5 per million
  • India: 588 in 1.4B = 0.4 per million
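One way to summarize the effect of the retuned targets is the ratio between the highest and lowest per-capita rate across buckets, before and after. The sketch below uses only the counts and populations quoted in the two lists in this thread; `spread` is an illustrative name, not something defined in the workflow.

```python
# Bucket populations in millions, as listed in the PR description.
populations = {
    "North America": 580,
    "South America": 420,
    "Europe": 750,
    "Africa": 1200,
    "Oceania": 44,
    "Asia minus China and India": 1800,
    "China": 1400,
    "India": 1400,
}

# Counts from the first trial 6m build.
trial_v1 = {
    "North America": 516,
    "South America": 534,
    "Europe": 499,
    "Africa": 377,
    "Oceania": 116,
    "Asia minus China and India": 326,
    "China": 412,
    "India": 417,
}

# Counts from the update-subsampling-v2 trial 6m build.
trial_v2 = {
    "North America": 433,
    "South America": 404,
    "Europe": 491,
    "Africa": 461,
    "Oceania": 56,
    "Asia minus China and India": 677,
    "China": 723,
    "India": 588,
}


def spread(counts):
    """Ratio of the highest to the lowest per-million rate across buckets."""
    rates = [counts[name] / populations[name] for name in populations]
    return max(rates) / min(rates)


print(f"v1 spread: {spread(trial_v1):.1f}x")
print(f"v2 spread: {spread(trial_v2):.1f}x")
```

By this measure the gap between the most and least intensively sampled bucket (Oceania and Asia minus China and India in both trials) shrinks from roughly 15-fold to roughly 3-fold.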

I think I'm really pretty happy with this at this point, but of course let me know if you think otherwise.

@corneliusroemer
Member

Ah this looks very good @trvrb. There's still scope for improvement (Brazil/Russia/Indonesia/Malaysia too small), but it's much closer to ideal than anything we've had in the past.

Member

@corneliusroemer corneliusroemer left a comment

Great result! It's a lot of code, that must have been a lot of effort.
I hope I can offer a much simpler population weighted sampling script in the near future, been thinking about it a lot but haven't started coding.

@trvrb
Member Author

trvrb commented Jun 16, 2023

Thanks @corneliusroemer! (The code was basically copy-paste after tuning one example.) I noticed Brazil and Indonesia as well. We also have small countries with lots of admin 1 divisions that end up larger than they should be (like Costa Rica with 33 samples in 5M for 6.6 per million). I agree to stop here with this specific PR. I would really like to consider more systematic population weighting in the future via augur subsample or the like. A bonus would be if it could be more performant than repeatedly running augur filter as we do currently.

@trvrb trvrb merged commit 7b1ed00 into master Jun 17, 2023
@trvrb trvrb deleted the update-subsampling branch June 17, 2023 20:00
@victorlin victorlin mentioned this pull request Aug 14, 2024
4 tasks