Update subsampling #1074
In the current "global" analyses, treating China and India each as just another country in Asia was resulting in much smaller per-capita sampling rates. For example, in the current gisaid/global/6m tree we have 66 viruses from Guatemala (population 17M), 62 viruses from Costa Rica (population 5M), 18 viruses from India (population 1400M) and 21 viruses from China (population 1400M). This is a ~1000-fold difference in per-capita sampling intensity. This commit partially addresses this issue by splitting out China and India into their own buckets when subsampling. This results in buckets of North America (580M), South America (420M), Europe (750M), Africa (1.2B), Oceania (44M), India (1.4B), China (1.4B) and Asia minus India and China (1.8B). Additionally, this commit makes a small correction that reduces Oceania's region count to 20% of that of the other regions, down from the previous 33%.
Within the builds that focus on region=Asia there is currently less intensive per-capita sampling in China and India relative to other countries in Asia. For example, the current gisaid/asia/6m tree has 144 viruses from China (population 1.4B), 96 viruses from India (population 1.4B), 118 viruses from Thailand (population 70M) and 53 viruses from Laos (population 7M). This is a ~100-fold difference in sampling intensity between Laos and India. This commit splits Asia-focused builds to have 4 geographic buckets rather than the previous 2, arriving at China, India, Asia (minus China and India) and global context. This won't fully address differential per-capita sampling intensity in Asia, but is a simple addition that should go a long way.
@corneliusroemer: Let me know your opinion here. I don't think I'd like to make this any more complicated than what the PR has done to break out China and India. However, if you think it's better to further balance region target numbers away from the basically even targets across regions, I could certainly do so. This would likely slightly downweight North America and South America, further downweight Oceania and upweight Africa, Asia, China and India.
This commit slightly updates targets for the `nextstrain_global` subsampling schemes in an attempt to bring realized per-capita sample counts more in line with population size.
Actually... I decided to go ahead and tweak the regional target
Here's the global map from https://nextstrain.org/staging/ncov/gisaid/trial/update-subsampling-v2/global/6m: If we compare regions here, we have:
I think I'm really pretty happy with this at this point, but of course let me know if you think otherwise.
Ah this looks very good @trvrb. There's still scope for improvement (Brazil/Russia/Indonesia/Malaysia too small), but it's much closer to ideal than anything we've had in the past.
Great result! It's a lot of code, that must have been a lot of effort.
I hope I can offer a much simpler population weighted sampling script in the near future, been thinking about it a lot but haven't started coding.
Thanks @corneliusroemer! (Code was basically copy paste after tuning one example) I noticed Brazil and Indonesia as well. And we have small countries with lots of admin 1 divisions larger than they should be also (like Costa Rica with 33 samples in 5M for 6.6 per million). I agree to stop here with this specific PR. I would really like to consider more systematic population weighting in the future via
Description of proposed changes
In the current "global" analyses, treating China and India each as just another country in Asia was resulting in much smaller per-capita sampling rates relative to most other countries. For example, in the current gisaid/global/6m tree we have 66 viruses from Guatemala (population 17M), 62 viruses from Costa Rica (population 5M), 18 viruses from India (population 1400M) and 21 viruses from China (population 1400M). This is a ~1000-fold difference in per-capita sampling intensity.
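The per-capita comparison above can be reproduced with a few lines of arithmetic. This is only a sketch: the counts and populations are the rounded figures quoted in the paragraph, and the names `samples`, `population_millions` and `per_million` are illustrative, not part of the workflow.

```python
# Counts and populations (in millions) as quoted in the description above.
samples = {"Guatemala": 66, "Costa Rica": 62, "India": 18, "China": 21}
population_millions = {"Guatemala": 17, "Costa Rica": 5, "India": 1400, "China": 1400}


def per_million(country):
    """Sampled viruses per million residents (illustrative helper)."""
    return samples[country] / population_millions[country]


rates = {country: per_million(country) for country in samples}
for country, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{country:<12}{rate:8.3f} per million")

# Ratio between the most and least intensively sampled countries.
fold = max(rates.values()) / min(rates.values())
print(f"fold difference: {fold:.0f}")
```

With these rounded inputs the largest gap is between Costa Rica and India, at a bit under 1000-fold, which is consistent with the "~1000-fold" figure.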
This PR partially addresses this issue by splitting out China and India into their own buckets when subsampling in the `global` build targets. This results in buckets of North America (580M), South America (420M), Europe (750M), Africa (1.2B), Oceania (44M), India (1.4B), China (1.4B) and Asia minus India and China (1.8B). Additionally, this commit makes a small correction that reduces Oceania's region count to 20% of that of the other regions, down from the previous 33%.
Within the builds that focus on `region=Asia` there is currently less intensive per-capita sampling in China and India relative to other countries in Asia. For example, the current gisaid/asia/6m tree has 144 viruses from China (population 1.4B), 96 viruses from India (population 1.4B), 118 viruses from Thailand (population 70M) and 53 viruses from Laos (population 7M). This is a ~100-fold difference in sampling intensity between Laos and India.
This PR splits Asia-focused builds to have 4 geographic buckets rather than the previous 2, arriving at China, India, Asia (minus China and India) and global context. This won't fully address differential per-capita sampling intensity in Asia, but is a simple addition that should go a long way.
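As a reference point for the eight global buckets listed above, the sketch below shows what strictly population-proportional targets would look like. This is not what the PR implements (it keeps roughly even targets across regions, with Oceania reduced); the helper `proportional_targets` and the total of 3000 sequences are hypothetical and chosen only for illustration.

```python
# Bucket populations in millions, as listed in the description above.
bucket_population = {
    "North America": 580,
    "South America": 420,
    "Europe": 750,
    "Africa": 1200,
    "Oceania": 44,
    "India": 1400,
    "China": 1400,
    "Asia (minus India and China)": 1800,
}


def proportional_targets(populations, total):
    """Split `total` sequences across buckets in proportion to population."""
    world = sum(populations.values())
    return {name: round(total * pop / world) for name, pop in populations.items()}


HYPOTHETICAL_TOTAL = 3000  # illustrative only; not a value from builds.yaml
targets = proportional_targets(bucket_population, HYPOTHETICAL_TOTAL)
for name, n in targets.items():
    print(f"{name:<30}{n:5d}")
```

Because each bucket is rounded independently, the targets may not sum exactly to the requested total; a real scheme would need to decide how to handle the remainder.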
Testing
Trial runs for this PR exist at:
You can see that the global map looks improved:
where the 6m build has 412 viruses from China and 417 viruses from India out of a total of 3197.
If we compare regions here, we have:
The `builds.yaml` is asking for more viruses from Asia than from China or India, so lower counts here should reflect a lack of available sequences.
The map from the `asia` build also looks improved:
with 897 viruses from China, 688 viruses from India and 1076 viruses from elsewhere in Asia.
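To check how balanced the `asia` trial build is on a per-capita basis, the counts above can be divided by the bucket populations quoted in the description (1.4B each for China and India, 1.8B for the rest of Asia). This is a rough sketch with rounded populations; the global context sequences are not included, and the variable names are illustrative.

```python
# (count in the trial asia build, population in millions) per bucket.
asia_build = {
    "China": (897, 1400),
    "India": (688, 1400),
    "Asia (minus China and India)": (1076, 1800),
}

# Sampled viruses per million residents for each bucket.
bucket_rates = {name: count / pop for name, (count, pop) in asia_build.items()}
for name, rate in bucket_rates.items():
    print(f"{name:<30}{rate:6.2f} per million")

# Ratio between the most and least intensively sampled buckets.
spread = max(bucket_rates.values()) / min(bucket_rates.values())
print(f"max/min ratio across buckets: {spread:.2f}")
```

At the bucket level the three rates fall within a factor of about 1.3 of each other, compared with the roughly 100-fold country-level gap in the previous build. Differences between individual countries inside the "rest of Asia" bucket are not captured by this check.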
Release checklist
If this pull request introduces new features, complete the following steps:
Update `docs/src/reference/change_log.md` in this pull request to document these changes by the date they were added.