How we can use our training set? #24

Open
Siavash-cloud opened this issue Jan 6, 2022 · 3 comments

Siavash-cloud commented Jan 6, 2022

Hello @arvkevi,
Thank you for providing this software.
How can we use a different training set (instead of the 1000 Genomes data) in your software?
Regards,
Siavash

Owner

arvkevi commented Jan 7, 2022

Hi @Siavash-cloud, thank you for checking out the repo. Great suggestion. Do you know if gnomAD has sample-level ancestry data available? Or is there another data source you were thinking about?

Author

Siavash-cloud commented Jan 7, 2022

Hi @arvkevi ,
Thanks for your reply.
Yes, you could also add the HGDP dataset (https://www.internationalgenome.org/data-portal/data-collection/hgdp) to your default reference panel (training set). However, what I meant was whether you could offer this as an option in your software, so that people or companies can use either their own private training set or a public dataset. As you may know, 1000 Genomes + HGDP have limited data for some populations (for example Colombian, n=7), which can affect prediction accuracy. With such an option, your software might also be usable for other organisms (such as horses).

Also, can I use another list of SNPs instead of the Kidd et al. 2014 and Kosoy et al. 2009 SNP lists in your software? To the best of my knowledge, Kidd et al. 2014 selected their SNPs based on the fixation index (Fst) of SNPs in their own dataset (a limited number of populations, with a limited number of individuals per population). Using those SNPs can therefore reduce predictive ability for other populations (populations that are not included in 1000 Genomes or HGDP).
Regards,
Siavash
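
For readers following the Fst point in the comment above, here is a minimal sketch of how Wright's fixation index can be computed for one biallelic SNP and used to rank candidate SNPs. It is not the procedure used by Kidd et al. 2014 or by this repository; the function names, SNP IDs, and allele frequencies are hypothetical and only illustrate why a ranking depends on which populations (and how many individuals) the frequencies were estimated from.

```python
def wright_fst(freqs, sizes=None):
    """Wright's F_ST = (H_T - H_S) / H_T for a single biallelic SNP.

    freqs: allele frequency of one allele in each population.
    sizes: optional population weights (e.g. sample sizes); equal if omitted.
    """
    if sizes is None:
        sizes = [1.0] * len(freqs)
    total = sum(sizes)
    weights = [s / total for s in sizes]

    # Weighted mean allele frequency across populations.
    p_bar = sum(w * p for w, p in zip(weights, freqs))
    # Expected heterozygosity in the pooled population.
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        return 0.0  # SNP is monomorphic everywhere: no differentiation.
    # Weighted mean of the within-population expected heterozygosities.
    h_s = sum(w * 2 * p * (1 - p) for w, p in zip(weights, freqs))
    return (h_t - h_s) / h_t


def rank_snps_by_fst(snp_freqs):
    """Return SNP IDs sorted from most to least differentiated."""
    return sorted(snp_freqs, key=lambda snp: wright_fst(snp_freqs[snp]), reverse=True)


# Hypothetical frequencies in two populations (illustrative values only).
example = {
    "snpA": [0.5, 0.5],  # identical frequencies, uninformative
    "snpB": [0.1, 0.9],  # strongly differentiated
    "snpC": [0.0, 1.0],  # fixed for different alleles
}
ranking = rank_snps_by_fst(example)
```

A SNP that ranks highly for one set of populations can be uninformative for a population that was not part of the panel, and frequencies estimated from very few individuals are noisy, so the resulting ranking is less reliable.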

Owner

arvkevi commented Jan 8, 2022

I think it would take a bit of work to incorporate additional (or custom) reference panels, but it's probably worth pursuing. I'll look into what it would take to add that functionality.

Users can supply custom SNPs with the build-model command.
