pull requests for new data #8

Open
tavareshugo opened this issue Apr 23, 2019 · 4 comments

@tavareshugo

Dear @kbroman ,

I understand this repo is mostly for example data and for illustrating your package functionality.

However, are you accepting pull requests for new data? Or do you have any views on sharing cross2 datasets?

(I'm asking as I've just parsed some genotypes from Cui et al 2012 and was wondering whether to dump them in a new repo or as a pull request here)

I can see an advantage of having it centralised, in that it makes it easy for the community to find already parsed genotypes. The disadvantage, I guess, is that it potentially adds lots of work for you...

@kbroman
Member

kbroman commented Apr 23, 2019

This repository is already getting a bit too big, so it's probably best not to add to it. I'd recommend creating a separate repo or putting files on zenodo.org or figshare.com. I'd be happy to link to it in the README for the present repo.

Gary Churchill had started a QTL archive to encourage data sharing, but it ended up getting merged into the Mouse Phenome Database.

Access to a variety of published QTL datasets would be really valuable though, particularly if the metadata were compiled in a way that could be easily queried, and if the data sets were all in a common form, ready for analysis.

@kbroman kbroman closed this as completed Apr 23, 2019
@tavareshugo
Author

Yes, I can see that the long-term sustainability is not ideal with regards to size.
Would be great indeed to have a centralised way to fetch data from the community. Perhaps using something like datastorr or piggyback would be one way to go.

In the meantime, I've just started a parallel qtl2data for data that I happen to parse. :)

@kbroman kbroman reopened this Apr 28, 2019
@kbroman
Member

kbroman commented Apr 28, 2019

I've thought about this some more, and do think it'd be great to further compile QTL mapping datasets.

We'd like something like dbGaP, but I'm thinking of a distributed system that could be run at no cost, because it's hard to get funding for this sort of thing.

Each dataset would live in a separate GitHub repository, or just sit at Zenodo, and a single GitHub repository would contain just a CSV file with one row for each dataset, giving basic summary information plus links to references and to the data.
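A minimal sketch of what such an index could look like in R. The column names (`cross_id`, `organism`, `cross_type`, `n_ind`, `doi`, `link_to_cross2`) and the single example row are placeholders for illustration, not an agreed schema or a real dataset.

```r
# Hypothetical index: one row per dataset. All values are placeholders.
index <- data.frame(
  cross_id       = "example_cross",
  organism       = "Mus musculus",
  cross_type     = "f2",
  n_ind          = 100L,
  doi            = NA_character_,  # publication DOI would go here
  link_to_cross2 = "https://example.org/example_cross.zip",
  stringsAsFactors = FALSE
)

# Round-trip through CSV, which is how the central repository would store it.
csv_file <- tempfile(fileext = ".csv")
write.csv(index, csv_file, row.names = FALSE)
index2 <- read.csv(csv_file, stringsAsFactors = FALSE)

# The index can then be queried like any other data frame.
mouse_ids <- index2$cross_id[index2$organism == "Mus musculus"]
```

Keeping the index as a flat CSV means it can be edited by pull request and read without any extra packages.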

@tavareshugo
Author

that's a nice idea 👍
Simple and relatively low maintenance.

Maybe also have a guide on how to prepare and distribute the data. E.g. provide it as a zip file, as individual files, or as an RDS file? How to name crosses (organism name, publication name, etc.)? What metadata to provide (organism, number of individuals, number/type of markers, publication DOI, type of cross, etc.)? Things like that.

With time, maybe one could actually include such a table as a data.frame object with the qtl2 package, so users can access it from within the package. So one could do something like:

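# qtl2data is a data.frame with information about available crosses
url <- qtl2data$link_to_cross2[qtl2data$cross_id == "DO_Gatti2014"]
# download.file() needs a destination; here the file name is taken from the URL
download.file(url, destfile = basename(url))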

Or just have a wrapper function like download_cross2 that does just that.
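A rough sketch of such a wrapper, using base R only. The function does not exist in qtl2; its arguments are assumptions, and `index` is assumed to be a data frame with `cross_id` and `link_to_cross2` columns as in the snippet above. The `download` argument is only there so the function can be tested without network access.

```r
# Hypothetical helper, not part of qtl2.
download_cross2 <- function(cross_id, index, destdir = tempdir(),
                            download = utils::download.file) {
  # Look up the link for the requested cross
  url <- index$link_to_cross2[index$cross_id == cross_id]
  if (length(url) != 1L) {
    stop("expected exactly one entry for cross_id '", cross_id,
         "', found ", length(url))
  }
  # Save under the file name given in the URL; "wb" keeps zip files intact
  destfile <- file.path(destdir, basename(url))
  download(url, destfile, mode = "wb")
  destfile
}
```

If the link points to a zip file or a control file, the returned path could then be passed to `qtl2::read_cross2()`.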

Also, one could then have a unit test that checks the validity of all the links, in case some files disappear with time (users might delete repositories, servers might go down, etc.). That way it would be easy to notify the authors and remove the dataset from the CSV until it is fixed.
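One way such a check might look, as a sketch in base R. Both function names are made up here, the index is again assumed to have `cross_id` and `link_to_cross2` columns, and "reachable" is taken to mean simply that a connection to the URL can be opened, which is only one possible definition.

```r
# Hypothetical reachability check: TRUE if a connection to the URL opens.
url_ok <- function(u) {
  tryCatch({
    con <- suppressWarnings(url(u, open = "rb"))
    close(con)
    TRUE
  }, error = function(e) FALSE)
}

# Return the cross_id of every row whose link is not reachable.
# `is_reachable` can be swapped out, e.g. for a stub in tests.
check_links <- function(index, is_reachable = url_ok) {
  ok <- vapply(index$link_to_cross2, is_reachable, logical(1),
               USE.NAMES = FALSE)
  index$cross_id[!ok]
}
```

Run on a schedule (for example from a CI job), the returned IDs would give the list of datasets whose authors need to be contacted.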
