Find Annotation and Knowledge Graphs to integrate #41

Open
josiahseaman opened this issue Apr 20, 2020 · 10 comments
Labels
documentation (Improvements or additions to documentation), good first issue (Good for newcomers), question (Further information is requested)

Comments

@josiahseaman
Member

josiahseaman commented Apr 20, 2020

Assignee: Ali Haider Bangash
The first step is to identify what data could be integrated through a knowledge graph and what is already available. What did the other hackathon teams accomplish? Findings go in this issue. We're looking for information that relates to genetic variants of the virus:

  • Structural annotations channel. Protein structure => codon table => sequence position. We could mark up pangenome positions related to known protein variants
  • Gene annotations: We possibly only need the reference gene annotation GFF, but it would be nice to have these positions in the graph genome context. @subwaystation has ensured we have coordinate transforms that go both ways (pangenome <-> reference genome coordinates), using FALDO in RDF.
  • Clinical data: Possibly the most important. If we have any knowledge of patient outcomes, and what region they're from, we could connect a strain of the virus (which will contain variants) to a patient outcome: how long in hospital, how long on ventilator, etc. We don't necessarily need a viral sequence from that specific individual, but at minimum a probable association with a variant.
    • Human DNA variation data could also be used as in UK Biobank article.
    • Technically, annotating a complete human pangenome is beyond our current scope, in that gigabase genomes will put strain on our pipeline. It may be possible, however, to make local graphs of key regions like HLA or MHC inside the human genome.
  • Phylogenetics: We're going to have a phylogenetic tree eventually (Phylogenetic Tree Visualization, Schematize#58). It'd be nice to link this with the "country" and "town" concepts in the knowledge graph. What geographic or transmission data could we bring in?
@josiahseaman added the documentation, good first issue, and question labels Apr 20, 2020
@hhaider15

  • Clinical data
    South Korea's COVID-19 patients, 5-year patient history: the government of the Republic of Korea decided to share the world’s first de-identified COVID-19 nationwide patient data with domestic and international researchers. The data sets are collected and processed promptly thanks to the Korean National Health Insurance System, which covers the entire population of the country.

@hhaider15

Structural annotations: Very well done by the Machine Learning working group. Complete genomes of the strains, labelled with their respective source and its metadata.

@hhaider15

@subwaystation
Member

Hi @hhaider15 !
Thanks for all the links. We could work with e.g. .csv or .fasta.

But what we had in mind are SPARQL endpoints, which we could query using SPARQL.

I think a good start would be http://yummydata.org/. And maybe you will find some endpoints which are not listed there ;)
Please get back to me if you have more questions.
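
For anyone new to querying an endpoint, here is a minimal sketch in Python using only the standard library. It assumes the endpoint accepts GET requests as described in the SPARQL 1.1 Protocol and can return the standard SPARQL JSON results format. The endpoint URL is a placeholder, and the function names are made up for illustration; none of them come from this project.

```python
import json
import urllib.parse
import urllib.request


def build_sparql_request(endpoint_url, query):
    """Build a GET request with the query in the `query` parameter.

    Assumes `endpoint_url` has no query string of its own.
    """
    url = endpoint_url + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def rows_from_results(payload):
    """Flatten a SPARQL JSON results document into a list of plain dicts."""
    doc = json.loads(payload)
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in doc["results"]["bindings"]
    ]


def run_query(endpoint_url, query, timeout=30):
    """Send the query over HTTP and return one dict per result row."""
    request = build_sparql_request(endpoint_url, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return rows_from_results(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Placeholder endpoint: replace with a real one, e.g. one found via yummydata.
    ENDPOINT = "https://example.org/sparql"
    for row in run_query(ENDPOINT, "SELECT * WHERE { ?s ?p ?o } LIMIT 5"):
        print(row)
```

Variables that are unbound in a given row are simply missing from that row's dict, so downstream code should use `row.get(...)` rather than indexing.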

@subwaystation
Member

@josiahseaman, regarding phylogenetics: as far as I understood from the #public_sequence_resource group, they will also pack the metadata into a SPARQL endpoint. One mandatory metadata field will be collection_location. For the list of required metadata, please visit https://github.com/arvados/bh20-seq-resource/blob/master/example/minimal_example.yaml.
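
As a rough illustration of what "mandatory field" could mean on our side, the sketch below checks that a metadata record, already loaded into a Python dict, contains a non-empty collection_location. The nesting of the record and the helper names are assumptions for illustration only; the linked minimal_example.yaml remains the authority on the actual layout, so the check searches at any depth rather than assuming one.

```python
def find_field(node, name):
    """Return every value stored under key `name`, at any nesting depth."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == name:
                found.append(value)
            found.extend(find_field(value, name))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_field(item, name))
    return found


def missing_required(metadata, required=("collection_location",)):
    """List the required field names that are absent or empty in the record."""
    empty_values = (None, "", [], {})
    return [
        name
        for name in required
        if not any(value not in empty_values for value in find_field(metadata, name))
    ]


if __name__ == "__main__":
    # Hypothetical record with placeholder values, not real data.
    record = {
        "sample": {
            "sample_id": "placeholder-id",
            "collection_location": "placeholder-location",
        }
    }
    print(missing_required(record))
```

Only collection_location is named in this thread; other required fields can be passed through the `required` argument once they are confirmed against the example file.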

@innamoratika

Ali, just wanted to introduce myself after talking with @josiahseaman: I'll be working on the phylo side of things, and we should touch base at some point about using universal IDs for genomes. We should have enough in the phylo tree that we can track provenance and pass that on to you!

@hhaider15

Agreed. Apologies, I was busy earlier. I'll be working on this now.

@hhaider15

Good to see you @innamoratika

4 participants