
New Feature: simplify curation of annotation transcripts #374

Open · roland-ewald opened this issue Nov 29, 2017 · 15 comments

@roland-ewald (Contributor)

Problem

The current strategy of serializing JannovarData makes it hard to manually edit the available transcript data. For example, to improve annotation speed and memory footprint one could imagine using only a subset of the default transcript sets, or including certain historic (but now outdated) transcript versions for validation and comparison with previous data sets. One could also imagine mixing transcripts from different sources, e.g. RefSeq and Ensembl, to give a more complete picture (though I am not sure Jannovar can deal with this).

As a nice side effect, this should make integration testing Jannovar much easier and faster, as it would only require a text-based representation of the transcript data for a handful of transcripts: these can be converted to a JannovarData object without downloading additional files, and Jannovar's output for variants in these regions should be the same as with more transcripts. A rough sketch of such a test setup follows.
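For example, a test could construct the data directly, along these lines (a minimal sketch; I am assuming the jannovar-core classes ReferenceDictionary, TranscriptModelBuilder, and the JannovarData constructor roughly as they exist today, so details may differ):

```java
import com.google.common.collect.ImmutableList;
import de.charite.compbio.jannovar.data.JannovarData;
import de.charite.compbio.jannovar.data.ReferenceDictionary;
import de.charite.compbio.jannovar.reference.HG19RefDictBuilder;
import de.charite.compbio.jannovar.reference.TranscriptModel;
import de.charite.compbio.jannovar.reference.TranscriptModelBuilder;

// Build a tiny JannovarData instance for an integration test,
// bypassing the download/serialization step entirely.
ReferenceDictionary refDict = HG19RefDictBuilder.build();

TranscriptModelBuilder builder = new TranscriptModelBuilder();
// ... set accession, gene symbol, strand, tx/CDS regions, exons, sequence ...
TranscriptModel tm = builder.build();

JannovarData data = new JannovarData(refDict, ImmutableList.of(tm));
```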

Potential Solution

I suggest adding two new commands to jannovar-cli (see the sketch after this list):

  • export: reads a given *.ser file and writes a text-based representation of all included data to a file (e.g. JSON, XML, or a set of CSV files).

  • import: reads files in the format produced by export and generates a serialized JannovarData file from them, which can then be used for annotation.
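A hypothetical round trip could then look like this (command names and flags are illustrative only, nothing here exists yet):

```
# Export the transcript data from an existing .ser file to an editable text form
java -jar jannovar-cli.jar export -d data/hg19_refseq.ser -o transcripts.json

# ... manually curate transcripts.json (drop, pin, or mix transcripts) ...

# Re-import the curated representation into a new .ser file for annotation
java -jar jannovar-cli.jar import -i transcripts.json -o data/hg19_curated.ser
```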

I'm currently working on code that does this in an ad-hoc manner and would be happy to contribute a PR for this, as this could be useful in many other settings as well. I am open to suggestions regarding the specific design (file format, how to name the commands, etc).

@roland-ewald roland-ewald changed the title New Feature: simplify curation annotation transcripts New Feature: simplify curation of annotation transcripts Nov 29, 2017
@visze (Contributor) commented Nov 30, 2017

+1

I guess this issue is related to #353

This will also help me in defining good test cases for the serious bug #361.

@roland-ewald (Contributor, Author)

@visze Great! Any preference regarding the output format or the naming of the commands? I see different pros and cons for each of the formats that come to mind (XML, JSON, CSV), but so far I am leaning towards JSON; a hypothetical record is sketched below.
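For illustration, an exported transcript record in JSON might look roughly like this (all field names are hypothetical, not a settled schema):

```json
{
  "accession": "NM_000000.0",
  "geneSymbol": "EXAMPLE1",
  "strand": "+",
  "txRegion": {"chrom": "chr1", "begin": 1000, "end": 5000},
  "cdsRegion": {"chrom": "chr1", "begin": 1200, "end": 4800},
  "exonRegions": [
    {"chrom": "chr1", "begin": 1000, "end": 2000},
    {"chrom": "chr1", "begin": 3000, "end": 5000}
  ],
  "sequence": "ACGT..."
}
```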

@pnrobinson (Contributor)

agree with JSON

@roland-ewald (Contributor, Author)

@pnrobinson Great; this should be fairly straightforward. I will try to prioritize this, as it seems to be helpful for other issues (#361, or testing #372).

@holtgrewe (Member)

FWIW, the SV support will add the ability to limit database building to certain genes/transcripts only.

@holtgrewe holtgrewe added this to the Backlog milestone Feb 7, 2019
@holtgrewe (Member)

We'll happily accept a tested PR, but I'm flagging this as backlog since the ticket has been dormant.

@pnrobinson (Contributor)

@julesjacobsen -- is there a protobuf serialization of JannovarData objects? Could that be adapted to this purpose?

@julesjacobsen (Contributor)

@pnrobinson (Contributor)

@roland-ewald @holtgrewe thoughts on adding protobuf serialization to Jannovar (which gets us JSON nearly for free)?
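For context, protobuf's protobuf-java-util package can convert any message to and from JSON; a minimal sketch (TranscriptData stands in for a yet-to-be-defined generated message type):

```java
import com.google.protobuf.util.JsonFormat;

// Round-trip a protobuf message through JSON using protobuf-java-util.
// TranscriptData is a placeholder for a generated Jannovar message type.
TranscriptData original = TranscriptData.newBuilder()
    .setAccession("NM_000000.0")
    .build();

// Message -> JSON string (throws InvalidProtocolBufferException)
String json = JsonFormat.printer().print(original);

// JSON string -> message
TranscriptData.Builder builder = TranscriptData.newBuilder();
JsonFormat.parser().merge(json, builder);
TranscriptData restored = builder.build();
```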

@roland-ewald (Contributor, Author)

@pnrobinson @holtgrewe I think using protobuf here is a great idea.

JSON support will make integration testing (and, if necessary, versioning transcript subsets) much nicer, but it does not necessarily have to be very fast IMHO.

@holtgrewe (Member)

Hm, I think that we're facing XKCD 927 here... I'd rather see serialization to a minimal GFF or GTF subset than yet another standard. It's tempting to base it on what TranscriptModel looks like right now, but in terms of portability that is not a big improvement over native serialization.

I would propose the following:

  1. Allow importing only certain genes or transcripts from any supported transcript database (available in develop).
  2. Use FST for serialization and deserialization to save some startup time (see the sketch at the end of this comment).
  3. Write an exporter for RefSeq and ENSEMBL and adjust the "download" tool such that you can pass a directory with a data set descriptor INI file and the necessary GTF/GFF files (an example descriptor is sketched below).

This gives us

  • easy-to-create test data sets for which visualisation already exists in popular genome browsers
  • faster serialization/deserialization
  • using community standard file formats
  • possibility to curate data sets

... and all without the need for new parsers/data importers etc.
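For illustration, the data set descriptor INI file from point (3) might look roughly like this (all section and key names are hypothetical):

```ini
[dataset]
name = hg19_curated
genome_build = hg19

[files]
; GTF/GFF files sitting next to this descriptor in the same directory
transcripts = curated_transcripts.gtf
reference = hg19.fa
```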

FWIW, I would probably be able to implement (2) for the upcoming v0.18 release but (3) is currently out of reach for me. I'd be happy to review tested PRs and support the development with discussions.
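For reference, the FST round trip from point (2) would be roughly as follows (using org.nustaq:fst; an untested sketch, and JannovarData would need to remain Serializable):

```java
import org.nustaq.serialization.FSTConfiguration;

// FST offers a drop-in replacement for Java serialization that is
// typically much faster to write and, more importantly here, to read.
FSTConfiguration conf = FSTConfiguration.createDefaultConfiguration();

// Serialize: JannovarData -> byte[]
byte[] bytes = conf.asByteArray(jannovarData);

// Deserialize: byte[] -> JannovarData
JannovarData restored = (JannovarData) conf.asObject(bytes);
```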

@roland-ewald (Contributor, Author)

That's a valid point, and your suggestions would solve the problem as well. I'm not sure (2) is necessarily part of this problem (although faster deserialization is nice). Point (3) may not even need to include adjustments to the download tool (the CLI): from my perspective, an easy way of using non-serialized data (be it GFF or JSON) seems most relevant in use cases that interact with the API directly (e.g. for testing).

@julesjacobsen (Contributor) commented Feb 8, 2019

Can we put the faster de/serialisation discussion on hold for the time being? I'm happy with the situation as it is. Looking through the issues in FST, there could be a couple of showstoppers; for instance, there appears to be some odd behaviour using G1GC, which is the default garbage collector for Java 9+. The library looks great in principle and is wonderfully simple to use (unlike protobuf), but I'm worried about possible incompatibilities.

I used protobuf because it is a fast serialisable data structure with excellent language bindings. I simply replicated the Jannovar object model so that it can be stored on and read off disk quickly. I was just being pragmatic, not trying to create a new data standard! However, using protobuf would mean that you could easily re-use the data in a Python application. The same could be said for any plain-text format, but then you lose the speed and the schema-based validation.
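To make that concrete, a schema mirroring the object model might look like this (message and field names are hypothetical, not the actual schema I used):

```protobuf
syntax = "proto3";

package jannovar;

// Hypothetical mirror of Jannovar's TranscriptModel.
message TranscriptModel {
  string accession = 1;
  string gene_symbol = 2;
  GenomeInterval tx_region = 3;
  GenomeInterval cds_region = 4;
  repeated GenomeInterval exon_regions = 5;
  string sequence = 6;
}

message GenomeInterval {
  string chrom = 1;
  int32 begin = 2;
  int32 end = 3;
}
```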

@holtgrewe holtgrewe modified the milestones: Backlog, 0.29 Feb 28, 2019
@holtgrewe (Member)

@roland-ewald any news on this?

@roland-ewald (Contributor, Author)

@holtgrewe as there's some disagreement on the way to implement this and it's not really pressing, I suggest we close the issue for now (if you are OK with that).

@holtgrewe holtgrewe modified the milestones: 0.29, Backlog May 21, 2019