
New Feature: simplify curation of annotation transcripts #374

Open · roland-ewald opened this issue Nov 29, 2017 · 15 comments

@roland-ewald (Contributor)

Problem

The current strategy of serializing JannovarData makes it hard to manually edit the available transcript data. For example, to improve annotation speed and memory footprint one could imagine using only a subset of the default transcript sets, or including certain historic (but now outdated) transcript versions for validation and comparison with previous data sets. One could also imagine mixing transcripts from different sources, e.g. RefSeq and Ensembl, to give a more complete picture (though I am not sure Jannovar can deal with this).

As a nice side effect, this should make integration testing Jannovar much easier and faster, as it would only require a text-based representation of the transcript data for a handful of transcripts: these can be converted to a JannovarData object without downloading additional files, and Jannovar's output for variants in these regions should be the same as with more transcripts. A rough sketch of such a test setup follows.
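For example, a test could construct the data directly, along these lines (a minimal sketch; I am assuming the jannovar-core classes ReferenceDictionary, TranscriptModelBuilder, and the JannovarData constructor roughly as they exist today, so details may differ):

```java
import com.google.common.collect.ImmutableList;
import de.charite.compbio.jannovar.data.JannovarData;
import de.charite.compbio.jannovar.data.ReferenceDictionary;
import de.charite.compbio.jannovar.reference.HG19RefDictBuilder;
import de.charite.compbio.jannovar.reference.TranscriptModel;
import de.charite.compbio.jannovar.reference.TranscriptModelBuilder;

// Build a tiny JannovarData instance for an integration test,
// bypassing the download/serialization step entirely.
ReferenceDictionary refDict = HG19RefDictBuilder.build();

TranscriptModelBuilder builder = new TranscriptModelBuilder();
// ... set accession, gene symbol, strand, tx/CDS regions, exons, sequence ...
TranscriptModel tm = builder.build();

JannovarData data = new JannovarData(refDict, ImmutableList.of(tm));
```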

Potential Solution

I suggest adding two new commands to jannovar-cli (see the sketch after this list):

  • export: reads a given *.ser file and writes a text-based representation of all included data to a file (e.g. JSON, XML, or a set of CSV files).

  • import: reads files in the format produced by export and generates a serialized JannovarData file from them, which can then be used for annotation.
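A hypothetical round trip could then look like this (command names and flags are illustrative only, nothing here exists yet):

```
# Export the transcript data from an existing .ser file to an editable text form
java -jar jannovar-cli.jar export -d data/hg19_refseq.ser -o transcripts.json

# ... manually curate transcripts.json (drop, pin, or mix transcripts) ...

# Re-import the curated representation into a new .ser file for annotation
java -jar jannovar-cli.jar import -i transcripts.json -o data/hg19_curated.ser
```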

I'm currently working on code that does this in an ad-hoc manner and would be happy to contribute a PR for this, as this could be useful in many other settings as well. I am open to suggestions regarding the specific design (file format, how to name the commands, etc).

@roland-ewald roland-ewald changed the title New Feature: simplify curation annotation transcripts New Feature: simplify curation of annotation transcripts Nov 29, 2017
@visze (Contributor) commented Nov 30, 2017

+1

I guess this issue is related to #353

This will also help me in defining good test cases for the serious bug #361.

@roland-ewald (Contributor, Author)

@visze Great! Any preference regarding the output format or the naming of the commands? I see different pros and cons for each of the formats that come to mind (XML, JSON, CSV), but so far I am leaning towards JSON; a hypothetical record is sketched below.
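For illustration, an exported transcript record in JSON might look roughly like this (all field names are hypothetical, not a settled schema):

```json
{
  "accession": "NM_000000.0",
  "geneSymbol": "EXAMPLE1",
  "strand": "+",
  "txRegion": {"chrom": "chr1", "begin": 1000, "end": 5000},
  "cdsRegion": {"chrom": "chr1", "begin": 1200, "end": 4800},
  "exonRegions": [
    {"chrom": "chr1", "begin": 1000, "end": 2000},
    {"chrom": "chr1", "begin": 3000, "end": 5000}
  ],
  "sequence": "ACGT..."
}
```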

@pnrobinson (Contributor)

agree with JSON

@roland-ewald (Contributor, Author)

@pnrobinson Great; this should be fairly straightforward. I will try to prioritize this, as it seems to be helpful for other issues (#361, or testing #372).

@holtgrewe (Member)

FWIW, the SV support will add the ability to limit database building to certain genes/transcripts only.

@holtgrewe holtgrewe added this to the Backlog milestone Feb 7, 2019
@holtgrewe (Member)

We'll happily accept a tested PR, but I'm flagging this as backlog since the ticket has been dormant.

@pnrobinson (Contributor)

@julesjacobsen -- is there a protobuf serialization of JannovarData objects? Could that be adapted to this purpose?

@julesjacobsen (Contributor)

@pnrobinson (Contributor)

@roland-ewald @holtgrewe thoughts on adding protobuf serialization to Jannovar (which gets us JSON nearly for free)?
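For context, protobuf's protobuf-java-util package can convert any message to and from JSON; a minimal sketch (TranscriptData stands in for a yet-to-be-defined generated message type):

```java
import com.google.protobuf.util.JsonFormat;

// Round-trip a protobuf message through JSON using protobuf-java-util.
// TranscriptData is a placeholder for a generated Jannovar message type.
TranscriptData original = TranscriptData.newBuilder()
    .setAccession("NM_000000.0")
    .build();

// Message -> JSON string (throws InvalidProtocolBufferException)
String json = JsonFormat.printer().print(original);

// JSON string -> message
TranscriptData.Builder builder = TranscriptData.newBuilder();
JsonFormat.parser().merge(json, builder);
TranscriptData restored = builder.build();
```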

@roland-ewald (Contributor, Author)

@pnrobinson @holtgrewe I think using protobuf here is a great idea.

JSON support will make integration testing (and, if necessary, versioning transcript subsets) much nicer, but it does not necessarily have to be very fast IMHO.

@holtgrewe (Member)

Hm, I think that we're facing XKCD 927 here... I'd rather see serialization to a minimal GFF or GTF subset than yet another standard. It's tempting to base it on what TranscriptModel looks like right now, but in terms of portability that is not a big improvement over native serialization.

I would propose the following:

  1. Allow importing only certain genes or transcripts from any supported transcript database (available in develop).
  2. Use FST for serialization and deserialization to save some startup time (see the sketch at the end of this comment).
  3. Write an exporter for RefSeq and ENSEMBL and adjust the "download" tool such that you can pass a directory with a data set descriptor INI file and the necessary GTF/GFF files (an example descriptor is sketched below).

This gives us

  • easy-to-create test data sets for which visualisation already exists in popular genome browsers
  • faster serialization/deserialization
  • using community standard file formats
  • possibility to curate data sets

... and all without the need for new parsers/data importers etc.
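For illustration, the data set descriptor INI file from point (3) might look roughly like this (all section and key names are hypothetical):

```ini
[dataset]
name = hg19_curated
genome_build = hg19

[files]
; GTF/GFF files sitting next to this descriptor in the same directory
transcripts = curated_transcripts.gtf
reference = hg19.fa
```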

FWIW, I would probably be able to implement (2) for the upcoming v0.18 release but (3) is currently out of reach for me. I'd be happy to review tested PRs and support the development with discussions.
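For reference, the FST round trip from point (2) would be roughly as follows (using org.nustaq:fst; an untested sketch, and JannovarData would need to remain Serializable):

```java
import org.nustaq.serialization.FSTConfiguration;

// FST offers a drop-in replacement for Java serialization that is
// typically much faster to write and, more importantly here, to read.
FSTConfiguration conf = FSTConfiguration.createDefaultConfiguration();

// Serialize: JannovarData -> byte[]
byte[] bytes = conf.asByteArray(jannovarData);

// Deserialize: byte[] -> JannovarData
JannovarData restored = (JannovarData) conf.asObject(bytes);
```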

@roland-ewald (Contributor, Author)

That's a valid point, and your suggestions would solve the problem as well. I'm not sure (2) is necessarily part of this problem (although faster deserialization is nice). Point (3) may not even need to include adjustments to the download tool (the CLI): from my perspective, an easy way of using non-serialized data (be it GFF or JSON) seems most relevant in use cases that interact with the API directly (e.g. for testing).

@julesjacobsen (Contributor) commented Feb 8, 2019

Can we put the faster de/serialisation discussion on hold for the time being? I'm happy with the situation as it is. Looking through the issues in FST, there could be a couple of showstoppers; for instance, there appears to be some odd behaviour using G1GC, which is the default garbage collector for Java 9+. The library looks great in principle and is wonderfully simple to use (unlike protobuf), but I'm worried about possible incompatibilities.

I used protobuf because it is a fast serialisable data structure with excellent language bindings. I simply replicated the Jannovar object model so that it can be stored on and read off disk quickly. I was just being pragmatic, not trying to create a new data standard! However, using protobuf would mean that you could easily re-use the data in a Python application. The same could be said for any plain-text format, but then you lose the speed and the schema-based validation.
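To make that concrete, a schema mirroring the object model might look like this (message and field names are hypothetical, not the actual schema I used):

```protobuf
syntax = "proto3";

package jannovar;

// Hypothetical mirror of Jannovar's TranscriptModel.
message TranscriptModel {
  string accession = 1;
  string gene_symbol = 2;
  GenomeInterval tx_region = 3;
  GenomeInterval cds_region = 4;
  repeated GenomeInterval exon_regions = 5;
  string sequence = 6;
}

message GenomeInterval {
  string chrom = 1;
  int32 begin = 2;
  int32 end = 3;
}
```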

@holtgrewe holtgrewe modified the milestones: Backlog, 0.29 Feb 28, 2019
@holtgrewe (Member)

@roland-ewald any news on this?

@roland-ewald (Contributor, Author)

@holtgrewe as there's some disagreement on the way to implement this and it's not really pressing, I suggest we close the issue for now (if you are OK with that).

@holtgrewe holtgrewe modified the milestones: 0.29, Backlog May 21, 2019