New Feature: simplify curation of annotation transcripts #374
Comments
@visze Great! Any preference regarding output format or naming of the commands? I'm seeing different pros and cons for each of the formats that come to mind (XML, JSON, CSV), but so far I would be leaning towards JSON. |
agree with JSON |
@pnrobinson great; this should be fairly straightforward. I will try to prioritize this as it seems to be helpful for other issues (#361, or testing #372). |
FWIW, the SV support will add the ability to limit database building to certain genes/transcripts only. |
We'll happily accept a tested PR but I'm flagging this as backlog as the ticket has been dormant. |
@julesjacobsen -- is there a protobuf serialization of JannovarData objects? Could that be adapted to this purpose? |
@pnrobinson naturally.... It doesn't compress as well, but it's about 2x faster to read. This will also serialise to/from JSON using the protobuf lib. |
@roland-ewald @holtgrewe thoughts on adding protobuf serialization to Jannovar (which gets us JSON nearly for free)? |
@pnrobinson @holtgrewe I think using protobuf here is a great idea. JSON support will make integration testing (and, if necessary, versioning transcript sub-sets) much nicer, but does not necessarily have to be very fast IMHO. |
Hm, I think that we're facing XKCD 927 here... Rather than yet another standard, I'd prefer a serialization to a minimal GFF or GTF sub-set. It's tempting to base it off an existing format, so I would propose the following:
This gives us
... and all without the need for new parsers/data importers etc. FWIW, I would probably be able to implement (2) for the upcoming v0.18 release but (3) is currently out of reach for me. I'd be happy to review tested PRs and support the development with discussions. |
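To illustrate the GFF/GTF idea above: a minimal GTF-style transcript sub-set only needs `transcript` and `exon` feature lines with `gene_id`/`transcript_id` attributes, which existing parsers already understand. The coordinates and identifiers below are purely hypothetical examples, not taken from the issue:

```
chr1	ensembl	transcript	11869	14409	.	+	.	gene_id "ENSG00000223972"; transcript_id "ENST00000456328";
chr1	ensembl	exon	11869	12227	.	+	.	gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1";
```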
That's a valid point. Your suggestions would solve the problem as well. I'm not sure (2) is necessarily part of this problem (although a faster deserialization is nice). Point (3) may not even need to include adjustments to the download tool (the CLI). From my perspective, an easy way of using non-serialized data (be it in GFF or JSON) seems to be more relevant in use cases that interact with the API directly (e.g. for testing). |
Can we put the faster de/serialisation discussion on hold for the time being? I'm happy with the situation as it is. Looking through the issues in FST, there could be a couple of showstoppers; for instance, there appears to be some odd behaviour using G1GC, which is standard for Java 9+. The library looks great in principle, and is wonderfully simple to use (unlike protobuf), but I'm worried about possible incompatibilities. I used protobuf because it is a fast serialisable data structure with excellent language bindings. I simply replicated the Jannovar object model to be able to store it on and read off disk fast. I was just being pragmatic, not trying to create a new data standard! However, using protobuf would mean that you could easily re-use the data in a Python application. The same could be said for any plain-text format, but you lose the speed and schema-based validation. |
@roland-ewald any news on this? |
@holtgrewe as there's some disagreement on the way to implement this and it's not really pressing, I suggest we close the issue for now (if you are OK with that). |
Problem
The current strategy of serializing `JannovarData` makes it hard to manually edit the available transcript data. For example, to improve annotation speed and memory footprint, one could imagine using only a certain subset of the default transcript sets, or including certain historic (but now outdated) transcript versions for validation and comparison to previous data sets. One could also imagine mixing transcripts from different sources, e.g. RefSeq and Ensembl, to give a more complete picture (though I am not sure Jannovar can deal with this).
As a nice side effect, this should make integration-testing Jannovar much easier and faster, as it would then only require a text-based representation of the transcript data for a handful of transcripts: these can be converted to a `JannovarData` object without downloading additional files, and Jannovar's output for variants in these regions should be the same as with more transcripts.
Potential Solution
I suggest adding two new commands to `jannovar-cli`:
- `export`: reads a given `*.ser` file and writes a text-based representation of all included data into a file (e.g. JSON, XML, or a set of CSV files).
- `import`: reads files in the same format as produced by `export` and generates a serialized `JannovarData` file from that, which can be used for annotation.
I'm currently working on code that does this in an ad-hoc manner and would be happy to contribute a PR for this, as this could be useful in many other settings as well. I am open to suggestions regarding the specific design (file format, how to name the commands, etc.).
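The export/import round trip proposed above can be sketched in a few lines of Java. This is a minimal illustration only: `TranscriptStub` is a hypothetical stand-in for Jannovar's actual transcript model, and a tab-separated layout is used for brevity, even though the discussion leans towards JSON. The point is that any text-based representation makes the data hand-editable before it is turned back into objects:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, minimal stand-in for a transcript record (NOT Jannovar's real model).
record TranscriptStub(String accession, String gene, int txStart, int txEnd) {

    // "export": write an editable, text-based representation (one TSV line per transcript).
    static String export(List<TranscriptStub> txs) {
        StringBuilder sb = new StringBuilder();
        for (TranscriptStub t : txs)
            sb.append(t.accession()).append('\t').append(t.gene()).append('\t')
              .append(t.txStart()).append('\t').append(t.txEnd()).append('\n');
        return sb.toString();
    }

    // "import": parse the text form back into objects usable for annotation.
    static List<TranscriptStub> importText(String text) {
        List<TranscriptStub> txs = new ArrayList<>();
        for (String line : text.split("\n")) {
            if (line.isBlank()) continue;
            String[] f = line.split("\t");
            txs.add(new TranscriptStub(f[0], f[1],
                    Integer.parseInt(f[2]), Integer.parseInt(f[3])));
        }
        return txs;
    }
}

public class ExportImportSketch {
    public static void main(String[] args) {
        List<TranscriptStub> txs = List.of(
                new TranscriptStub("NM_000059.3", "BRCA2", 32315473, 32400266));
        String text = TranscriptStub.export(txs);          // editable by hand / in tests
        List<TranscriptStub> back = TranscriptStub.importText(text);
        System.out.println(back.equals(txs));              // prints "true": round trip is lossless
    }
}
```

In the real feature, the text format would of course need to carry the full `JannovarData` contents (chromosomes, exon structures, sequences, etc.), but the same round-trip property is what makes the testing and curation use cases work.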