questions about xg data model format #41
Okay, so essentially V_s is the same size as V_b, and the notion of bits per base is only there for efficient storage. If you load a subgraph from the original, much larger graph to construct the xg, the node ids still remain unique within that subgraph, right? (A node id never repeats in the vg graph; it is always incremented whenever a new node is created.) Could you please explain what a wavelet tree of integer node ids would look like for a small subgraph, and what the rank of a node id means in this context? (In the draft, the rank n of a node with id i is given as n = rank_i(V_i, 1), where V_i is the wavelet tree of node ids.) Thanks, Erik!
I originally designed xg to handle discontinuities in ids using a wavelet tree. Now, however, the wavelet tree is only used in the paths (for instance, to find all the occurrences of a node in a path). The original design used this structure, which is now commented out: https://github.com/vgteam/xg/blob/master/src/xg.hpp#L316-L317. At some point I changed to recording the minimum id of the input and using it as an offset between ids and ranks: https://github.com/vgteam/xg/blob/master/src/xg.cpp#L1138-L1152.

I think this may not have been implemented correctly, because the bit vector on which the rank operation is done must cover the entire id space. This could explain some problems. Specifically, we need to iterate over the whole range from min_id to max_id in this loop: https://github.com/vgteam/xg/blob/master/src/xg.cpp#L540-L553. All the sequence-related vectors would need to span the range from min_id to max_id. This is not much of a problem as long as the vectors are compressed and the range has no enormous gaps that would require too much memory at construction time.
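To make the min_id/max_id point concrete, here is a minimal sketch of an id-to-rank mapping that fills out the whole id range, gaps included. It is not the xg implementation: `IdRankIndex`, `has`, and `rank` are hypothetical names, and a plain prefix-count array stands in for the compressed bit vector with rank support that xg would use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of xg): maps sparse node ids to dense ranks.
struct IdRankIndex {
    int64_t min_id;
    // prefix[k] = number of ids present among the first k slots of
    // [min_id, max_id]; stands in for rank1 on a compressed bit vector.
    std::vector<uint32_t> prefix;

    // ids: unique node ids in any order; must be non-empty.
    explicit IdRankIndex(const std::vector<int64_t>& ids) {
        assert(!ids.empty());
        int64_t lo = ids[0], hi = ids[0];
        for (int64_t id : ids) {
            if (id < lo) lo = id;
            if (id > hi) hi = id;
        }
        min_id = lo;
        // One slot for every id in [min_id, max_id], including the gaps.
        std::vector<bool> present(static_cast<std::size_t>(hi - lo + 1), false);
        for (int64_t id : ids) present[static_cast<std::size_t>(id - lo)] = true;
        prefix.assign(present.size() + 1, 0);
        for (std::size_t k = 0; k < present.size(); ++k)
            prefix[k + 1] = prefix[k] + (present[k] ? 1 : 0);
    }

    // True if the id was in the input set.
    bool has(int64_t id) const {
        if (id < min_id) return false;
        std::size_t k = static_cast<std::size_t>(id - min_id);
        return k + 1 < prefix.size() && prefix[k + 1] != prefix[k];
    }

    // 1-based rank of id among the ids present, or 0 if the id is absent.
    uint32_t rank(int64_t id) const {
        return has(id) ? prefix[static_cast<std::size_t>(id - min_id) + 1] : 0;
    }
};
```

With ids {105, 100, 102}, the index spans six slots (100 through 105) and the ranks are 1, 2, and 3 for ids 100, 102, and 105. The memory cost grows with max_id - min_id rather than with the number of nodes, which is why very large gaps matter unless the vector is compressed.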
I've received these questions. I thought it would be helpful to clarify here.
Actually, V_s = [ 011, 000, 000, 010, 001, 000, 000, 010, 010 ]. From the perspective of xg, we aren't thinking about the specific backing implementation. We say this is a compressed integer vector of some kind. It has one entry per base in the graph. We may use more or fewer bits per entry depending on the alphabet of the system, although we are currently focused on the alphabet A, C, T, G, N.
Hopefully my comment above clarifies this query.
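As an illustration of "one entry per base, a few bits per entry", here is a minimal sketch of a 3-bit packed sequence vector. It is not how xg stores V_s: `PackedSeq`, `encode_base`, and the code assignment (A=0, C=1, G=2, T=3, N=4) are assumptions made for the example, and a real implementation would more likely use a compressed integer vector from a succinct data structure library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical 3-bit codes; the actual assignment used by xg may differ.
inline uint8_t encode_base(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return 4;  // N and anything unrecognised
    }
}

// Hypothetical fixed-width packed vector: one 3-bit entry per base.
struct PackedSeq {
    static constexpr unsigned kBits = 3;
    std::vector<uint64_t> words;  // zero-initialised backing storage
    std::size_t length = 0;       // number of bases stored

    explicit PackedSeq(const std::string& seq)
        : words((seq.size() * kBits + 63) / 64 + 1, 0), length(seq.size()) {
        for (std::size_t i = 0; i < seq.size(); ++i) put(i, encode_base(seq[i]));
    }

    // Writes a code into slot i; assumes the slot is still zero.
    void put(std::size_t i, uint8_t code) {
        assert(i < length && code < (1u << kBits));
        std::size_t bit = i * kBits, w = bit / 64, off = bit % 64;
        words[w] |= static_cast<uint64_t>(code) << off;
        // An entry may straddle two 64-bit words.
        if (off + kBits > 64) words[w + 1] |= static_cast<uint64_t>(code) >> (64 - off);
    }

    // Reads the code stored in slot i.
    uint8_t get(std::size_t i) const {
        assert(i < length);
        std::size_t bit = i * kBits, w = bit / 64, off = bit % 64;
        uint64_t v = words[w] >> off;
        if (off + kBits > 64) v |= words[w + 1] << (64 - off);
        return static_cast<uint8_t>(v & ((1u << kBits) - 1));
    }
};
```

Under this assumed code assignment only, the V_s example above would decode to TAAGCAAGG. A smaller alphabet (no N) would fit in 2 bits per base, which is the sense in which the bit width depends on the alphabet.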
Each node id occurs just once, but there is no requirement that the ids be sequential. The reason is that we may want to construct an xg from a subset of a larger graph, which is not possible if we can only operate on graphs where node ids are equivalent to the rank of the node in V_s.
I remain cautious about this design decision. In many cases we can probably make the ids equivalent to the ranks in V_s, and doing so would bring a large performance benefit for xg.
Does this help explain?