questions about xg data model format #41

ekg · 2016-09-18T10:02:39Z

I've received these questions. I thought it would be helpful to clarify here.

Say we consider the first 3 nodes of the Figure-5 of vg draft example as a small example to discuss here. Then node-id: seq(node-id) of example graph is as follows.
n1: CAAG
n2: TAAG
n3: G

Node sequence vector is Vs = seq(n1) + seq (n2) + seq(n3) = CAAGTAAGG (after concatenating).

Compressed integer vector of Vs for storage: (as mentioned 3 bits for base: A: 000, T: 001, G:010, C:011, N: 100)
Vs = [ 011000000010001000000010010]
and |Vs| = (no of bits per each base ) x ( no of characters in Vs) = 3 x 9 =27.

Actually, V_s = [ 011, 000, 000, 010, 001, 000, 000, 010, 010 ]. From the perspective of xg, we aren't thinking about the specific backing implementation. We say this is a compressed integer vector of some kind. It has one entry per base in the graph. It may be that we have more or less bits depending on the alphabet of the system, although we are currently focused on the alphabet A, C, T, G, N.

And the second bit vector is Vb = [100010001] and |Vb| = 9

If Vs is a bit vector then |Vs| > |Vb| right ? but it is mentioned that |Vs| = |Vb|. If Vs is not a bit vector but only a character vector obtained by concatenating the character sequences of each node then |Vs| = |Vb| right ?

Hopefully my comment above clarifies this query.

It was mentioned in the draft that Node-ids are stored in the wavelet tree Vi ordered by their appearance in Vs But the appearance in Vs is already ordered right ? It was given
Vs = seq(n1) + seq(n2) +...+ seq(nN)

So node-ids (integers).... (1,2,3) in the current example are stored in the wavelet tree Vi ?

As the node-ids are always unique and incremental in the context of vg (say for example in the Figure 5 of the draft the node ids are 1,2,3,4,5,6,7,8,9,10,11), what does the rank of a particular node by its id mean ? As each node-id appears just once right ? The concatenation of sequence vector is also happening in a ordered way (n1, n2,...nN), so no repetitions of node-ids anywhere.

Could you please clarify the role of storage of node-ids in the wavelet tree and what does rank ‘n’ of a particular node-id mean ?

Each node id occurs just once, but there is not a limitation that they are sequential. The reason is that we may want to construct an xg from a subset of a larger graph, which is not possible if we can only operate on graphs where node ids are equivalent to the rank of the node in V_s.

I remain cautious about this design decision. In many cases we can probably make the ids equivalent to the ranks in V_s. There would be a large benefit in the performance of xg.

Does this help explain?

The text was updated successfully, but these errors were encountered:

vnsk · 2016-09-18T10:25:54Z

Okay..so essentially V_s is of same size of V_b and the notion of bits per base is for efficient storage purpose.

In the case if you load a sub graph from the original huge graph for constructing the xg, still the node-ids remain unique right for that particular sub-graph (as node-id is never repeating in the vg graph its always incremental whenever a new node is created)?

Could you please explain how exactly a wavelet tree of integer node-ids of say any small sub graph look like and what is rank of node-id in this context? (As in the draft rank 'n' of a particular node -id 'i' is given as n = rank_i(V_i,1) where V_i is the wavelet tree of node-ids) .

Thanks Erik !!

ekg · 2016-09-20T09:53:29Z

Also in the preceding mail context, even if we consider sub graph, yet the node-ids will be unique right ? Since in vg, a node-id never repeats so whatever sub graph we pick..it will contain unique ids right ? Then, what actually you say as a rank of node-id if every node is just occurring once ..and what is the role of wavelet tree ..if you can directly use the rank and select on the second bit vector for any hit-node id to extarct the node label? Am I missing anything here. Could you please explain the role of wavelet tree for storing node-ids and what is a rank of node-id ?

I had originally designed xg to handle discontinuity in ids using a wavelet tree. However, now the wavelet tree is only used in the paths (where it is used for instance to determine all the occurrences of a node in a path).

It used this commented-out structure: https://github.com/vgteam/xg/blob/master/src/xg.hpp#L316-L317.

At some point I changed to recording the minimum id of the input and using this as an offset between ids and ranks. https://github.com/vgteam/xg/blob/master/src/xg.cpp#L1138-L1152.

I think that this may have not been correctly implemented as the id space must be filled out entirely in the bit vector on which this rank operation is done. This could explain some problems.

Specifically, we need to iterate over the whole range from min_id to max_id in this loop: https://github.com/vgteam/xg/blob/master/src/xg.cpp#L540-L553

The length of all the sequence-related vectors would need to be as long as min_id to max_id. This is not such a problem as long as these are compressed and there are not enormous gaps in the range that might trip us up by requiring too much memory at construction time.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

questions about xg data model format #41

questions about xg data model format #41

ekg commented Sep 18, 2016

vnsk commented Sep 18, 2016

ekg commented Sep 20, 2016

questions about xg data model format #41

questions about xg data model format #41

Comments

ekg commented Sep 18, 2016

vnsk commented Sep 18, 2016

ekg commented Sep 20, 2016