Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CS might add links where none are given from odgi bin #48

Open
subwaystation opened this issue May 12, 2020 · 5 comments
Open

CS might add links where none are given from odgi bin #48

subwaystation opened this issue May 12, 2020 · 5 comments
Labels
bug Something isn't working

Comments

@subwaystation
Copy link
Member

Test data set:
svg_opg1_issue_12052020.zip

Command lines used:

odgi build -g svg1.gfa -o svg1.gfa.og
/home/heumos/git/odgi/bin/odgi bin -i svg1.gfa.og -f svg1.gfa.og.fasta -j -w 1 -s -g > svg1.gfa.og.w1.g.json
/home/heumos/git/odgi/bin/odgi bin -i svg1.gfa.og -f svg1.gfa.og.fasta -j -w 1 -s > svg1.gfa.og.w1.json
python ~/git/component_segmentation/segmentation.py -j svg1.gfa.og.w1.json -f svg1.gfa.og.fasta -o ./cs_svg1_issue_12052020

In the resulting JSON output, from e.g. line 182, I can observe the following:

                    "upstream": 4,
                    "downstream": 5,
                    "participants": [
                        true,
                        true,
                        false
                    ]

But if you take a look at the odgi bin output, with including -g option, only 2 links for participant number 3 aka 6 are listed. No links at all should be present for the first two paths. So I am wondering where these links should come from? Am I missing something? @josiahseaman

@josiahseaman
Copy link
Member

I agree this is a bug and an issue. I don't really have ideas for where this would come from. There's a number of places like cut_points or bins cutting Nodes in half. It could be a bad sort earlier. It's definitely high priority to resolve this.

@josiahseaman
Copy link
Member

This issue doesn't seem to crop up in the runB1phi1 test data. What is the " -g option, "?

@josiahseaman
Copy link
Member

Looks like -g removes both gap links (useless) as well as begin and end links to bin 0 (debatable but still useful info). The last departure column is the adjacent connectors. Importantly, this is based on the absence of evidence that a path uses a LinkColumn.

Currently output:

"departures": [
                {
                    "upstream": 4,
                    "downstream": 12,
                    "participants": [
                        false,
                        false,
                        true
                    ]
                },
                {
                    "upstream": 4,
                    "downstream": 5,
                    "participants": [
                        true,
                        true,
                        false
                    ]
                }
            ],

Based on the GFA:

S	1	AGGA
S	2	AGTG
S	3	TCC
S	4	TCTCAG
P	5	1+,3+,4+	8M,1M,1M
P	5-	1+,3-,4+	8M,1M,1M
P	6	1+,4+,2+	8M,1M,1M

We can change the Node labels to be less confusing (matching the bin id).

S	1	AGGA
S	5	AGTG
S	9	TCC
S	12	TCTCAG
P	"5"	1+,9+,12+
P	"5-"	1+,9-,12+
P	"6"	1+,12+,5+

Put this way we can see the links reflected in svg1.gfa.og.w1.g.json

Path without -g with -g
"5" [[0,1],[4,9],[17,0]] []
"5-" [[0,1],[4,9],[17,0]] []
"6" [[0,1],[4,12],[17,5],[8,0]] [[4,12],[17,5]]

[4,9] is skipped since they have no content in the next node. The "adjacent connector" is here because Nodes 5 and 9 were merged: "{"first_bin": 5,"last_bin": 11,"occupants": [true,true,true],".

After looking at this for a long time, I'm not confident that this is an error in the output. It looks correct to me. There definitely are errors in the ordering of SARS-CoV-2 Link Columns, but this file does not appear to have those same errors.

I've posted v15 sparse CS outputs in this drive link for reference: [link]

@mandosoft
Copy link
Contributor

Taking a look at this one

@josiahseaman
Copy link
Member

From Dmytro:
I know what happens in the tiny test.
There are three components: c1=[1,4], c2=[5,11], c3=[12,17]. The routine add_ajacent_connector_colum assumes they come exactly in this order c1-c2-c3 and adds the links to connect them: link1=[4,5], link2=[11,12]. But actually the component order is c1-c3-c2, and this is perfectly encoded by odgi and the main segmentation code: link1=[4,12], link2=[17,5].

I think add_ajacent_connector_colum is redundant for the component-component linking. The only useful place for this routine is to add the incoming/outgoing link to the very first/last component in the components natural order. (edited)

actually, this loop should do the job:

for comp in components:
    if len(comp.departures) == 0:  # no departures == add one link to the "outer space"
        add_ajacent_connector_colum(comp, None)
    if len(comp.arrivals) == 0:  # no arrivals == add one link from the "outer space"
        add_ajacent_connector_colum(None, comp)

with link participants == component path_ids

or some more careful work should be done to find the "loosen" paths and connect them to the "outer space".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants