Speed up connect_edges #725

ivirshup · 2019-03-12T02:17:42Z

Currently, connect_edges is pretty slow. This speeds it up a lot

Here's an example of the speed up

import numba
import pandas as pd
import numpy as np
from datashader.bundling import connect_edges

New implementation

@numba.njit
def _edge_val_to_segments(vals):
    edge_vals = np.full(vals.size * 3, np.nan, np.dtype("float"))
    for i in numba.prange(len(vals)):
        pos = int(i * 3)
        edge_vals[pos:pos+2] = vals[i]
    return edge_vals

@numba.njit
def _format_graph_segments(nodes, edges):
    edge_table = np.full((edges.shape[0] * 3, 2), np.nan)
    for i in numba.prange(edges.shape[0]):
        pos = int(i * 3)
        s, t = edges[i]
        edge_table[pos  , 0] = nodes[s, 0]
        edge_table[pos  , 1] = nodes[s, 1]
        edge_table[pos+1, 0] = nodes[t, 0]
        edge_table[pos+1, 1] = nodes[t, 1]
    return edge_table

def _connect_edges(
    nodes,
    edges,
    source="source",
    target="target",
    weight=None,
    x="x",
    y="y",
    include_edge_id=False
):
    e = pd.DataFrame(
        _format_graph_segments(
            nodes[[x, y]].values,
            edges[[source, target]].values),
        columns=[x, y]
    )
    if weight is not None:
        e[weight] = _edge_val_to_segments(edges[weight].values)
    if include_edge_id:
        e["edge_id"] = _edge_val_to_segments(edges["id"])
    return e

# Data gen
def gen_graph(nnodes=100000, nedges=1000000):
    nodes = pd.DataFrame(
        np.random.random((nnodes, 2)),
        columns=list("xy")
    )
    edges = pd.DataFrame(
        np.random.randint(0, nnodes, size=(nedges, 2), dtype=int),
        columns=["source", "target"]
    )
    return nodes, edges

nodes, edges = gen_graph()

# Running:

 # Jit warmup
new = _connect_edges(nodes, edges)
orig = connect_edges(nodes, edges)
assert np.allclose(new, orig, equal_nan=True)


%timeit _connect_edges(nodes, edges)
# 58.5 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%time connect_edges(nodes, edges)
# CPU times: user 53.2 s, sys: 630 ms, total: 53.9 s
# Wall time: 54 s

I've labelled this as a work in progress for a few reasons:

I currently can't run all the tests locally (Trouble running tests with provided conda env #724), so opening this pull request will let me see if anything else fails
This could be sped up even more with njit(parallel=True) or njit(cache=True), but I don't see those being used a lot in this codebase. Should I set one of those? Is there a preference for one over the other?
test_directly_connect_with_weights[True] and test_directly_connect_without_weights[True] under test_bundling.py are failing, but the tests might be wrong. Here's what I mean:

Given an edge data frame that look like this:

   id  source  target  weight
0   0       0       1     1.0
1   1       0       2     1.0
2   2       0       3     1.0
3   3       0       4     1.0

I would think the output should have edge ids [0, 1, 2, 3]. However, the test says they should have ids [1, 2, 3, 4] (link). It looks to me like the edge_id is being set by id of the target node because the node dataframe that's passed also has an id field.

jbednar · 2019-03-12T21:57:50Z

Thanks for the contribution! For the WIP items:

Hopefully if you use pytest ==3.9.3 and pytest-benchmark ==3.0.0 you can run the tests locally, or you can look at the results from CI above.
The datashader codebase was written for an early version of Numba, and while we've updated some parts, there are surely areas where we can make use of newer Numba features. So feel free to use anything now available in Numba!
ids starting at 0 look correct to me, but I haven't studied the code. Maybe @philippjfr remembers the details here and can comment?

philippjfr · 2019-03-13T00:05:27Z

Maybe @philippjfr remembers the details here and can comment?

Don't think I've ever looked at the internals of that code although we do expose the functionality in holoviews.

ivirshup · 2019-03-13T03:11:00Z

@philippjfr could you point out to me where that's exposed? I'd like to see an example of this argument being used.

I found the holoviews function connect_edges_pd, which does what I had previously assumed connect_edges(..., include_edge_id=True) would do: always making the index of the edge dataframe the edge ids, instead of assuming there would be an id field.

ivirshup · 2019-03-15T12:57:58Z

While CI and installation stuff gets worked out, do I have the devs blessing to change the edge id tests?

jbednar · 2019-03-20T04:52:13Z

I wish I could give you that blessing, but I won't have a chance to look at this code until Tuesday to see whether it was the code or the tests that was wrong before. Sorry about that!

ivirshup · 2019-04-01T12:33:52Z

@jbednar, any chance you've been able to look at those tests?

jbednar · 2019-04-03T17:28:12Z

Sorry for the delay! I did look at them, but it's a mess! it looks like different decisions were made between HoloViews and Datashader about what to do when there is an id field available, and we haven't yet worked out what to do about that, how to maintain backwards compatibility , and how to deal with it. I had hoped we could address this before our release today, but other deadlines prevented us from fully working that out, so we have had to postpone it again to next week, after the release. If we find a problem we'll fix it and make a new release, merging your improvements either way. So we haven't forgotten about this; it's just delayed more...

ivirshup · 2019-04-05T01:15:37Z

No problem! Yeah, the current implementation was much more complicated than I was expecting.

I would note the current datashader function does return faulty values if the data doesn't match up exactly with the expectations.

For example, include_edge_id=True without having an id field messes up the layout.

import pandas as pd
import numpy as np
import datashader as ds
from datashader.layout import circular_layout
from datashader.bundling import connect_edges

n, m = 100, 1000

nodes = pd.DataFrame(["node"+str(i) for i in range(n)], columns=['name'])
edges = pd.DataFrame(np.random.randint(0,len(nodes), size=(m, 2)),
                     columns=['source', 'target'])

circular = circular_layout(nodes, uniform=False)

print(connect_edges(circular, edges).head())
print(connect_edges(circular, edges, include_edge_id=True).head())
#           x         y
# 0  0.149448  0.143471
# 1  0.037398  0.310265
# 2       NaN       NaN
# 3  0.999698  0.517384
# 4  0.002216  0.452981
#     edge_id         x              y
# 0  0.149448  0.143471   3.739811e-02
# 1  0.149448  0.310265  4.594811e-322
# 2       NaN       NaN            NaN
# 3  0.999698  0.517384   2.215678e-03
# 4  0.999698  0.452981  1.185758e-322

Additionally, it doesn't throw an error if you don't provide information it needs (like node positions).

I would've opened an issue, but figured this PR fixes it anyways.

jlstevens · 2020-01-28T16:37:08Z

I reran the Appveyor tests on this PR to try to see what is failing. Some old failures have gone away (presumably issues fixed on master) but the tests that are failing do seem to relate to this PR. In particulartest_directly_connect_with_weights seems to be failing due to unexpected results.

It is odd that this issue only appears on Windows but on that platform at least, either the unit test definition or the data generated using the code in this PR is incorrect. That is a shame as I would love to see this PR merged: those are some very nice speedups!

jlstevens · 2020-01-28T17:06:25Z

Actually, I'm not convinced that unit test is being on travis which would explain why only appveyor is failing right now!

ivirshup · 2020-01-29T03:29:41Z

IIRC, different environments ran different sets of tests.

There was some discussion of this in #724, but I'm not sure if the issue has been solved since.

maximlt · 2021-11-29T16:38:56Z

Sorry for the long delay, we're interested in this work, could you push an empty commit or a trivial change to get the test suite to run again? Thanks !

ivirshup · 2021-12-10T20:40:43Z

Sorry for the long delay, we're interested in this work

Ha, no worries. I may not be able to promise much time to this now, but took a quick look over the code again.

Looking at the CI, it seems like those tests still need to be modified. It's clearly wrong that there are edges with ids [0, 1, 2, 3] in the input, but edge_ids [1, 2, 3, 4] are referenced in the output, right?

ianthomas23 · 2023-09-06T14:59:23Z

@ivirshup I believe that you will be attending the NumFOCUS Project Summit next week so perhaps we can have a chat about this work then?

ivirshup · 2023-09-06T16:18:42Z

Sure. Did we get a list of people who are going to be there?

ianthomas23 · 2023-09-06T16:28:44Z

There is a summit-2023 channel in the NumFOCUS slack and I am assuming that everyone in the channel will be attending.

Speed up connect_edges

7fa33cb

ivirshup force-pushed the speedup_connectedges branch from 1979185 to 7fa33cb Compare March 12, 2019 03:36

ivirshup mentioned this pull request Mar 13, 2019

Trouble running tests with provided conda env #724

Open

jbednar assigned jlstevens Jan 27, 2020

Empty commit to trigger CI

68386f3

jbednar added this to the v0.14.2 milestone Jul 19, 2022

jbednar requested a review from ianthomas23 July 19, 2022 18:10

jbednar assigned ianthomas23 and unassigned jlstevens Jul 19, 2022

ianthomas23 modified the milestones: v0.14.2, v0.14.3 Aug 10, 2022

ivirshup changed the title ~~[WIP] Speed up connect_edges~~ Speed up connect_edges Sep 6, 2023

ivirshup added 2 commits September 6, 2023 14:44

Merge branch 'main' into speedup_connectedges

83150ac

flake stuff

8bac622

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed up connect_edges #725

Speed up connect_edges #725

ivirshup commented Mar 12, 2019 •

edited by jbednar

Loading

jbednar commented Mar 12, 2019

philippjfr commented Mar 13, 2019

ivirshup commented Mar 13, 2019

ivirshup commented Mar 15, 2019

jbednar commented Mar 20, 2019

ivirshup commented Apr 1, 2019

jbednar commented Apr 3, 2019

ivirshup commented Apr 5, 2019 •

edited

Loading

jlstevens commented Jan 28, 2020

jlstevens commented Jan 28, 2020

ivirshup commented Jan 29, 2020

maximlt commented Nov 29, 2021

ivirshup commented Dec 10, 2021

ianthomas23 commented Sep 6, 2023

ivirshup commented Sep 6, 2023

ianthomas23 commented Sep 6, 2023

Speed up connect_edges #725

Are you sure you want to change the base?

Speed up connect_edges #725

Conversation

ivirshup commented Mar 12, 2019 • edited by jbednar Loading

jbednar commented Mar 12, 2019

philippjfr commented Mar 13, 2019

ivirshup commented Mar 13, 2019

ivirshup commented Mar 15, 2019

jbednar commented Mar 20, 2019

ivirshup commented Apr 1, 2019

jbednar commented Apr 3, 2019

ivirshup commented Apr 5, 2019 • edited Loading

jlstevens commented Jan 28, 2020

jlstevens commented Jan 28, 2020

ivirshup commented Jan 29, 2020

maximlt commented Nov 29, 2021

ivirshup commented Dec 10, 2021

ianthomas23 commented Sep 6, 2023

ivirshup commented Sep 6, 2023

ianthomas23 commented Sep 6, 2023

ivirshup commented Mar 12, 2019 •

edited by jbednar

Loading

ivirshup commented Apr 5, 2019 •

edited

Loading