Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up connect_edges #725

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

ivirshup
Copy link

@ivirshup ivirshup commented Mar 12, 2019

Currently, connect_edges is pretty slow. This speeds it up a lot

Here's an example of the speed up

import numba
import pandas as pd
import numpy as np
from datashader.bundling import connect_edges
New implementation
@numba.njit
def _edge_val_to_segments(vals):
    edge_vals = np.full(vals.size * 3, np.nan, np.dtype("float"))
    for i in numba.prange(len(vals)):
        pos = int(i * 3)
        edge_vals[pos:pos+2] = vals[i]
    return edge_vals

@numba.njit
def _format_graph_segments(nodes, edges):
    edge_table = np.full((edges.shape[0] * 3, 2), np.nan)
    for i in numba.prange(edges.shape[0]):
        pos = int(i * 3)
        s, t = edges[i]
        edge_table[pos  , 0] = nodes[s, 0]
        edge_table[pos  , 1] = nodes[s, 1]
        edge_table[pos+1, 0] = nodes[t, 0]
        edge_table[pos+1, 1] = nodes[t, 1]
    return edge_table

def _connect_edges(
    nodes,
    edges,
    source="source",
    target="target",
    weight=None,
    x="x",
    y="y",
    include_edge_id=False
):
    e = pd.DataFrame(
        _format_graph_segments(
            nodes[[x, y]].values,
            edges[[source, target]].values),
        columns=[x, y]
    )
    if weight is not None:
        e[weight] = _edge_val_to_segments(edges[weight].values)
    if include_edge_id:
        e["edge_id"] = _edge_val_to_segments(edges["id"])
    return e
# Data gen
def gen_graph(nnodes=100000, nedges=1000000):
    nodes = pd.DataFrame(
        np.random.random((nnodes, 2)),
        columns=list("xy")
    )
    edges = pd.DataFrame(
        np.random.randint(0, nnodes, size=(nedges, 2), dtype=int),
        columns=["source", "target"]
    )
    return nodes, edges

nodes, edges = gen_graph()

# Running:

 # Jit warmup
new = _connect_edges(nodes, edges)
orig = connect_edges(nodes, edges)
assert np.allclose(new, orig, equal_nan=True)


%timeit _connect_edges(nodes, edges)
# 58.5 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%time connect_edges(nodes, edges)
# CPU times: user 53.2 s, sys: 630 ms, total: 53.9 s
# Wall time: 54 s

I've labelled this as a work in progress for a few reasons:

  • I currently can't run all the tests locally (Trouble running tests with provided conda env #724), so opening this pull request will let me see if anything else fails
  • This could be sped up even more with njit(parallel=True) or njit(cache=True), but I don't see those being used a lot in this codebase. Should I set one of those? Is there a preference for one over the other?
  • test_directly_connect_with_weights[True] and test_directly_connect_without_weights[True] under test_bundling.py are failing, but the tests might be wrong. Here's what I mean:

Given an edge data frame that look like this:

   id  source  target  weight
0   0       0       1     1.0
1   1       0       2     1.0
2   2       0       3     1.0
3   3       0       4     1.0

I would think the output should have edge ids [0, 1, 2, 3]. However, the test says they should have ids [1, 2, 3, 4] (link). It looks to me like the edge_id is being set by id of the target node because the node dataframe that's passed also has an id field.

@jbednar
Copy link
Member

jbednar commented Mar 12, 2019

Thanks for the contribution! For the WIP items:

  • Hopefully if you use pytest ==3.9.3 and pytest-benchmark ==3.0.0 you can run the tests locally, or you can look at the results from CI above.
  • The datashader codebase was written for an early version of Numba, and while we've updated some parts, there are surely areas where we can make use of newer Numba features. So feel free to use anything now available in Numba!
  • ids starting at 0 look correct to me, but I haven't studied the code. Maybe @philippjfr remembers the details here and can comment?

@philippjfr
Copy link
Member

Maybe @philippjfr remembers the details here and can comment?

Don't think I've ever looked at the internals of that code although we do expose the functionality in holoviews.

@ivirshup
Copy link
Author

@philippjfr could you point out to me where that's exposed? I'd like to see an example of this argument being used.

I found the holoviews function connect_edges_pd, which does what I had previously assumed connect_edges(..., include_edge_id=True) would do: always making the index of the edge dataframe the edge ids, instead of assuming there would be an id field.

@ivirshup
Copy link
Author

While CI and installation stuff gets worked out, do I have the devs blessing to change the edge id tests?

@jbednar
Copy link
Member

jbednar commented Mar 20, 2019

I wish I could give you that blessing, but I won't have a chance to look at this code until Tuesday to see whether it was the code or the tests that was wrong before. Sorry about that!

@ivirshup
Copy link
Author

ivirshup commented Apr 1, 2019

@jbednar, any chance you've been able to look at those tests?

@jbednar
Copy link
Member

jbednar commented Apr 3, 2019

Sorry for the delay! I did look at them, but it's a mess! it looks like different decisions were made between HoloViews and Datashader about what to do when there is an id field available, and we haven't yet worked out what to do about that, how to maintain backwards compatibility , and how to deal with it. I had hoped we could address this before our release today, but other deadlines prevented us from fully working that out, so we have had to postpone it again to next week, after the release. If we find a problem we'll fix it and make a new release, merging your improvements either way. So we haven't forgotten about this; it's just delayed more...

@ivirshup
Copy link
Author

ivirshup commented Apr 5, 2019

No problem! Yeah, the current implementation was much more complicated than I was expecting.

I would note the current datashader function does return faulty values if the data doesn't match up exactly with the expectations.

For example, include_edge_id=True without having an id field messes up the layout.

import pandas as pd
import numpy as np
import datashader as ds
from datashader.layout import circular_layout
from datashader.bundling import connect_edges

n, m = 100, 1000

nodes = pd.DataFrame(["node"+str(i) for i in range(n)], columns=['name'])
edges = pd.DataFrame(np.random.randint(0,len(nodes), size=(m, 2)),
                     columns=['source', 'target'])

circular = circular_layout(nodes, uniform=False)

print(connect_edges(circular, edges).head())
print(connect_edges(circular, edges, include_edge_id=True).head())
#           x         y
# 0  0.149448  0.143471
# 1  0.037398  0.310265
# 2       NaN       NaN
# 3  0.999698  0.517384
# 4  0.002216  0.452981
#     edge_id         x              y
# 0  0.149448  0.143471   3.739811e-02
# 1  0.149448  0.310265  4.594811e-322
# 2       NaN       NaN            NaN
# 3  0.999698  0.517384   2.215678e-03
# 4  0.999698  0.452981  1.185758e-322

Additionally, it doesn't throw an error if you don't provide information it needs (like node positions).

I would've opened an issue, but figured this PR fixes it anyways.

@jlstevens
Copy link
Collaborator

I reran the Appveyor tests on this PR to try to see what is failing. Some old failures have gone away (presumably issues fixed on master) but the tests that are failing do seem to relate to this PR. In particulartest_directly_connect_with_weights seems to be failing due to unexpected results.

It is odd that this issue only appears on Windows but on that platform at least, either the unit test definition or the data generated using the code in this PR is incorrect. That is a shame as I would love to see this PR merged: those are some very nice speedups!

@jlstevens
Copy link
Collaborator

Actually, I'm not convinced that unit test is being on travis which would explain why only appveyor is failing right now!

@ivirshup
Copy link
Author

IIRC, different environments ran different sets of tests.

There was some discussion of this in #724, but I'm not sure if the issue has been solved since.

@maximlt
Copy link
Member

maximlt commented Nov 29, 2021

Sorry for the long delay, we're interested in this work, could you push an empty commit or a trivial change to get the test suite to run again? Thanks !

@ivirshup
Copy link
Author

Sorry for the long delay, we're interested in this work

Ha, no worries. I may not be able to promise much time to this now, but took a quick look over the code again.

Looking at the CI, it seems like those tests still need to be modified. It's clearly wrong that there are edges with ids [0, 1, 2, 3] in the input, but edge_ids [1, 2, 3, 4] are referenced in the output, right?

@jbednar jbednar added this to the v0.14.2 milestone Jul 19, 2022
@jbednar jbednar requested a review from ianthomas23 July 19, 2022 18:10
@jbednar jbednar assigned ianthomas23 and unassigned jlstevens Jul 19, 2022
@ianthomas23 ianthomas23 modified the milestones: v0.14.2, v0.14.3 Aug 10, 2022
@ivirshup ivirshup changed the title [WIP] Speed up connect_edges Speed up connect_edges Sep 6, 2023
@ianthomas23
Copy link
Member

@ivirshup I believe that you will be attending the NumFOCUS Project Summit next week so perhaps we can have a chat about this work then?

@ivirshup
Copy link
Author

ivirshup commented Sep 6, 2023

Sure. Did we get a list of people who are going to be there?

@ianthomas23
Copy link
Member

There is a summit-2023 channel in the NumFOCUS slack and I am assuming that everyone in the channel will be attending.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants