Visualise Pipeline objects #1993

Open
astrojuanlu opened this issue Jul 21, 2024 · 20 comments · May be fixed by #2241

Comments

@astrojuanlu
Member

AS A Kedro user
I WANT TO visualise Pipeline objects directly in notebooks
SO THAT

  1. I don't need the full Kedro Framework structure (a requirement for %run_viz)
  2. I can interactively visualise Pipeline objects while I am creating them

Originally #1459, extra context in #1833 (comment) reproduced below:

I am showcasing Kedro concepts on a notebook without creating a full-fledged project. Took https://github.com/ibis-project/kedro-ibis-tutorial/blob/main/03%20-%20First%20Steps%20with%20Kedro.ipynb as inspiration, and adapted it to Spark and Databricks (will try to publish that soon).

However, since there is no Kedro Framework project, there is no way I can visualise my pipelines, even though I have a Pipeline object perfectly defined:

image

It would be insanely awesome if I could do KedroViz().visualize(pipe).show() or something like that, without ever needing to set-up a Kedro project.

@yury-fedotov
Contributor

@astrojuanlu interesting use case. Have you seen a lot that users define pipelines in notebooks or import them to there?

I thought vast majority of notebook usage is to do catalog.load("something") and then some EDA. While all pipeline definition is in .py files.

@astrojuanlu
Member Author

Have you seen a lot that users define pipelines in notebooks

I have not, and probably the reason is that traditionally Kedro had taken sort of an anti-notebook stance. We evolved that in 2023, for example by writing https://docs.kedro.org/en/stable/notebooks_and_ipython/notebook-example/add_kedro_to_a_notebook.html

I've personally found it very handy to explain things to data scientists with notebooks when teaching. See for example https://github.com/ibis-project/kedro-ibis-tutorial/blob/main/03%20-%20First%20Steps%20with%20Kedro.ipynb, recording (very well received) or https://github.com/astrojuanlu/kedro-databricks-demo/blob/main/First%20Steps%20with%20Kedro%20on%20Databricks.ipynb (essentially the same thing, but with a ManagedTableDataset connecting to DBX UC). Being able to visualise the pipelines there directly would be awesome I think.

or import them to there?

We launched a feature earlier this year to do something like that https://docs.kedro.org/en/stable/notebooks_and_ipython/kedro_and_notebooks.html#load-node-line-magic it's for nodes rather than full pipelines though.

I thought vast majority of notebook usage is to do catalog.load("something") and then some EDA.

That's our impression too yes (and in fact I do that all the time). So this issue would be about taking that one little step further.

@rashidakanchwala rashidakanchwala moved this from Backlog to Inbox in Kedro-Viz Jul 29, 2024
@astrojuanlu
Member Author

A user just asked about this.

@astrojuanlu changed the title from "Visualise Pipeline objects directly in notebooks" to "Visualise Pipeline objects" on Sep 6, 2024
@astrojuanlu
Member Author

(And it had nothing to do with notebooks)

@KikiCS

KikiCS commented Sep 6, 2024

Hello, I'm adding some context for my use case after sending a message on Slack.
Kedro viz diagrams are very useful for non-technical people who want a high-level view of the data pipeline.
While documenting models in my company's internal Notion, I thought including a kedro viz diagram would be super useful, as would generating a new one every time a change to the pipeline is released.
I got the idea when I saw that Notion renders diagrams written in Mermaid, but I don't know and haven't checked whether kedro viz is based on Mermaid under the hood.
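
As a rough illustration of the Mermaid idea (not something Kedro-Viz is known to do under the hood), a pipeline described as plain tuples could be turned into Mermaid flowchart text that Notion can render. The `to_mermaid` helper and the node tuples below are hypothetical, and no Kedro APIs are used:

```python
def to_mermaid(nodes):
    """Render (name, inputs, outputs) tuples as Mermaid flowchart text.

    Hypothetical helper for illustration only; not part of Kedro or Kedro-Viz.
    """
    lines = ["flowchart LR"]
    for name, inputs, outputs in nodes:
        for dataset in inputs:
            # Datasets are drawn as rounded boxes, nodes as rectangles.
            lines.append(f"    {dataset}([{dataset}]) --> {name}[{name}]")
        for dataset in outputs:
            lines.append(f"    {name}[{name}] --> {dataset}([{dataset}])")
    return "\n".join(lines)


# Made-up two-node pipeline, described without any Kedro objects.
example_nodes = [
    ("preprocess", ["raw_data"], ["clean_data"]),
    ("train", ["clean_data"], ["model"]),
]
mermaid_text = to_mermaid(example_nodes)
print(mermaid_text)
```

The resulting text could be pasted into a Mermaid code block in Notion; regenerating it on every release would be a matter of running the same function in CI.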

@rashidakanchwala rashidakanchwala moved this from Inbox to Backlog in Kedro-Viz Sep 9, 2024
@astrojuanlu
Member Author

Prior art: #1668 (comment)

@rashidakanchwala rashidakanchwala moved this from Backlog to Todo in Kedro-Viz Jan 13, 2025
@astrojuanlu astrojuanlu moved this from Todo to In Progress in Kedro-Viz Jan 13, 2025
@ravi-kumar-pilla
Contributor

Hi @astrojuanlu ,

I did an experimental implementation and it seems to be feasible 💯. I haven't tested complex parts yet, but simple pipelines seem achievable with some limitations. I will do some more testing before documenting the limitations.

Thank you

image

@ravi-kumar-pilla ravi-kumar-pilla linked a pull request Jan 15, 2025 that will close this issue
5 tasks
@astrojuanlu
Member Author

Fantastic @ravi-kumar-pilla ! So #2241 basically launches a Viz server and then embeds that as an iframe, right?

Do you think it's feasible to do this using only the frontend React component, without a server? To reduce overhead and have better control of what's presented. For example, it would be nice if the left toolbar, the node filter area, and the other toolbar weren't even displayed.

@ravi-kumar-pilla
Contributor

Fantastic @ravi-kumar-pilla ! So #2241 basically launches a Viz server and then embeds that as an iframe, right?

You are right.

Do you think it's feasible to do this using only the frontend React component, without a server? To reduce overhead and have better control of what's presented. For example, it would be nice if the left toolbar, the node filter area, and the other toolbar weren't even displayed.

Yes, I am thinking about this as well. We can either inject a config header and hide parts of viz or, as you said, go entirely with the React component. I am exploring this too and will post an update. Thank you

@ravi-kumar-pilla
Contributor

Hi @astrojuanlu ,

I tried using KedroViz directly in HTML but we do not bundle KedroViz to be used directly via a CDN link (or I could not find a way to use the package that way). I tried locally and there seems to be some compatibility issues. I reached out to @Huongg and she will have a look at the issue. For now, I tried adding configuration on top of starting the server. This would be a first attempt at the feature; we can improve performance at later stages.

If the bundling approach takes time, I would suggest we go with the run_server approach and give users the ability to configure what they see in viz (for example, hiding everything except the flowchart view could be the default). I have a PR which implements that (it needs some polishing but works well). Let me know what you think.

Screenshot after configuring only flowchart view :

Image

cc: @rashidakanchwala

Thank you

@astrojuanlu
Member Author

Thanks a lot @ravi-kumar-pilla !

I tried using KedroViz directly in HTML but we do not bundle KedroViz to be used directly via a CDN link (or I could not find a way to use the package that way).

Indeed, I don't see a bundled version in https://cdn.jsdelivr.net/npm/@quantumblack/kedro-viz/ What would be the cost of doing it?

I tried locally and there seems to be some compatibility issues.

Could you describe them a bit more?


I know the UI would look the same in either case, but the DX is probably going to be much better if we avoid the server. A server needs to allocate a port, needs the proper Python dependencies installed, etc. I think we need to continue exploring the feasibility of a JS-only solution.

@astrojuanlu
Member Author

astrojuanlu commented Jan 23, 2025

Yesterday we briefly discussed this.

@ravi-kumar-pilla clarified that with the current proposal (#2241), even if we do use only the frontend, the user would still need to install Kedro Viz anyway.

Logging my current understanding of the situation:

From https://github.com/kedro-org/kedro-viz-standalone/blob/main/src/App.js, all that's needed is to go from a kedro.pipeline.Pipeline object to a JSON representation that resembles https://github.com/kedro-org/kedro-viz/blob/e418ecd/src/utils/data/spaceflights.mock.json

However, this is easier said than done. For starters, I couldn't find a schema that defines what properties are expected in that JSON, although they can be derived from other places. The constructor suggests that there are 4 expected ones, of which only edges and nodes are marked as required

PropTypes.shape({
  edges: PropTypes.array.isRequired,
  layers: PropTypes.array,
  nodes: PropTypes.array.isRequired,
  tags: PropTypes.array,

but actually the response returned by the API has a few more

nodes: List[NodeAPIResponse]
edges: List[GraphEdgeAPIResponse]
layers: List[str]
tags: List[NamedEntityAPIResponse]
pipelines: List[NamedEntityAPIResponse]
modular_pipelines: ModularPipelinesTreeAPIResponse
selected_pipeline: str

(this is a Pydantic model)

This, in turn, is generated here

def get_pipeline_response(
    pipeline_id: Union[str, None] = None,
) -> Union[GraphAPIResponse, JSONResponse]:
    """API response for `/api/pipelines/pipeline_id`."""
    if pipeline_id is None:
        pipeline_id = data_access_manager.get_default_selected_pipeline().id
    if not data_access_manager.registered_pipelines.has_pipeline(pipeline_id):
        return JSONResponse(status_code=404, content={"message": "Invalid pipeline ID"})
    modular_pipelines_tree = (
        data_access_manager.create_modular_pipelines_tree_for_registered_pipeline(
            pipeline_id
        )
    )
    return GraphAPIResponse(
        nodes=data_access_manager.get_nodes_for_registered_pipeline(pipeline_id),
        edges=data_access_manager.get_edges_for_registered_pipeline(pipeline_id),
        tags=data_access_manager.tags.as_list(),
        layers=data_access_manager.get_sorted_layers_for_registered_pipeline(
            pipeline_id
        ),
        pipelines=data_access_manager.registered_pipelines.as_list(),
        modular_pipelines=modular_pipelines_tree,
        selected_pipeline=pipeline_id,
    )

which gets populated here

def add_pipeline(self, registered_pipeline_id: str, pipeline: KedroPipeline):

In other words: the logic to transform a Python pipeline into the expected JSON structure is complex as it stands now.
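
To make the shape of the problem concrete, here is a minimal sketch that builds a dictionary with the four keys from the PropTypes above (`nodes`, `edges`, `layers`, `tags`) out of plain task descriptions. The `build_graph` helper, the `TaskSpec` tuple, and the fields inside each entry (`id`, `name`, `type`, `source`, `target`) are assumptions for illustration; the real Kedro-Viz payload carries more information and is produced by the data access manager code quoted above.

```python
import json
from typing import NamedTuple


class TaskSpec(NamedTuple):
    """Hypothetical stand-in for a pipeline node: a name plus dataset names."""
    name: str
    inputs: list
    outputs: list
    tags: tuple = ()


def build_graph(tasks):
    """Return a dict with nodes, edges, layers and tags (assumed entry fields)."""
    nodes, edges, seen = [], [], set()

    def add_node(node_id, node_type):
        # Each task or dataset appears once, even if referenced many times.
        if node_id not in seen:
            seen.add(node_id)
            nodes.append({"id": node_id, "name": node_id, "type": node_type})

    for task in tasks:
        add_node(task.name, "task")
        for dataset in task.inputs:
            add_node(dataset, "data")
            edges.append({"source": dataset, "target": task.name})
        for dataset in task.outputs:
            add_node(dataset, "data")
            edges.append({"source": task.name, "target": dataset})

    tags = sorted({tag for task in tasks for tag in task.tags})
    return {
        "nodes": nodes,
        "edges": edges,
        "layers": [],  # no layer information in this sketch
        "tags": [{"id": tag, "name": tag} for tag in tags],
    }


graph = build_graph([
    TaskSpec("preprocess", ["raw_data"], ["clean_data"], ("prep",)),
    TaskSpec("train", ["clean_data"], ["model"]),
])
print(json.dumps(graph, indent=2))
```

Even this toy version has to deduplicate datasets shared between tasks; modular pipelines, layers, and parameters are where the real complexity comes from.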

I think this is taking me again to kedro-org/kedro#4363, which has several use cases, possibly including this one.

In the meantime, as part of the spike @ravi-kumar-pilla could you keep exploring the bundling issues just in case? And describe what you've found in the meantime.

@ravi-kumar-pilla
Contributor

Hi @astrojuanlu ,

Thank you for the comment. You are 💯 correct on -

In other words: the logic to transform a Python pipeline into the expected JSON structure is complex as it stands now.

In the meantime, as part of the spike @ravi-kumar-pilla could you keep exploring the bundling issues just in case? And describe what you've found in the meantime.

Regarding the bundling, I have resolved the issue and the bundle can be used directly in html. However, the generated html works well in the browser but has issues in Jupyter notebooks. I will try fixing them today (some are around the window object, which differs from the browser's window).

Once this is done, I will try to see what the bare minimum requirements are to get this working. As you mentioned in the comment, since we generate the json via Kedro-Viz, we need to install kedro-viz, though ideally without relying on the complex viz backend.

On a side note, if we could use to_json() of kedro to generate the pipeline json and the frontend could interpret it, the viz experience in Jupyter notebooks would be much better.

Thank you

@ravi-kumar-pilla
Contributor

Hi @astrojuanlu ,

I am able to display KedroViz inside a notebook using the bundle approach (pending additional testing); a demo implementation is available in the PR.

Documenting the current approach and local testing methodology for reference.

Current Approach:

  • In KedroViz we will have a class KedroVizNotebook which exposes an api/method visualize. Below is the method definition -
    # [TODO: will add options to display certain parts of viz, for now the default is only chart view.
    # We can also add more customization if needed]
    def visualize(self, pipeline: Pipeline, catalog: DataCatalog = None, embed_in_notebook=True):
  • Internally we load and populate our kedroViz backend repositories
  • Instantiate a dummy catalog (i.e., all datasets are of type MemoryDataset)
  • Get the json required for the html
  • Inject the json data to the html template and save the html to a file .viz/viz_jupyter_exploration.html (filename is configurable)
  • Display an iframe pointing to the saved html file
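
The last three steps above could be sketched roughly as follows, using only the standard library. The template markup, the `KEDRO_VIZ_DATA` global, and the `write_viz_html` and `iframe_markup` helpers are made up for illustration and are not taken from the PR; only the default output path mirrors the one mentioned above.

```python
import json
import tempfile
from pathlib import Path
from string import Template

# Hypothetical template; the real one would also load the Kedro-Viz bundle.
HTML_TEMPLATE = Template("""<!DOCTYPE html>
<html>
  <body>
    <div id="root"></div>
    <script>window.KEDRO_VIZ_DATA = $graph_json;</script>
  </body>
</html>
""")


def write_viz_html(graph, path=".viz/viz_jupyter_exploration.html"):
    """Inject the graph as JSON into the template and save it to `path`."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Escape "</" so dataset names cannot close the <script> tag early.
    graph_json = json.dumps(graph).replace("</", "<\\/")
    target.write_text(HTML_TEMPLATE.substitute(graph_json=graph_json), encoding="utf-8")
    return target


def iframe_markup(path, width="100%", height="600"):
    """Return the <iframe> snippet a notebook cell could display."""
    return f'<iframe src="{Path(path).as_posix()}" width="{width}" height="{height}"></iframe>'


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    saved = write_viz_html({"nodes": [], "edges": []}, Path(tmp) / ".viz" / "demo.html")
    html_text = saved.read_text(encoding="utf-8")

print(iframe_markup(".viz/viz_jupyter_exploration.html"))
```

In a notebook, the returned markup would be passed to something like `IPython.display.HTML`; whether Jupyter can actually serve the file depends on the hidden-folder configuration discussed below.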

Pre-requisites:

  • Installations required: Kedro, Kedro-Viz, jupyter notebook
  • To test locally we need to create a bundle using webpack. The PR has the webpack config. From root dir of kedroViz, execute
# Assuming you have webpack from package.json of kedroViz. 
# If not already installed do npm install webpack

npx webpack --mode development

# This will create a viz bundle. The bundle needs to be served for now as it is not published
# Use a local server for publishing. Navigate to the bundle folder `/dist` and run

python -m http.server 8000

# Make sure http://localhost:8000/kedroViz.bundle.js is accessible
  • Custom jupyter config (Needs discussion)
jupyter notebook --generate-config

# Go to config file path and add at end of the file
c.ContentsManager.allow_hidden = True

NOTE: The html content is currently saved to a file which is placed under .viz folder. This needs jupyter config to be updated, as by default the notebook cannot access hidden folders.

Testing Current Approach:

# In case of demo_project
cd demo_project
kedro jupyter notebook

# Run each cell present in demo-project/viz_jupyter_test.ipynb of the PR or 
# Instantiate your pipeline and execute below code in the jupyter cell
from kedro_viz.launchers.experimental_viz import KedroVizNotebook
KedroVizNotebook().visualize(pipe)

Other approaches:
a. Using KedroViz run_server with the pipeline information. This requires starting a process that runs a uvicorn server, serving a FastAPI app specifically for notebook users. There might be some delay while the process starts and uvicorn gets running.
b. Using the html content text directly, i.e., display(HTML(html_text)). I faced issues with this approach (blank cell, window object not recognized, etc.). If anyone has experience with this, it would be great to explore, as we would not be creating an extra file.
c. Creating a temp file which gets deleted after the Jupyter session. Somehow the notebook could not access these files; again, I am not sure whether I was missing some config.

Questions:
i. Can we have the html file generated in the user's cwd, where they launched the notebook?
ii. If not (a), is it fine to ask the user to update the Jupyter config to allow hidden file discovery, so that we always save it to .viz?
iii. Do you know a way to directly display html without saving the file?

Some questions related to testing and expectations from MVP or first draft:

iv. Since this was a spike, I did not test complex pipelines but I hope the approach works. Can we assume testing the demo_project size pipeline a success for first draft ?
v. Could you please let me know which other environments (e.g., Databricks) to test this on?
vi. For the first draft, what are the expectations in case the above approach works?

Next steps:

  1. There is an unpolished implementation in the PR, which needs modifications based on the discussion outcome here
  2. Publish the kedro-viz.bundle.js to npm which can be referred via CDN directly in the html text.

cc: @rashidakanchwala

Thank you

@astrojuanlu
Member Author

Thanks a lot for the update @ravi-kumar-pilla !

b. Using the html content text directly, i.e., display(HTML(html_text)). I faced issues with this approach (blank cell, window object not recognized, etc.). If anyone has experience with this, it would be great to explore, as we would not be creating an extra file.

That's what I had in mind (this or the _repr_html_ method). Definitely it would be preferable to not create an extra file (which, IIUC, requires the custom Jupyter config)
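
For reference, a minimal sketch of the `_repr_html_` route: IPython's rich display calls this method when an object is the result of a cell (or is passed to `display()`) and renders the returned string, so no file has to be written. The `PipelinePreview` class and its markup are hypothetical, and the sketch only renders a static table rather than the Kedro-Viz bundle, which is where the window-object issues mentioned above would still need solving.

```python
import html


class PipelinePreview:
    """Hypothetical wrapper that renders itself as HTML in a notebook cell."""

    def __init__(self, edges):
        # edges: iterable of (source, target) name pairs
        self.edges = list(edges)

    def _repr_html_(self):
        # IPython's rich display protocol picks this method up automatically.
        rows = "".join(
            f"<tr><td>{html.escape(src)}</td><td>{html.escape(dst)}</td></tr>"
            for src, dst in self.edges
        )
        return (
            "<table><thead><tr><th>source</th><th>target</th></tr></thead>"
            f"<tbody>{rows}</tbody></table>"
        )


preview = PipelinePreview([("raw_data", "preprocess"), ("preprocess", "clean_data")])
rendered = preview._repr_html_()
print(rendered)
```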

Can we assume testing the demo_project size pipeline a success for first draft ?

Yes!

@astrojuanlu
Member Author

In the interest of time-boxing this effort and shipping incremental improvements towards the final goal, for now let's focus on having the Webpack bundle introduced in #2241 be part of the normal Kedro Viz release flow.

In parallel, we can show the current PoC to users to gather their early feedback.

@ravi-kumar-pilla
Contributor

In the interest of time-boxing this effort and shipping incremental improvements towards the final goal, for now let's focus on having the Webpack bundle introduced in #2241 be part of the normal Kedro Viz release flow.

In parallel, we can show the current PoC to users to gather their early feedback.

Sounds good @astrojuanlu . I will introduce the bundling into the current workflow.

@astrojuanlu astrojuanlu moved this to In Progress in Kedro 🔶 Jan 28, 2025
@astrojuanlu
Member Author

Bringing part of the discussion on #2256 here:

Looks like there are some present challenges with the bundling approach #2256 (comment)

@rashidakanchwala commented:

We need to evaluate maintainability, effort vs. impact, and alignment with future Kedro-Viz developments. Given that the second half of 2025 will focus on the Pipeline Editor, which will be a major architectural shift, I think we should take a step back and plan PS sessions to align on the best direction for Kedro-Viz.

Next Steps, can we do a PS session:

* Original problem - Kedro-Viz in Notebooks
  
  * Reviewing the user feedback we’ve received so far on the high fidelity prototypes
  * Understand what we can do to release an MVP soon

* UMD Bundling
  
  * What additional benefits would UMD bundling bring beyond notebook integration? cons to this?

and also another session on :-

* Broader Kedro-Viz Architecture
  
  * Flowchart Rendering as a separate Library
  * Separate the logic of Kedro project --> structure json pipeline in another package ([Spin off pipeline inspection to separate package kedro#4363](https://github.com/kedro-org/kedro/issues/4363))
  * Pros/cons of the above two?
  * Would this help future-proof Kedro-Viz, especially for the Pipeline Editor?

@ravi-kumar-pilla
Contributor

Hi @astrojuanlu ,

I had a discussion with Rashida and we agreed on shipping this feature as experimental using the production bundle. Here is what we will do to ship the feature in the next release -

  1. Create a folder umd in the current kedro-viz GH repository
  2. Upload the production bundles kedro-viz.production.min.js and vendors.production.min.js to the umd folder
  3. Automate the process of bundling and updating the bundles via make release, i.e., add a step to update this folder when we do a new release
  4. Use these bundles in the backend of our NotebookVisualizer class and create an HTML template which will be displayed in the notebook cell.

Let me know what you think of the approach.

Thank you

@astrojuanlu
Member Author

@ravi-kumar-pilla Let's proceed 👍🏼

Projects
Status: In Progress

5 participants