This package contains a workflow for creating knowledge graphs from Python repositories. To use it, first install the dependencies:
pip install -r requirements.txt
Then import the package with the following code:
import sys, os
sys.path.append(os.path.abspath('package'))
from package import KnowledgeGraphBuilder
Next, initialize the KnowledgeGraphBuilder object. To build the knowledge graph for large repositories, it is highly recommended to pass your GitHub access token to the KnowledgeGraphBuilder class constructor (more GitHub API requests can be made with a token). If you have no token, leave it empty: KnowledgeGraphBuilder().
kgb = KnowledgeGraphBuilder('your_github_token_here')
repograph = kgb.build_knowledge_graph(repo_name='scikit-learn/scikit-learn')
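To keep the token out of source files, it can be read from an environment variable. The following is a minimal sketch, not part of the package: the helper name `builder_args` and the variable name `GITHUB_TOKEN` are assumptions made for illustration.

```python
import os


def builder_args(env_var="GITHUB_TOKEN"):
    """Return positional arguments for the KnowledgeGraphBuilder constructor.

    Hypothetical helper: `env_var` is an assumed variable name.
    Returns (token,) if the variable is set and non-empty, otherwise ()
    so that the constructor is called without a token.
    """
    token = os.environ.get(env_var, "").strip()
    return (token,) if token else ()


# Usage (requires the package to be importable):
# kgb = KnowledgeGraphBuilder(*builder_args())
```

With the variable unset, this falls back to the token-less call KnowledgeGraphBuilder() described above.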
KnowledgeGraphBuilder.build_knowledge_graph() function parameters:
- repo_name: Name of the repository. Must match the format "owner/repo_name", as it is used for GitHub API calls.
- graph_type (optional): Type of subgraph to build from the functions. Can be "CFG" (Control Flow Graph) or "AST" (Abstract Syntax Tree). Default is "CFG".
- num_of_PRs (optional): Number of pull requests to retrieve in detail. Defaults to 0 (all).
- create_embedding (optional): Whether to create embeddings for the nodes. Defaults to False.
- repo_path_modifier (optional): Path modifier for the repository for cases when only a subfolder is meant to be parsed.
- URI (optional): URI of the Neo4j database to save the graph to.
- user (optional): Username for the Neo4j database.
- password (optional): Password for the Neo4j database.
Returns:
- object: The knowledge graph, a dictionary of dataframes.
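Because repo_name is passed on to the GitHub API, checking the arguments before starting a long build can save time. The sketch below is an illustration only: `is_valid_repo_name` and `build_kwargs` are hypothetical helpers, and the regular expression is a simplified assumption about which characters appear in owner and repository names.

```python
import re

# Assumed pattern: exactly one "owner/repo" pair made of letters, digits,
# '-', '_' and '.'. This is a simplification for illustration.
_REPO_NAME = re.compile(r"[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+")


def is_valid_repo_name(repo_name):
    """Return True if repo_name looks like "owner/repo_name"."""
    return _REPO_NAME.fullmatch(repo_name) is not None


def build_kwargs(repo_name, graph_type="CFG", num_of_PRs=0):
    """Collect keyword arguments for build_knowledge_graph after basic checks."""
    if not is_valid_repo_name(repo_name):
        raise ValueError(f'repo_name must look like "owner/repo_name": {repo_name!r}')
    if graph_type not in ("CFG", "AST"):
        raise ValueError('graph_type must be "CFG" or "AST"')
    if num_of_PRs < 0:
        raise ValueError("num_of_PRs must be 0 (all) or a positive number")
    return {"repo_name": repo_name, "graph_type": graph_type, "num_of_PRs": num_of_PRs}


# Usage (requires an initialized builder):
# repograph = kgb.build_knowledge_graph(**build_kwargs("scikit-learn/scikit-learn", "AST", 50))
```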
The knowledge graph object has the following keys:
- function_nodes: Call graph nodes, each representing a function
- function_edges: Call graph edges (function calls)
- subgraph_nodes: Subgraph nodes - Either the AST or CFG of a selected function's code.
- subgraph_edges: Subgraph edges (AST or CFG edges)
- subgraph_function_edges: Subgraph-node to Callgraph-node edges
- function_subgraph_edges: Callgraph-node to Subgraph-node edges
- import_nodes: Imported packages used in the repository (nodes in the graph)
- import_function_edges: Import nodes connected to functions that use them
- issues: Open issues about the repository
- pr_nodes: Pull requests, stored with their summary text.
- pr_function_edges: Connects PR nodes to functions of files that were modified in that PR. Modification status and added/deleted rows are stored.
- issue_nodes: Issues collected from the repository, stored as nodes.
- issue_pr_edges: Issue nodes connected to the PRs solving them.
- artifacts: Artifacts of the repo collected into a dataframe.
- actions: Actions of the repo collected into a dataframe.
Check the object keys:
repograph.keys()
Check dataframes:
repograph['function_nodes'].head()
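A quick way to inspect the result is to list the row count of every table and to check that the keys you plan to use are present, which also catches misspelled key names early. This is a minimal sketch: `summarize` and `missing_keys` are hypothetical helpers, and they rely only on the returned object behaving like a dictionary whose values support len().

```python
def summarize(graph):
    """Return {key: number of rows} for every table in the graph dictionary.

    Works with any mapping whose values support len(), which includes
    pandas DataFrames (len(df) is the row count).
    """
    return {key: len(table) for key, table in graph.items()}


def missing_keys(graph, expected):
    """Return the expected keys that are absent from the graph, in order."""
    return [key for key in expected if key not in graph]


# Usage (requires a built graph):
# for key, rows in summarize(repograph).items():
#     print(f"{key}: {rows} rows")
# print(missing_keys(repograph, ["function_nodes", "function_edges"]))
```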
Create an HTML visualization of the graph with the visualize_graph function. NOTE: for large graphs, it is advised to plot only a fraction of the nodes, otherwise the visualization might not render properly. Parameters:
- repograph: The dictionary containing the created repository graph.
- show_subgraph_nodes (optional): Whether to plot the subgraph (CFG or AST) nodes. Defaults to False.
- save_path (optional): The file path to save the visualization. Defaults to "./graph.html".
kgb.visualize_graph(repograph)
Saving the graph in different formats.
Saving and loading the resulting graph dictionary as a pickle.
import pickle
# Save the graph dictionary to disk
with open('graph.pkl', 'wb') as f:
    pickle.dump(repograph, f)
import pickle
# Load the graph dictionary back from disk
with open('graph.pkl', 'rb') as f:
    repograph = pickle.load(f)
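The two steps can also be wrapped in small helpers. The sketch below uses only the standard library; `save_graph` and `load_graph` are hypothetical names, not functions provided by the package.

```python
import pickle
from pathlib import Path


def save_graph(graph, path):
    """Pickle the graph dictionary to `path`, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(graph, f, protocol=pickle.HIGHEST_PROTOCOL)
    return path


def load_graph(path):
    """Load a graph dictionary previously written by save_graph.

    Only unpickle files you created yourself: pickle can execute
    arbitrary code while loading.
    """
    with Path(path).open("rb") as f:
        return pickle.load(f)
```

Pickle files are tied to Python and to the library versions used to write them, so they are best suited for short-term caching on the same machine.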
The result can be saved to a Neo4j database by calling the store_knowledge_graph_in_neo4j method. Parameters:
- URI: URI of the Neo4j database to save the graph to.
- user: Username for the Neo4j database.
- password: Password for the Neo4j database.
- knowledge_graph: The knowledge graph to save.
If the URI, user and password parameters are passed to the build_knowledge_graph method, this method is called automatically and the graph is saved to Neo4j.
kgb.store_knowledge_graph_in_neo4j(
    URI="neo4j://127.0.0.1:7687",
    user="neo4j",
    password="password",
    knowledge_graph=repograph
)
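Rather than writing the database password into the script, the connection settings can be collected from environment variables. This is a sketch under assumptions: the helper `neo4j_settings` and the variable names NEO4J_URI, NEO4J_USER and NEO4J_PASSWORD are made up for this example, and the defaults mirror the values used in the call above.

```python
import os


def neo4j_settings(env=os.environ):
    """Collect Neo4j connection settings from environment variables.

    The variable names NEO4J_URI, NEO4J_USER and NEO4J_PASSWORD are
    assumptions made for this example. The password has no default,
    so a missing value fails early instead of at connection time.
    """
    password = env.get("NEO4J_PASSWORD")
    if not password:
        raise RuntimeError("Set NEO4J_PASSWORD before storing the graph")
    return {
        "URI": env.get("NEO4J_URI", "neo4j://127.0.0.1:7687"),
        "user": env.get("NEO4J_USER", "neo4j"),
        "password": password,
    }


# Usage (requires a builder and a built graph):
# kgb.store_knowledge_graph_in_neo4j(knowledge_graph=repograph, **neo4j_settings())
```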
- Running the code on full repositories might take some time.
- If the repository has lots of PRs, it's recommended to pass a GitHub token when initializing KnowledgeGraphBuilder. Even with the token, querying everything through the API might take a long time (for test purposes, it's recommended to limit the number of PRs retrieved in detail with num_of_PRs).
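Since a full build can be slow, one option is to cache the result on disk and rebuild only when no cache file exists. The following sketch is not part of the package: `load_or_build` is a hypothetical helper that accepts any zero-argument callable producing the graph.

```python
import pickle
from pathlib import Path


def load_or_build(cache_path, build):
    """Return the cached graph if cache_path exists, else build and cache it.

    `build` is any zero-argument callable that returns the graph dictionary.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)
    graph = build()
    with cache_path.open("wb") as f:
        pickle.dump(graph, f)
    return graph


# Usage (requires an initialized builder); num_of_PRs=50 is an arbitrary test limit:
# repograph = load_or_build(
#     "graph.pkl",
#     lambda: kgb.build_knowledge_graph(repo_name="scikit-learn/scikit-learn", num_of_PRs=50),
# )
```

Delete the cache file to force a fresh build, for example after the repository has changed.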