The Split Generator reads localized subgraph samples produced by the Subgraph Sampler and splits the data into training, validation, and test sets. Which nodes and edges end up in which split depends on the semantics of the chosen splitting strategy.
- job_name (AppliedTaskIdentifier): Uniquely identifies an end-to-end task.
- task_config_uri (Uri): Path which points to a "template" `GbmlConfig` proto yaml file.
- resource_config_uri (Uri): Path which points to a `GiGLResourceConfig` yaml file.
Optional Development Args:
- cluster_name (str): Optional parameter to re-use an existing cluster for development.
- skip_cluster_delete (bool): Flag to skip automatic cleanup of the Dataproc cluster.
- debug_cluster_owner_alias (str): Adds an owner alias to the cluster for debugging/development.
The Split Generator undertakes the following actions:

- Reads the frozen `GbmlConfig` proto yaml, which contains a pointer to an instance of a `SplitStrategy` class (see the `splitStrategyClsPath` field of `datasetConfig.splitGeneratorConfig`) and an instance of an `Assigner` class (see the `assignerClsPath` field of `datasetConfig.splitGeneratorConfig`). These classes house the logic that dictates how nodes and/or edges are assigned to different buckets, which in turn determine their membership in the training, validation, and test sets. See the currently supported strategies in `scala/splitgenerator/src/main/scala/lib/split_strategies/*`.
  Custom arguments can also be passed into the `SplitStrategy` class (`Assigner` class) by including them in the `splitStrategyArgs` (`assignerArgs`) field inside the `datasetConfig.splitGeneratorConfig` section of `GbmlConfig`. Several standard configurations of `SplitStrategy` and corresponding `Assigner` classes are already implemented at the GiGL platform level: transductive node classification, inductive node classification, and transductive link prediction split routines, as detailed here.
- The component kicks off a Spark job which reads the samples produced by the Subgraph Sampler component, which are stored at URIs referenced inside the `sharedConfig.flattenedGraphMetadata` section of the frozen `GbmlConfig`. Note that depending on the `taskMetadata` in the `GbmlConfig`, the URIs will be housed under different keys in this section; for example, given the Node-anchor Based Link Prediction setting used in the sample frozen `GbmlConfig` MAU yaml, the Subgraph Sampler outputs can be found under the `nodeAnchorBasedLinkPredictionOutput` field. Upon reading the outputs from Subgraph Sampler, the Split Generator executes the methods defined in the provided `SplitStrategy` instance on each input sample. The pipeline then writes TFRecord samples, containing the data meant to be visible in the training, validation, and test sets, to GCS.
Firstly, you can adjust the following parameters in the `GbmlConfig`:

```yaml
splitGeneratorConfig:
  assignerArgs:
    seed: '42'
    test_split: '0.2'
    train_split: '0.7'
    val_split: '0.1'
  assignerClsPath: splitgenerator.lib.assigners.TransductiveEdgeToLinkSplitHashingAssigner
  splitStrategyClsPath: splitgenerator.lib.split_strategies.TransductiveNodeAnchorBasedLinkPredictionSplitStrategy
```
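Note that the split ratios are passed as strings; presumably they should cover the full dataset, so a quick sanity check (purely illustrative, not part of GiGL) is:

```python
# Hypothetical check that the configured ratios sum to 1.0.
assigner_args = {"train_split": "0.7", "val_split": "0.1", "test_split": "0.2"}
total = sum(float(v) for v in assigner_args.values())
assert abs(total - 1.0) < 1e-9, "train/val/test ratios should sum to 1.0"
```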
Import GiGL

```python
from gigl.src.split_generator.split_generator import SplitGenerator
from gigl.common import UriFactory
from gigl.src.common.types import AppliedTaskIdentifier

split_generator = SplitGenerator()
split_generator.run(
    applied_task_identifier=AppliedTaskIdentifier("my_gigl_job_name"),
    task_config_uri=UriFactory.create_uri("gs://my-temp-assets-bucket/task_config.yaml"),
    resource_config_uri=UriFactory.create_uri("gs://my-temp-assets-bucket/resource_config.yaml"),
)
```
Command Line

```bash
python -m gigl.src.split_generator.split_generator \
  --job_name my_gigl_job_name \
  --task_config_uri="gs://my-temp-assets-bucket/task_config.yaml" \
  --resource_config_uri="gs://my-temp-assets-bucket/resource_config.yaml"
```
The Python entry point `split_generator.py` performs the following:
- Creates a Dataproc cluster suitable for the scale of the graph at hand,
- Installs Spark and Scala dependencies,
- Runs the Split Generator Spark job,
- Deletes the Dataproc cluster after the job has finished.
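The lifecycle above (create cluster, run job, clean up) can be sketched generically; all helper names here are hypothetical stand-ins, not the actual GiGL internals:

```python
def run_on_ephemeral_cluster(create_cluster, submit_job, delete_cluster,
                             skip_cluster_delete: bool = False):
    """Generic create-run-cleanup lifecycle around an ephemeral cluster.

    The three callables are hypothetical stand-ins for Dataproc operations.
    `skip_cluster_delete` mirrors the --skip_cluster_delete flag: when set,
    the cluster survives the run so it can be re-used for development.
    """
    cluster = create_cluster()
    try:
        return submit_job(cluster)
    finally:
        # Cleanup happens even if the job raises, unless explicitly skipped.
        if not skip_cluster_delete:
            delete_cluster(cluster)
```

The `try/finally` shape is the important part: without it, a failed Spark job would leave an orphaned (billable) cluster behind.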
Optional Arguments:

Provide a custom cluster name so you can re-use it instead of creating a new one every time.

`--cluster_name="unique_name_for_the_cluster"`

Skip deleting the cluster so it can be re-used. But be sure to clean it up manually afterwards to prevent wasted spend.

`--skip_cluster_delete`

Marks the cluster as being used for debugging/development by the provided alias, i.e. for username some_user, provide `--debug_cluster_owner_alias="some_user"`.

`--debug_cluster_owner_alias="your_alias"`
Example for when you would want to use the cluster for development:

```bash
python -m gigl.src.split_generator.split_generator \
  --job_name my_gigl_job_name \
  --task_config_uri="gs://my-temp-assets-bucket/task_config.yaml" \
  --resource_config_uri="gs://my-temp-assets-bucket/resource_config.yaml" \
  --cluster_name="unique-name-for-the-cluster" \
  --skip_cluster_delete \
  --debug_cluster_owner_alias="$(whoami)"
```
Upon completion of the Spark job referenced in the last bullet point of the What Does it Do section, the Split Generator writes out TFRecord samples belonging to each of the training, validation, and test sets to URIs referenced in the `sharedConfig.datasetMetadata` section of the `GbmlConfig`. Based on the `taskMetadata` in the `GbmlConfig`, the outputs will be written under different keys within this section. Given the sample configs for the MAU task referenced here, they are written to the URIs referenced at the `NodeAnchorBasedLinkPredictionDataset` field.
- To customize the semantics of the splitting method, users can manipulate the arguments passed to existing `Assigner` and `SplitStrategy` class instances, or even write their own. The provided instances reflect "standard" splitting techniques from the graph ML literature, which can be tricky to implement correctly, so caution is advised when customizing or writing modified variants, in order to avoid leaking data between the training, validation, and test sets.
- Currently, all `SplitStrategy` instances leverage `HashingAssigner` (a specialized `Assigner` in which nodes/edges are assigned to different buckets randomly, reflecting random splits). In the future, we can consider introducing new `Assigner` policies to reflect temporal splitting.
- Design: Graph ML data splitting is tricky. Please see here for a good academic reference on how splitting is standardly conducted to avoid leakage. We chose to create abstractions around splitting which reflect flexible policies for assigning nodes and/or edges to different buckets, from which the data visible during training, validation, and testing is determined deterministically.
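To make the leakage concern concrete, here is a sketch of a transductive link-prediction split. The exact semantics are an assumption based on the standard recipe (supervision edges for val/test must not appear among the edges used for message passing at that phase); the names are illustrative, not GiGL's:

```python
def transductive_edge_split(edges, assign_fn):
    """Sketch of a transductive link-prediction split (assumed semantics).

    Each edge lands in exactly one supervision bucket. Message passing for
    val/test only sees edges from "earlier" buckets, so no supervision edge
    leaks into the messages used to predict it.
    """
    buckets = {"train": [], "val": [], "test": []}
    for edge in edges:
        buckets[assign_fn(edge)].append(edge)
    return {
        "train": {"supervision": buckets["train"],
                  "message_passing": buckets["train"]},
        "val":   {"supervision": buckets["val"],
                  "message_passing": buckets["train"]},
        "test":  {"supervision": buckets["test"],
                  "message_passing": buckets["train"] + buckets["val"]},
    }
```

A naive split that let validation message passing see validation edges would let the model "read off" the edges it is asked to predict, which is exactly the leakage the abstractions above guard against.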
This component runs on Spark. Some info on monitoring this job:

- The list of all jobs/clusters is available in the Dataproc UI, where we can monitor the overall Spark job statuses and configurations.
- While the cluster is running, we can access the Spark UI via the WEB INTERFACES tab to monitor each stage of the job in more detail.