
Split Generator

The Split Generator reads localized subgraph samples produced by the Subgraph Sampler and splits the data into training, validation, and test sets. Which nodes and edges end up in which split depends on the semantics of the chosen splitting strategy.

Input

  • job_name (AppliedTaskIdentifier): Uniquely identifies an end-to-end task.
  • task_config_uri (Uri): Path pointing to a "template" GbmlConfig proto yaml file.
  • resource_config_uri (Uri): Path pointing to a GiGLResourceConfig yaml file.

Optional Development Args:

  • cluster_name (str): Optional; re-use an existing cluster during development.
  • skip_cluster_delete (bool): Flag to skip automatic cleanup of the Dataproc cluster.
  • debug_cluster_owner_alias (str): Adds an owner alias to the cluster.

What does it do?

The Split Generator undertakes the following actions:

  • Reads the frozen GbmlConfig proto yaml, which contains a pointer to an instance of a SplitStrategy class (see the splitStrategyClsPath field of datasetConfig.splitGeneratorConfig) and an instance of an Assigner class (see the assignerClsPath field of the same section). These classes house the logic that dictates how nodes and/or edges are assigned to different buckets, which in turn determines their membership in the training, validation, and test sets. See the currently supported strategies in: scala/splitgenerator/src/main/scala/lib/split_strategies/*

    Custom arguments can also be passed into the SplitStrategy class (or Assigner class) by including them in the splitStrategyArgs (assignerArgs) field(s) inside the datasetConfig.splitGeneratorConfig section of GbmlConfig. Several standard configurations of SplitStrategy and corresponding Assigner classes are already implemented at the GiGL platform level: transductive node classification, inductive node classification, and transductive link prediction split routines, as detailed here.

  • The component kicks off a Spark job that reads samples produced by the Subgraph Sampler component, which are stored at URIs referenced inside the sharedConfig.flattenedGraphMetadata section of the frozen GbmlConfig. Note that depending on the taskMetadata in the GbmlConfig, the URIs are housed under different keys in this section; for example, in the Node-anchor Based Link Prediction setting used in the sample frozen GbmlConfig MAU yaml, the Subgraph Sampler outputs live under the nodeAnchorBasedLinkPredictionOutput field. Upon reading these outputs, the Split Generator executes the methods defined in the provided SplitStrategy instance on each input sample, and writes out TFRecord samples containing the data meant to be visible in the training, validation, and test sets to GCS.
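The bucket assignment performed by a hashing-based Assigner can be sketched as follows. This is a minimal, hypothetical Python illustration of the idea (a seeded hash places each edge at a stable position in [0, 1), and cumulative split fractions carve that interval into buckets); GiGL's actual implementation is the Scala code linked above.

```python
import hashlib

def assign_split(edge_id: str, seed: int = 42,
                 train: float = 0.7, val: float = 0.1) -> str:
    """Deterministically map an edge to a train/val/test bucket.

    Hypothetical helper for illustration: a seeded hash gives every
    edge a stable position in [0, 1); cumulative split fractions
    carve that interval into buckets, so re-running never shuffles
    an edge between splits.
    """
    digest = hashlib.sha256(f"{seed}:{edge_id}".encode()).hexdigest()
    position = int(digest, 16) / 16 ** len(digest)  # uniform in [0, 1)
    if position < train:
        return "train"
    if position < train + val:
        return "val"
    return "test"

# The same edge always lands in the same bucket for a given seed.
print(assign_split("user_1->item_9"))
```

Because assignment depends only on the hash of the edge and the seed, the split is reproducible without storing any assignment state.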

How do I run it?

First, you can adjust the following parameters in the GbmlConfig:

```yaml
splitGeneratorConfig:
  assignerArgs:
    seed: '42'
    test_split: '0.2'
    train_split: '0.7'
    val_split: '0.1'
  assignerClsPath: splitgenerator.lib.assigners.TransductiveEdgeToLinkSplitHashingAssigner
  splitStrategyClsPath: splitgenerator.lib.split_strategies.TransductiveNodeAnchorBasedLinkPredictionSplitStrategy
```
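Note that the three *_split fractions in the sample config above sum to 1.0, partitioning the data, and that YAML scalars arrive as strings. A small sanity-check sketch one might run (the dict below simply mirrors the assignerArgs shown above):

```python
# Mirrors the assignerArgs shown above; YAML scalars arrive as
# strings, so cast them before validating the fractions.
assigner_args = {"seed": "42", "test_split": "0.2",
                 "train_split": "0.7", "val_split": "0.1"}

# Keep only the split fractions and check they form a full partition.
splits = {k: float(v) for k, v in assigner_args.items() if k.endswith("_split")}
total = sum(splits.values())
assert abs(total - 1.0) < 1e-9, f"splits must sum to 1.0, got {total}"
```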

Import GiGL

```python
from gigl.src.split_generator.split_generator import SplitGenerator
from gigl.common import UriFactory
from gigl.src.common.types import AppliedTaskIdentifier

split_generator = SplitGenerator()

split_generator.run(
    applied_task_identifier=AppliedTaskIdentifier("my_gigl_job_name"),
    task_config_uri=UriFactory.create_uri("gs://my-temp-assets-bucket/task_config.yaml"),
    resource_config_uri=UriFactory.create_uri("gs://my-temp-assets-bucket/resource_config.yaml"),
)
```

Command Line

```sh
python -m gigl.src.split_generator.split_generator \
  --job_name my_gigl_job_name \
  --task_config_uri "gs://my-temp-assets-bucket/task_config.yaml" \
  --resource_config_uri "gs://my-temp-assets-bucket/resource_config.yaml"
```

The python entry point split_generator.py performs the following:

  • Creates a Dataproc cluster suitable for the scale of the graph at hand,
  • Installs Spark and Scala dependencies,
  • Runs the Split Generator Spark job,
  • Deletes the Dataproc cluster after the job finishes.
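These steps roughly map onto gcloud Dataproc operations. The sketch below only builds the command lines; the cluster name, region, and jar URI are illustrative placeholders, not GiGL's actual invocation.

```python
# Sketch of the cluster lifecycle; all names and the jar URI below
# are illustrative placeholders, not GiGL's actual values.
def build_lifecycle_commands(cluster: str, region: str, jar_uri: str) -> list[list[str]]:
    return [
        # 1. Create a Dataproc cluster sized for the graph.
        ["gcloud", "dataproc", "clusters", "create", cluster, f"--region={region}"],
        # 2-3. Submit the Split Generator Spark job (deps bundled in the jar).
        ["gcloud", "dataproc", "jobs", "submit", "spark",
         f"--cluster={cluster}", f"--region={region}", f"--jar={jar_uri}"],
        # 4. Tear the cluster down once the job finishes.
        ["gcloud", "dataproc", "clusters", "delete", cluster, f"--region={region}"],
    ]

cmds = build_lifecycle_commands(
    "my-cluster", "us-central1", "gs://bucket/split_generator.jar"
)
```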

Optional Arguments: Provide a custom cluster name so you can re-use the cluster instead of creating a new one every time.

```sh
--cluster_name="unique_name_for_the_cluster"
```

Skip deleting the cluster so it can be re-used, but be sure to clean it up manually afterwards to avoid wasted cost.

```sh
--skip_cluster_delete
```

Marks the cluster as being used for debugging/development by the provided alias; e.g., for username some_user, provide debug_cluster_owner_alias="some_user".

```sh
--debug_cluster_owner_alias="your_alias"
```

Example for when you want to re-use a cluster for development:

```sh
python -m gigl.src.split_generator.split_generator \
  --job_name my_gigl_job_name \
  --task_config_uri "gs://my-temp-assets-bucket/task_config.yaml" \
  --resource_config_uri "gs://my-temp-assets-bucket/resource_config.yaml" \
  --cluster_name="unique-name-for-the-cluster" \
  --skip_cluster_delete \
  --debug_cluster_owner_alias="$(whoami)"
```

Output

Upon completion of the Spark job referenced in the last bullet point of the What Does it Do section, the Split Generator writes TFRecord samples belonging to each of the training, validation, and test sets to URIs referenced in the sharedConfig.datasetMetadata section of the GbmlConfig. Depending on the taskMetadata in the GbmlConfig, the outputs are written under different keys within this section. Given the sample configs for the MAU task referenced here, they are written to URIs referenced at the NodeAnchorBasedLinkPredictionDataset field.

Custom Usage

  • To customize the semantics of the splitting method, users can adjust the arguments passed to existing Assigner and SplitStrategy class instances, or write their own. The provided instances reflect "standard" splitting techniques from the graph ML literature, which can be tricky to implement correctly; exercise caution when customizing or writing modified variants to avoid leaking data between the training, validation, and test sets.

  • Currently, all SplitStrategy instances leverage HashingAssigner (a specialized Assigner in which nodes / edges are assigned to different buckets randomly, reflecting random splits). In the future, we can consider introducing new Assigner policies to reflect temporal splitting.
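As a thought experiment, a temporal Assigner policy like the one mentioned above could bucket edges by timestamp cutoffs instead of hashes. The sketch below is purely hypothetical and not part of GiGL today; the function name and cutoff parameters are invented for illustration.

```python
# Hypothetical temporal assigner: edges before the first cutoff train,
# edges between the cutoffs validate, later edges test. Sketched from
# the "temporal splitting" idea above; not part of GiGL today.
def assign_by_time(edge_ts: int, val_cutoff: int, test_cutoff: int) -> str:
    if edge_ts < val_cutoff:
        return "train"
    if edge_ts < test_cutoff:
        return "val"
    return "test"

# An old interaction trains; the most recent interactions are held out.
print(assign_by_time(1_000, val_cutoff=5_000, test_cutoff=8_000))
```

Unlike a random (hash-based) split, this respects time ordering, so the model never trains on edges from the future of its evaluation window.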

Other

  • Design: Graph ML data splitting is tricky. Please see here for a good academic reference on how splitting is conventionally conducted to avoid leakage. We chose to build abstractions around splitting that encode flexible policies for assigning nodes and/or edges to different buckets, from which the data visible during training, validation, and testing follows deterministically.

This component runs on Spark. Some info on monitoring this job:

  • The list of all jobs/clusters is available in the Dataproc UI, where we can monitor overall Spark job statuses and configurations.

  • While the cluster is running, we can use the Spark UI's WEB INTERFACES tab to monitor each stage of the job in more detail.