
GiGL Logo

GiGL: Gigantic Graph Learning

This repo is a bit dusty while we prepare for the OSS launch; we will have it in working order soon.

GiGL is an open-source library for training and inference of Graph Neural Networks at very large (billion) scale.

See 📖 Documentation for more details

Key Features 🌟

  • 🧠 Versatile GNN Applications: Supports easy customization of GNNs for supervised and unsupervised ML applications such as node classification and link prediction.

  • 🚀 Designed for Scalability: The architecture is built with horizontal scaling in mind, ensuring cost-effective performance throughout the process of data preprocessing and transformation, model training, and inference.

  • 🎛️ Easy Orchestration: Simplified end-to-end orchestration, making it easy for developers to implement, scale, and manage their GNN projects.


GiGL Components ⚡️

GiGL contains six components, each designed to facilitate the platform's end-to-end graph machine learning (ML) tasks. The components are as follows:

Component           Source Code   Documentation
Config Populator    here          here
Data Preprocessor   here          here
Subgraph Sampler    here          here
Split Generator     here          here
Trainer             here          here
Inferencer          here          here

The figure below illustrates at a high level how all the components work together in an end-to-end GiGL pipeline.

gigl-framework

Installation ⚙️

There are various ways to use GiGL. The recommended approach is to set up a conda environment using the provided make commands.

From the root directory:

make initialize_environment
conda activate gnn

This creates a Python 3.9 environment with some basic utilities. Next, to install all user dependencies:

make install_deps
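
After installing, a minimal sanity check (assuming the gnn environment created above is active) is to import the package:

# Should succeed without errors once `make install_deps` has completed.
import gigl
print(gigl.__name__)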

If you instead want a developer install, which includes extra tooling useful for contributions:

make install_dev_deps

Local Repo Setup

For developing on GiGL, see our development guide and contribution guidelines.

Using Docker (TODO)

Configuration 📄

Before running components in GiGL, you need to set up your config files; each component requires them to operate. The two required files are:

  • Resource Config: Details the resource allocation and environment settings for all GiGL components, covering both resources shared across components and component-specific settings.

  • Task Config: Specifies task-related configurations, guiding the behavior of components according to the needs of your machine learning task.

To configure these files and customize your GiGL setup, follow our step-by-step guides.
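
Both files are referenced by URI, together with a job name, whenever a component is run. The values used throughout the Usage examples below are placeholders along these lines (substitute your own bucket, file names, and job identifier):

# Placeholders only: point these at your own config files and job name.
task_config_uri = "gs://your_project_bucket/task_config.yaml"
resource_config_uri = "gs://your_project_bucket/resource_config.yaml"
job_name = "your_job_name"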

Usage 🚀

GiGL offers three primary ways to run the components for your graph machine learning tasks.

1. Importable gigl

To easily get started or incorporate gigl into your existing workflows, you can simply import gigl and call the .run() method on its components.

Example
from gigl.src.training.trainer import Trainer

trainer = Trainer()
trainer.run(task_config_uri, resource_config_uri, job_name)

2. Command-Line Execution

Each GiGL component can be executed as a standalone module from the command line. This method is useful for batch processing or when integrating into shell scripts.

Example
python -m \
    gigl.src.training.trainer \
    --job_name your_job_name \
    --task_config_uri "gs://your_project_bucket/task_config.yaml" \
    --resource_config_uri "gs://your_project_bucket/resource_config.yaml"

3. Kubeflow Pipeline Orchestration

GiGL also supports pipeline orchestration using Kubeflow, which lets you kick off an end-to-end run with little to no code. See Kubeflow Orchestration for more information.
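
As a rough, generic illustration of submitting a compiled Kubeflow pipeline, the sketch below uses the plain KFP SDK rather than GiGL's own orchestration entry points; the endpoint, package path, and parameter names are placeholders, and the actual interface is described in the Kubeflow Orchestration docs.

import kfp

# Placeholders throughout: endpoint, compiled pipeline package, and parameter
# names are illustrative; see the Kubeflow Orchestration docs for GiGL's flow.
client = kfp.Client(host="https://your-kfp-endpoint")
client.create_run_from_pipeline_package(
    "gigl_pipeline.yaml",
    arguments={
        "job_name": "your_job_name",
        "task_config_uri": "gs://your_project_bucket/task_config.yaml",
        "resource_config_uri": "gs://your_project_bucket/resource_config.yaml",
    },
)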


The best way to get more familiar with GiGL is to go through the various examples, or see our user guide for specific details.

Tests 🔧

Testing in GiGL is designed to ensure reliability and robustness across different components of the library. We support three types of tests: unit tests, local integration tests, and cloud integration end-to-end tests.

Unit Tests

GiGL's unit tests focus on validating the functionality of individual components and high-level utilities. They also check for proper formatting, typing, and linting standards.

More Details
  • No external assets or GCP project is required.
  • Unit tests run on every commit to a pull request via GitHub Actions.

To run unit tests locally, execute the following command:

# Runs both Scala and Python unit tests.
make unit_test

# Runs just Python unit tests
make unit_test_py

# Runs just Scala unit tests
make unit_test_scala

Local Integration Test

GiGL's local integration tests simulate the pipeline behavior of GiGL components. These tests are crucial for verifying that components function correctly in sequence and that outputs from one component are correctly handled by the next.

More Details
  • Utilize mocked/synthetic data publicly hosted in GCS (see: Public Assets)
  • Require access to, and run on, cloud services such as BigQuery, Dataflow, etc.
  • Required to pass before merging a PR (pre-merge check)

To run integration tests locally, provide your own resource config and run the following command:

make integration_test resource_config_uri="gs://your-project-bucket/resource_config.yaml"

Cloud Integration Test (End-to-End)

Cloud integration tests run a full end-to-end GiGL pipeline within GCP, also leveraging cloud services such as Dataflow, Dataproc, and Vertex AI.

More Details
  • Utilize mocked/synthetic data publicly hosted in GCS (see: Public Assets)
  • Require access to, and run on, cloud services such as BigQuery, Dataflow, etc.
  • Required to pass before merging a PR (pre-merge check). Access to the orchestration, logs, etc., is restricted to authorized internal engineers to maintain security; failures will be reported back to the contributor as needed.

To test cloud integration functionality yourself, you can replicate it by running an end-to-end pipeline, following one of our Cora examples (see: Examples).


Contribution 🔥

Your contributions are always welcome and appreciated. Here are some of the ways you can contribute to this project:

  1. Report a bug
    If you think you have encountered a bug, please feel free to report it here and someone from the team will take a look.

  2. Request a feature
    Feature requests are always welcome! You can request a feature by adding it here.

  3. Create a pull request
    Pull requests are always greatly appreciated. You can get started by picking up any open issues from here and making a pull request.

If you are new to open source, make sure to read more about it here and learn more about creating a pull request here.

For more information, see our Contributing Guide.

Additional Resources ❗

You may still have unanswered questions or may be facing issues. If so, please see our FAQ or our User Guide for further guidance.

License 🔒

MIT License
