GraphLearn-for-PyTorch (GLT) is a graph learning library for PyTorch that makes distributed GNN training and inference easy and efficient. It uses GPUs to accelerate graph sampling and uses UVA to reduce the conversion and copying of vertex and edge features. For large-scale graphs, it supports distributed training on multiple GPUs or multiple machines through fast distributed sampling and feature lookup. It also provides flexible deployment options for distributed training to meet different requirements.
- GPU acceleration: GLT provides both CPU-based and GPU-based graph operators such as neighbor sampling, negative sampling, and feature lookup. For GPU training, GPU-based graph operations accelerate the computation and considerably reduce data movement.
- Scalable and efficient distributed training: For distributed training, we implement multi-process asynchronous sampling, a pinned-memory buffer, and a hot-feature cache, and use fast networking technologies (PyTorch RPC with RDMA support) to speed up distributed sampling and reduce communication. As a result, GLT can achieve high scalability and support graphs with billions of edges.
- Easy-to-use API: Most GLT APIs are compatible with PyG/PyTorch, so you only need to modify a few lines of PyG code to accelerate your program. GLT-specific APIs are compatible with PyTorch and come with complete documentation and usage examples.
- Large-scale real-world GNN models: We focus on real-world scenarios and provide distributed GNN training examples on large-scale graphs. Since GLT is compatible with PyG, you can use almost any PyG model as the base model. We will also continue to provide models that have proven effective in industrial scenarios.
- Easy to extend: GLT directly uses PyTorch C++ Tensors and is easy to extend just like PyTorch. There are no extra restrictions for CPU-based or CUDA-based graph operators, and adding a new one is straightforward. For distributed operations, you can write a new one in Python using PyTorch RPC.
- Flexible deployment: The graph engine (graph operators) and the PyTorch engine (PyTorch nn modules) can be deployed either co-located or separately on different machines. This flexibility enables you to deploy GLT in your own environment or embed it in your project easily.
The main goal of GLT is to leverage hardware resources like GPU/NVLink/RDMA and characteristics of GNN models to accelerate end-to-end GNN training in both the single-machine and distributed environments.
In the case of multi-GPU training, graph sampling and CPU-GPU data transfer can easily become the major performance bottleneck. To speed up graph sampling and feature lookup, GLT implements the Unified Tensor Storage to unify the memory management of CPU and GPU. Based on this storage, GLT supports both CPU-based and GPU-based graph operators such as neighbor sampling, negative sampling, feature lookup, and subgraph sampling. To alleviate the CPU-GPU data transfer overhead incurred by feature collection, GLT supports caching the features of hot vertices in GPU memory and accessing the remaining feature data (stored in pinned memory) via UVA. We further utilize the high-speed NVLink between GPUs to expand the capacity of the GPU cache.
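The hot-feature cache described above can be pictured with a minimal pure-Python sketch. This is not GLT's implementation: the class `TwoTierFeatureStore` and its fields are hypothetical names, and plain lists stand in for GPU memory (hot tier) and pinned host memory (cold tier).

```python
# Illustrative sketch only. Plain Python lists stand in for GPU memory (hot tier)
# and pinned host memory (cold tier). All names here are hypothetical.

class TwoTierFeatureStore:
    def __init__(self, features, hot_ids):
        # features: per-vertex feature vectors, indexed by vertex id.
        # hot_ids: ids of the vertices whose features are cached in the hot tier.
        self.hot_slot = {vid: slot for slot, vid in enumerate(hot_ids)}
        self.hot = [features[vid] for vid in hot_ids]
        self.cold = features
        self.hot_hits = 0
        self.cold_hits = 0

    def lookup(self, ids):
        # Serve each id from the hot tier when cached, otherwise from the cold tier.
        out = []
        for vid in ids:
            slot = self.hot_slot.get(vid)
            if slot is not None:
                self.hot_hits += 1
                out.append(self.hot[slot])
            else:
                self.cold_hits += 1
                out.append(self.cold[vid])
        return out


features = [[float(v), float(v) * 2.0] for v in range(6)]
store = TwoTierFeatureStore(features, hot_ids=[4, 1])
batch = store.lookup([1, 3, 4, 4])
print(batch)
print(store.hot_hits, store.cold_hits)
```

The more often sampled vertices fall into the hot tier, the fewer lookups have to cross the CPU-GPU boundary, which is why the choice of cached vertices matters.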
As for distributed training, to prevent remote data access from blocking the progress of model training, GLT implements an efficient RPC framework on top of PyTorch RPC and adopts asynchronous graph sampling and feature lookup operations to hide the network latency and boost the end-to-end training throughput.
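The idea of hiding sampling latency behind training can be sketched with a producer-consumer pair from the Python standard library. This is only an analogy under stated assumptions: GLT uses its own RPC framework and sampling processes, whereas the sketch below uses a thread, a bounded queue, and a `time.sleep` call as a stand-in for network latency. The function names are hypothetical.

```python
import queue
import threading
import time


def sample_batches(num_batches, out_queue, latency=0.01):
    # Producer: stands in for remote graph sampling and feature lookup.
    for i in range(num_batches):
        time.sleep(latency)  # simulated network latency
        out_queue.put({"batch_id": i})
    out_queue.put(None)  # sentinel: no more batches


def train(in_queue):
    # Consumer: stands in for the training loop. It only blocks when the
    # buffer is empty, so sampling overlaps with training work.
    seen = []
    while True:
        batch = in_queue.get()
        if batch is None:
            break
        seen.append(batch["batch_id"])
    return seen


channel = queue.Queue(maxsize=4)  # bounded buffer between sampler and trainer
producer = threading.Thread(target=sample_batches, args=(8, channel), daemon=True)
producer.start()
trained = train(channel)
producer.join()
print(trained)
```

The bounded queue keeps the sampler from running arbitrarily far ahead of the trainer, which limits memory use while still keeping batches ready.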
To lower the learning curve for PyG users, the APIs of key abstractions in GLT, such as dataset and dataloader, are designed to be compatible with PyG. Thus PyG users can take full advantage of GLT's acceleration capabilities by only modifying very few lines of code.
For model training, GLT supports different models to fit different scales of real-world graphs. It allows users to collocate model training and graph sampling (including feature lookup) in the same process, or separate them into different processes or even different machines. We provide two examples to illustrate the training process on small graphs: a single-GPU training example and a multi-GPU training example. For large-scale graphs, GLT separates the sampling and training processes for asynchronous and parallel acceleration, and supports deploying them on the same or different machines. Examples of distributed training can be found in the distributed examples.
- cuda
- python>=3.6
- torch(PyTorch)
- torch_geometric, torch_scatter, torch_sparse. Please refer to PyG for installation.
# glibc>=2.14, torch>=1.13
pip install graphlearn-torch
git submodule update --init
sh install_dependencies.sh
- Build
python setup.py bdist_wheel
pip install dist/*
Build in CPU mode:
WITH_CUDA=OFF python setup.py bdist_wheel
pip install dist/*
- UT
sh scripts/run_python_ut.sh
If you need to test C++ operations, you can build only the C++ part.
- Build
cmake .
make -j
- UT
sh scripts/run_cpp_ut.sh
Let's take PyG's GraphSAGE on OGBN-Products as an example.
You only need to replace PyG's torch_geometric.loader.NeighborSampler
with graphlearn_torch.loader.NeighborLoader
to benefit from GLT's acceleration of model training.
import torch
import graphlearn_torch as glt
import os.path as osp
from ogb.nodeproppred import PygNodePropPredDataset
# PyG's original code preparing the ogbn-products dataset
root = osp.join(osp.dirname(osp.realpath(__file__)), '..', 'data', 'products')
dataset = PygNodePropPredDataset('ogbn-products', root)
split_idx = dataset.get_idx_split()
data = dataset[0]
# Enabling GLT acceleration on PyG requires only replacing
# PyG's NeighborSampler with the following code.
glt_dataset = glt.data.Dataset()
glt_dataset.build(edge_index=data.edge_index,
                  feature_data=data.x,
                  sort_func=glt.data.sort_by_in_degree,
                  split_ratio=0.2,
                  label=data.y,
                  device=0)
train_loader = glt.loader.NeighborLoader(glt_dataset,
                                         [15, 10, 5],
                                         split_idx['train'],
                                         batch_size=1024,
                                         shuffle=True,
                                         drop_last=True,
                                         as_pyg_v1=True)
The complete example can be found in examples/train_sage_ogbn_products.py.
When building glt_dataset, the device parameter specifies the GPU on which
graph sampling operations are performed. By default, the graph topology is stored
in pinned memory for ZERO-COPY access. Users can also choose to store the graph
topology in GPU memory by specifying graph_mode='CUDA' in graphlearn_torch.data.Dataset.build.
The split_ratio parameter determines the fraction of feature data to be cached in GPU memory.
By default, GLT sorts the vertices in descending order of in-degree
and selects the vertices with the highest in-degree for feature caching. The default sort
function passed to graphlearn_torch.data.Dataset.build is
graphlearn_torch.data.reorder.sort_by_in_degree. Users can also supply their own sort functions with compatible APIs.
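To make the interplay of in-degree ordering and split_ratio concrete, here is a minimal pure-Python sketch. It does not reproduce the signature or return values of GLT's sort functions: `sort_by_in_degree_sketch` and `select_hot` are hypothetical helpers, and the tiny edge list is made up for illustration.

```python
def sort_by_in_degree_sketch(edge_index, num_nodes):
    # edge_index is a (sources, destinations) pair, as in PyG's edge_index layout.
    # The in-degree of a vertex is the number of edges that point to it.
    _, dst = edge_index
    in_degree = [0] * num_nodes
    for v in dst:
        in_degree[v] += 1
    # Descending in-degree; ties are broken by the smaller vertex id.
    order = sorted(range(num_nodes), key=lambda v: (-in_degree[v], v))
    return order, in_degree


def select_hot(order, split_ratio):
    # Keep the leading fraction of the ordering as the set of cached vertices.
    num_hot = int(len(order) * split_ratio)
    return order[:num_hot]


src = [0, 1, 2, 3, 4, 0, 1]
dst = [2, 2, 2, 4, 2, 4, 3]
order, in_degree = sort_by_in_degree_sketch((src, dst), num_nodes=5)
hot = select_hot(order, split_ratio=0.4)
print(in_degree)
print(order)
print(hot)
```

A custom ordering (for example, by access frequency measured in a warm-up epoch) would only need to replace the sort key; the split step stays the same.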
For PyTorch DDP distributed training, there are usually several steps as follows:
First, load the graph and features from partitions.
import os  # needed for os.environ below
import os.path as osp
import torch
import graphlearn_torch as glt
# load from partitions and create distributed dataset.
# Partitions are generated by following script:
# `python partition_ogbn_dataset.py --dataset=ogbn-products --num_partitions=2`
root = osp.join(osp.dirname(osp.realpath(__file__)), '..', '..', 'data', 'products')
glt_dataset = glt.distributed.DistDataset()
glt_dataset.load(
  num_partitions=2,
  partition_idx=int(os.environ['RANK']),
  graph_dir=osp.join(root, 'ogbn-products-graph-partitions'),
  feature_dir=osp.join(root, 'ogbn-products-feature-partitions'),
  label_file=osp.join(root, 'ogbn-products-label', 'label.pt') # whole label
)
train_idx = torch.load(osp.join(root, 'ogbn-products-train-partitions',
                                'partition' + str(os.environ['RANK']) + '.pt'))
Second, create a distributed neighbor loader based on the dataset above.
# distributed neighbor loader
train_loader = glt.distributed.DistNeighborLoader(
  data=glt_dataset,
  num_neighbors=[15, 10, 5],
  input_nodes=train_idx,
  batch_size=batch_size,
  drop_last=True,
  collect_features=True,
  to_device=torch.device(rank % torch.cuda.device_count()),
  worker_options=glt.distributed.MpDistSamplingWorkerOptions(
    num_workers=nsampling_proc_per_train,
    worker_devices=[torch.device('cuda', (i + rank) % torch.cuda.device_count())
                    for i in range(nsampling_proc_per_train)],
    worker_concurrency=4,
    master_addr='localhost',
    master_port=12345, # different from port in pytorch training.
    channel_size='2GB',
    pin_memory=True
  )
)
Finally, define the DDP model and run training.
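import torch.distributed as dist  # used by dist.barrier() below
import torch.nn.functional as F  # used by F.nll_loss below
from torch.nn.parallel import DistributedDataParallel
from torch_geometric.nn import GraphSAGE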
# DDP model
model = GraphSAGE(
  in_channels=num_features,
  hidden_channels=256,
  num_layers=3,
  out_channels=num_classes,
).to(rank)
model = DistributedDataParallel(model, device_ids=[rank])
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# training.
for epoch in range(0, epochs):
  model.train()
  for batch in train_loader:
    optimizer.zero_grad()
    out = model(batch.x, batch.edge_index)[:batch.batch_size].log_softmax(dim=-1)
    loss = F.nll_loss(out, batch.y[:batch.batch_size])
    loss.backward()
    optimizer.step()
  dist.barrier()
The training scripts for 2 nodes, each with 2 GPUs, are as follows:
# node 0:
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --use_env --nnodes=2 --node_rank=0 --master_addr=xxx dist_train_sage_supervised.py
# node 1:
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --use_env --nnodes=2 --node_rank=1 --master_addr=xxx dist_train_sage_supervised.py
Full code can be found in the distributed training example.
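The snippets above read RANK from the environment and use a rank variable without showing where it comes from. With --use_env, the PyTorch launcher exports RANK, LOCAL_RANK, and WORLD_SIZE to each process. The sketch below shows one way a script might read them; `read_launch_env` and `pick_device_index` are hypothetical helpers, not part of GLT or PyTorch, and the fake environment is used only so the example runs without a launcher.

```python
import os


def read_launch_env(environ=os.environ):
    # Fall back to single-process defaults when the launcher variables are absent.
    rank = int(environ.get("RANK", 0))
    local_rank = int(environ.get("LOCAL_RANK", 0))
    world_size = int(environ.get("WORLD_SIZE", 1))
    return rank, local_rank, world_size


def pick_device_index(rank, num_devices):
    # Mirrors the `rank % torch.cuda.device_count()` expression in the loader
    # snippet; num_devices is passed in so the sketch runs without a GPU.
    return rank % num_devices


# Simulated environment for the last process of a 2-node, 2-GPU-per-node job.
fake_env = {"RANK": "3", "LOCAL_RANK": "1", "WORLD_SIZE": "4"}
rank, local_rank, world_size = read_launch_env(fake_env)
device_index = pick_device_index(rank, num_devices=2)
print(rank, local_rank, world_size, device_index)
```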
