
RuntimeError: CUDA out of memory with 6M nodes, 8M edges on A100 GPU #370

Open
chi2liu opened this issue Aug 8, 2022 · 1 comment
chi2liu commented Aug 8, 2022

🐛 Bug

|-------------------------------------------------------------------------------------------------------|
    *** Running (`tmp_data.pt`, `unsup_graphsage`, `node_classification_dw`, `unsup_graphsage_mw`)
|-------------------------------------------------------------------------------------------------------|
Model Parameters: 1568
  0%|                                                                                | 0/500 [00:00<?, ?it/s]OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
  0%|                                                                                | 0/500 [00:47<?, ?it/s]
Traceback (most recent call last):
  File "generate_emb.py", line 12, in <module>
    outputs = generator(edge_index, x=x)
  File "/home/chiliu/miniconda3/envs/cogdl/lib/python3.7/site-packages/cogdl/pipelines.py", line 204, in __call__
    model = train(self.args)
  File "/home/chiliu/miniconda3/envs/cogdl/lib/python3.7/site-packages/cogdl/experiments.py", line 216, in train
    result = trainer.run(model_wrapper, dataset_wrapper)
  File "/home/chiliu/miniconda3/envs/cogdl/lib/python3.7/site-packages/cogdl/trainer/trainer.py", line 188, in run
    self.train(self.devices[0], model_w, dataset_w)
  File "/home/chiliu/miniconda3/envs/cogdl/lib/python3.7/site-packages/cogdl/trainer/trainer.py", line 334, in train
    training_loss = self.train_step(model_w, train_loader, optimizers, lr_schedulers, rank, scaler)
  File "/home/chiliu/miniconda3/envs/cogdl/lib/python3.7/site-packages/cogdl/trainer/trainer.py", line 468, in train_step
    loss = model_w.on_train_step(batch)
  File "/home/chiliu/miniconda3/envs/cogdl/lib/python3.7/site-packages/cogdl/wrappers/model_wrapper/base_model_wrapper.py", line 73, in on_train_step
    return self.train_step(*args, **kwargs)
  File "/home/chiliu/miniconda3/envs/cogdl/lib/python3.7/site-packages/cogdl/wrappers/model_wrapper/node_classification/unsup_graphsage_mw.py", line 43, in train_step
    neg_loss = -torch.log(torch.sigmoid(-torch.sum(x.unsqueeze(1).repeat(1, self.num_negative_samples, 1) * x[self.negative_samples], dim=-1))).mean()
RuntimeError: CUDA out of memory. Tried to allocate 11.02 GiB (GPU 0; 39.45 GiB total capacity; 29.23 GiB already allocated; 8.01 GiB free; 30.03 GiB reserved in total by PyTorch)
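
For context on the allocation: the failing line first calls x.unsqueeze(1).repeat(1, num_negative_samples, 1), so it materializes a [num_nodes, num_negative_samples, dim] copy of the embeddings for all ~6M nodes at once, on top of the gathered negatives x[self.negative_samples] and their element-wise product. A rough sketch of computing the same negative term without the explicit repeat (plain PyTorch, not CogDL's implementation; the negative_samples layout and chunk_size are assumptions):

import torch
import torch.nn.functional as F

def neg_loss_chunked(x, negative_samples, chunk_size=100_000):
    # x: [num_nodes, dim] node embeddings
    # negative_samples: [num_nodes, num_neg] indices of sampled negative nodes
    total, count = 0.0, 0
    for start in range(0, x.size(0), chunk_size):
        x_chunk = x[start:start + chunk_size]                      # [B, dim]
        neg_chunk = x[negative_samples[start:start + chunk_size]]  # [B, num_neg, dim]
        # einsum takes the dot products directly, skipping the repeated copy of
        # x and the full element-wise-product temporary
        scores = torch.einsum("bd,bkd->bk", x_chunk, neg_chunk)
        total = total + F.logsigmoid(-scores).sum()
        count += scores.numel()
    return -total / count

This only trims the temporaries created at the loss itself; activations saved for backward and the full-graph forward pass still scale with the whole graph.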

To Reproduce

Steps to reproduce the behavior:

from cogdl import pipeline
import numpy as np
import pandas as pd

# build a pipeline for generating embeddings using unsupervised GNNs
# pass the model name and num_features with its hyper-parameters to this API
graph = pd.read_csv("G1.weighted.edgelist", header=None, sep=' ')
edge_index = graph[[0, 1]].to_numpy()
edge_weight = graph[[2]].to_numpy(dtype=np.float16)
e = pd.read_csv("vertex_embeddings.csv", header=None, sep=' ')
x = e.iloc[:, :32].to_numpy(dtype=np.float16)
generator = pipeline("generate-emb", model="unsup_graphsage", no_test=True, num_features=32, hidden_size=16, walk_length=2, sample_size=[4, 2], is_large=True)
outputs = generator(edge_index, x=x)
pd.DataFrame(outputs).to_csv("embeddings.csv", header=False, index=False)

The graph has ~6M nodes and ~8M edges; the GPU is an A100 with 40 GB of memory.

Expected behavior

Environment

  • CogDL version: 0.5.3
  • OS (e.g., Linux): Ubuntu
  • Python version: 3.7
  • PyTorch version: 1.9.1.post3
  • CUDA/cuDNN version (if applicable): 11.7
  • Any other relevant information:

Additional context

cenyk1230 (Member) commented:
Hi @chi2liu,

Thanks for your interest in CogDL. It seems that unsup_graphsage currently uses full-batch training. We are looking into this issue now.
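
A rough sketch of what an edge-mini-batch version of this objective could look like (generic PyTorch, not CogDL code; encoder, batch_size, and num_neg are placeholder assumptions), so that the pairwise loss terms scale with the batch rather than with all 6M nodes:

import torch
import torch.nn.functional as F

def train_step_minibatch(encoder, optimizer, x, edge_index, batch_size=20_000, num_neg=5):
    # encoder(edge_index, x) -> [num_nodes, dim] embeddings (assumed interface);
    # x and edge_index are assumed to already live on the training device
    emb = encoder(edge_index, x)
    num_nodes, num_edges = x.size(0), edge_index.size(1)

    # positive pairs: a random subset of edges
    eids = torch.randint(num_edges, (batch_size,), device=edge_index.device)
    src, dst = emb[edge_index[0, eids]], emb[edge_index[1, eids]]
    pos_loss = -F.logsigmoid((src * dst).sum(dim=-1)).mean()

    # negatives: uniformly sampled nodes for each source node in the batch
    neg_idx = torch.randint(num_nodes, (batch_size, num_neg), device=emb.device)
    neg_loss = -F.logsigmoid(-torch.einsum("bd,bkd->bk", src, emb[neg_idx])).mean()

    loss = pos_loss + neg_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Note that the encoder forward here is still over the full graph, so this sketch only addresses the loss-side allocation shown in the traceback above.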
