
torch.cuda.OutOfMemoryError: CUDA out of memory #7

@JiaLonghao1997

Description

We noticed that the paper uses large datasets such as ogbn-Arxiv and ogbn-Papers100M, but when we tested on our own server, we ran out of memory.
Roughly how much memory does training on the ogbn-Arxiv dataset consume?
Do you have any suggestions for running GraphMAE2 on, or more generally handling, large graphs at the million-node scale?
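One way to answer the memory question empirically is to log PyTorch's own peak-memory counters each epoch. A minimal sketch, not part of the GraphMAE2 codebase; `log_peak_memory` is a hypothetical helper to be called at the end of each epoch in the pretraining loop, with `device` defaulting to GPU 0 as in the run below:

```python
# Minimal sketch (assumption: called once per epoch inside the training loop)
# for measuring peak GPU memory during pretraining with PyTorch's built-in counters.
import torch

def log_peak_memory(epoch: int, device: int = 0) -> None:
    peak_mb = torch.cuda.max_memory_allocated(device) / 1024 ** 2
    print(f"epoch {epoch}: peak GPU memory {peak_mb:.1f} MB")
    torch.cuda.reset_peak_memory_stats(device)  # start fresh for the next epoch
```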

(graphmb) [jialh@gpu07 GraphMAE2]$ sh 01run_ogbn-arxiv.sh
2024-04-22 21:08:20,721 - INFO - ----- Using best configs from configs/ogbn-arxiv.yaml -----
Namespace(seeds=[0], dataset='ogbn-arxiv', device=0, max_epoch=60, warmup_steps=-1, num_heads=8, num_out_heads=1, num_layers=4, num_dec_layers=1, num_remasking=3, num_hidden=1024, residual=True, in_drop=0.2, attn_drop=0.1, norm='layernorm', lr=0.0025, weight_decay=0.06, negative_slope=0.2, activation='prelu', mask_rate=0.5, remask_rate=0.5, remask_method='random', mask_type='mask', mask_method='random', drop_edge_rate=0.5, drop_edge_rate_f=0.0, encoder='gat', decoder='gat', loss_fn='sce', alpha_l=6, optimizer='adamw', max_epoch_f=1000, lr_f=0.005, weight_decay_f=0.0001, linear_prob=True, no_pretrain=False, load_model=False, checkpoint_path=None, use_cfg=True, logging=False, scheduler=True, batch_size=512, batch_size_f=256, sampling_method='lc', label_rate=1.0, ego_graph_file_path='./lc_ego_graphs/ogbn-arxiv-lc-ego-graphs-256.pt', data_dir='./dataset', lam=10.0, full_graph_forward=False, delayed_ema_epoch=40, replace_rate=0.0, momentum=0.996)
2024-04-22 21:08:21,362 - INFO - Before loading data, occupied memory: 353.75 MB
2024-04-22 21:08:21,362 - INFO - ego_graph_file_path: ./lc_ego_graphs/ogbn-arxiv-lc-ego-graphs-256.pt
2024-04-22 21:08:21,678 - INFO - --- to undirected graph ---
2024-04-22 21:08:22,452 - INFO - ### scaling features ###
2024-04-22 21:08:25,297 - INFO - After loading data, occupied memory: 968.62 MB
=== Use sce_loss and alpha_l=6 ===
num_encoder_params: 3428356, num_decoder_params: 131456, num_params_in_total: 6184710
2024-04-22 21:08:25,420 - INFO - ---- start pretraining ----
2024-04-22 21:08:25,420 - INFO - start training..
2024-04-22 21:08:26,714 - INFO - After creating dataloader: Memory: 1856.79 MB
2024-04-22 21:08:26,715 - INFO - Use scheduler
  0%|                                                                                                                                        | 0/331 [00:12<?, ?it/s]
Traceback (most recent call last):
  File "/public/home/jialh/metaHiC/models/GraphMAE2/main_large.py", line 222, in <module>
    model = pretrain(model, feats, graph, pretrain_ego_graph_nodes, max_epoch=max_epoch,
  File "/public/home/jialh/metaHiC/models/GraphMAE2/main_large.py", line 120, in pretrain
    loss = model(batch_g, x, targets, epoch, drop_g1, drop_g2)
  File "/home1/jialh/tools/anaconda3/envs/mamba/envs/graphmb/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/public/home/jialh/metaHiC/models/GraphMAE2/models/edcoder.py", line 231, in forward
    loss = self.mask_attr_prediction(g, x, targets, epoch, drop_g1, drop_g2)
  File "/public/home/jialh/metaHiC/models/GraphMAE2/models/edcoder.py", line 243, in mask_attr_prediction
    latent_target = self.encoder_ema(drop_g2, x,)
  File "/home1/jialh/tools/anaconda3/envs/mamba/envs/graphmb/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/public/home/jialh/metaHiC/models/GraphMAE2/models/gat.py", line 76, in forward
    h = self.gat_layers[l](g, h)
  File "/home1/jialh/tools/anaconda3/envs/mamba/envs/graphmb/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/public/home/jialh/metaHiC/models/GraphMAE2/models/gat.py", line 282, in forward
    rst = rst + self.bias.view(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 508.00 MiB (GPU 0; 10.91 GiB total capacity; 9.88 GiB already allocated; 52.06 MiB free; 10.19 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
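Two mitigations follow directly from the error message and the run configuration above; both are sketches under assumptions, not tuned fixes. First, the allocator workaround the message itself names: set PYTORCH_CUDA_ALLOC_CONF before the first CUDA allocation, e.g. at the top of main_large.py or in the shell before `sh 01run_ogbn-arxiv.sh`. The 128 MiB split size below is an illustrative value, not a recommendation:

```python
# Minimal sketch of the allocator workaround named in the OOM message.
# The split size (128 MiB) is an assumed, illustrative value; the variable
# must be set before the first CUDA allocation, so before importing torch.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # import (and any .cuda() calls) only after setting the variable
```

Second, since allocated (9.88 GiB) is close to reserved (10.19 GiB), fragmentation alone is unlikely to be the whole story on this 11 GiB card: lowering `batch_size` (512 in the log) and `num_hidden` (1024) in the run script should reduce peak activation memory, at some cost to throughput and possibly accuracy.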
