NAN Loss in Training Stage-1 VAE All Class Model #83

@amogh-tiwari

Description

Hi @ZENGXH,

Thanks a lot for your very interesting work, and for making it open source.

I’m trying to re-train the stage-1 all-class VAE and had a few questions.

  1. NaNs during stage-1 all-class training
    I’m running into NaNs after 29 epochs (the loss starts to diverge after 16 epochs) when training the stage-1 all-class VAE (log file here).
    I used the command suggested in the config, with all default settings (default config file, batch size 32), on 4×A100 GPUs, using the ShapeNetCore15K data from PointFlow. I’m currently experimenting with some of the hyperparameter modifications suggested in NaN loss while training stage 1 VAE #47 , but I’m a bit surprised to see NaNs even with the default setup. Do you have any other suggestions about this?
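While debugging this, I’ve been experimenting with a generic guard around the optimizer step. This is a pure-NumPy sketch of the idea (skip the update when any gradient is non-finite, and clip the global gradient norm), not LION’s actual training code; the function names, the clipping threshold, and the skip logic are all my own assumptions:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Scale a list of gradient arrays so their global L2 norm is <= max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = max_norm / (total_norm + 1e-12)
    if scale < 1.0:
        grads = [g * scale for g in grads]
    return grads, total_norm

def safe_step(params, grads, lr=1e-4, max_norm=1.0):
    """Apply one SGD step, skipping it entirely if any gradient is non-finite.

    Returns (updated_params, step_was_applied).
    """
    if not all(np.all(np.isfinite(g)) for g in grads):
        # Skip this step instead of letting a NaN poison the weights.
        return params, False
    grads, _ = clip_by_global_norm(grads, max_norm)
    params = [p - lr * g for p, g in zip(params, grads)]
    return params, True
```

In PyTorch the same pattern would use `torch.nn.utils.clip_grad_norm_` plus a finiteness check on the loss before `optimizer.step()`.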

  2. Motivation behind normalize_shape_global=False and normalize_shape_box=True
    I’m curious about the design choice of using normalize_shape_global=False and normalize_shape_box=True.
    My eventual goal is to obtain size-and-shape-aware embeddings, so I’m wondering what happens if I instead train with: normalize_shape_global=True, normalize_shape_box=False?

    Does training still work in this case (perhaps with slightly worse reconstruction quality), or does it tend to break down entirely? Any intuition here would be very helpful.
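For reference, here is my reading of the distinction between the two modes, as a NumPy sketch. This is my own interpretation of what "box" vs. "global" normalization means, not the repo’s implementation, and the function names are mine:

```python
import numpy as np

def normalize_per_shape(points):
    """'Box'-style normalization: center each shape and scale it by its own
    extent, so per-shape size information is discarded."""
    center = (points.max(axis=0) + points.min(axis=0)) / 2.0
    scale = np.abs(points - center).max()
    return (points - center) / scale

def normalize_global(points, global_center, global_scale):
    """'Global'-style normalization: use one dataset-wide center and scale,
    so relative sizes across different shapes are preserved."""
    return (points - global_center) / global_scale
```

If that reading is right, the global variant is the one that would keep the size signal I’m after.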

  3. Training time for the all-class model
    I’m currently seeing ~20 epochs/hour on 4×A100s. With the default config specifying ~8000 epochs, that works out to roughly 8000 / 20 = 400 hours, i.e. about 17 days of training.
    In the supplementary material, I could only find training time details for the single-class model, not the all-class model.
    Could you share roughly how long the all-class stage-1 model took to train in your setup?

  4. Viewing training / validation curves
    Is there a built-in way to view training/validation plots (e.g., loss curves)?
    At the moment, I only see checkpoints and log files being saved, but no plots. Is there a flag or logger option (e.g., TensorBoard/W&B) that I might be missing?
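In the meantime I’ve been pulling loss curves out of the log files with a small script. The regex below assumes a hypothetical "epoch N ... loss: X" line format, which may not match the actual log layout:

```python
import re

# Hypothetical log-line format; adjust the pattern to the real logs.
LINE_RE = re.compile(r"epoch\s+(\d+).*?loss[:=]\s*([-+0-9.eE]+)")

def parse_losses(lines):
    """Extract (epoch, loss) pairs from raw log lines, skipping non-matches."""
    curve = []
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            curve.append((int(m.group(1)), float(m.group(2))))
    return curve
```

The resulting pairs can then be plotted with matplotlib, but a built-in TensorBoard/W&B flag would obviously be nicer if one exists.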

  5. Follow-Up Works
    I was also wondering whether there are any interesting follow-up works to LION—either by you or by other groups—that you’re aware of.

    In particular, I’m interested in approaches involving a multi-class VAE trained with a reconstruction objective, which also incorporates size information. My goal is to have some pretrained multi-class size-and-shape-aware embeddings, which I can then use for a downstream task.

Thanks again for releasing the code and for any guidance you can share.
