NAN Loss in Training Stage-1 VAE All Class Model #83

@amogh-tiwari

Description

Hi @ZENGXH,

Thanks a lot for your very interesting work, and for making it open source.

I’m trying to re-train the stage-1 all-class VAE and had a few questions.

  1. NaNs during stage-1 all-class training
    I’m running into NaNs after 29 epochs (the loss starts to diverge after 16 epochs) when training the stage-1 all-class VAE (log file here).
    I used the command suggested in the config, with all default settings (default config file, batch size 32), on 4×A100 GPUs, using the ShapeNetCore15K data from PointFlow. I’m currently experimenting with some of the hyperparameter modifications suggested in NaN loss while training stage 1 VAE #47 , but I’m a bit surprised to see NaNs even with the default setup. Do you have any other suggestions about this?
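While debugging this, I’ve been experimenting with a generic guard around the optimizer step. This is a pure-NumPy sketch of the idea (skip the update when any gradient is non-finite, and clip the global gradient norm), not LION’s actual training code; the function names, the clipping threshold, and the skip logic are all my own assumptions:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Scale a list of gradient arrays so their global L2 norm is <= max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = max_norm / (total_norm + 1e-12)
    if scale < 1.0:
        grads = [g * scale for g in grads]
    return grads, total_norm

def safe_step(params, grads, lr=1e-4, max_norm=1.0):
    """Apply one SGD step, skipping it entirely if any gradient is non-finite.

    Returns (updated_params, step_was_applied).
    """
    if not all(np.all(np.isfinite(g)) for g in grads):
        # Skip this step instead of letting a NaN poison the weights.
        return params, False
    grads, _ = clip_by_global_norm(grads, max_norm)
    params = [p - lr * g for p, g in zip(params, grads)]
    return params, True
```

In PyTorch the same pattern would use `torch.nn.utils.clip_grad_norm_` plus a finiteness check on the loss before `optimizer.step()`.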

  2. Motivation behind normalize_shape_global=False and normalize_shape_box=True
    I’m curious about the design choice of using normalize_shape_global=False and normalize_shape_box=True.
    My eventual goal is to obtain size-and-shape-aware embeddings, so I’m wondering what happens if I instead train with: normalize_shape_global=True, normalize_shape_box=False?

    Does training still work in this case (perhaps with slightly worse reconstruction quality), or does it tend to break down entirely? Any intuition here would be very helpful.
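For reference, here is my reading of the distinction between the two modes, as a NumPy sketch. This is my own interpretation of what "box" vs. "global" normalization means, not the repo’s implementation, and the function names are mine:

```python
import numpy as np

def normalize_per_shape(points):
    """'Box'-style normalization: center each shape and scale it by its own
    extent, so per-shape size information is discarded."""
    center = (points.max(axis=0) + points.min(axis=0)) / 2.0
    scale = np.abs(points - center).max()
    return (points - center) / scale

def normalize_global(points, global_center, global_scale):
    """'Global'-style normalization: use one dataset-wide center and scale,
    so relative sizes across different shapes are preserved."""
    return (points - global_center) / global_scale
```

If that reading is right, the global variant is the one that would keep the size signal I’m after.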

  3. Training time for the all-class model
    I’m currently seeing ~20 epochs/hour on 4×A100s. With the default config specifying ~8000 epochs, that works out to roughly 8000 / 20 = 400 hours, i.e. about 17 days of training.
    In the supplementary material, I could only find training time details for the single-class model, not the all-class model.
    Could you share roughly how long the all-class stage-1 model took to train in your setup?

  4. Viewing training / validation curves
    Is there a built-in way to view training/validation plots (e.g., loss curves)?
    At the moment, I only see checkpoints and log files being saved, but no plots. Is there a flag or logger option (e.g., TensorBoard/W&B) that I might be missing?
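In the meantime I’ve been pulling loss curves out of the log files with a small script. The regex below assumes a hypothetical "epoch N ... loss: X" line format, which may not match the actual log layout:

```python
import re

# Hypothetical log-line format; adjust the pattern to the real logs.
LINE_RE = re.compile(r"epoch\s+(\d+).*?loss[:=]\s*([-+0-9.eE]+)")

def parse_losses(lines):
    """Extract (epoch, loss) pairs from raw log lines, skipping non-matches."""
    curve = []
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            curve.append((int(m.group(1)), float(m.group(2))))
    return curve
```

The resulting pairs can then be plotted with matplotlib, but a built-in TensorBoard/W&B flag would obviously be nicer if one exists.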

  5. Follow-Up Works
    I was also wondering whether there are any interesting follow-up works to LION—either by you or by other groups—that you’re aware of.

    In particular, I’m interested in approaches involving a multi-class VAE trained with a reconstruction objective, which also incorporates size information. My goal is to have some pretrained multi-class size-and-shape-aware embeddings, which I can then use for a downstream task.

Thanks again for releasing the code and for any guidance you can share.
