VED: Monocular Semantic Occupancy Grid Mapping with Convolutional Variational Encoder-Decoder Networks
September 2020
tl;dr: Use a variational autoencoder for semantic occupancy grid map prediction.
Variational encoder-decoder (VED) encodes the front-view visual information of the driving scene and subsequently decodes it into a BEV semantic occupancy grid.
The proposed method beats a vanilla SegNet (a relatively strong baseline for conventional semantic segmentation). A 2x1 pooling layer is used to accommodate the different aspect ratios of the input and output (see the sketch below).
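A minimal sketch of how such asymmetric pooling could look in PyTorch; the channel counts and feature-map sizes are illustrative, only the 2x1 pooling idea comes from the paper:

```python
import torch
import torch.nn as nn

# Illustrative only: a 2x1 max-pooling stage shrinks the vertical dimension
# twice as fast as the horizontal one, nudging a wide front-view feature map
# toward the aspect ratio of the BEV occupancy grid.
pool = nn.MaxPool2d(kernel_size=(2, 1), stride=(2, 1))

x = torch.randn(1, 64, 128, 256)   # (batch, channels, H, W) front-view features
y = pool(x)
print(y.shape)                     # torch.Size([1, 64, 64, 256])
```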
GT generation uses a disparity map from stereo matching; this process may be noisy.
- View transformation: VAE with sampling.
- A binary occupancy grid is a decades-old concept, but a semantic occupancy grid is more powerful and enables more efficient and reliable navigation.
- A Variational AutoEncoder (VAE, referred to as VED in this paper) forces the latent space toward a normal distribution, so a KL-divergence loss can be added to encourage the latent distribution to match a standard normal. The paper mainly wants to exploit the sampling in VED for robustness to the imperfect GT (see the sketch after this list).
- VED exhibits intrinsic invariance wrt pitch and roll perturbations, compared to a monocular baseline relying on the flat-ground assumption.
- It is more robust to pitch and roll perturbations and also generalizes better to unseen scenarios.
- The PCA components of the latent space do encode some interpretable structure (see the PCA sketch at the end).
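A minimal sketch of the latent sampling (reparameterization) and KL terms described above, assuming a flattened encoder feature `h`; `LatentSampler`, `feat_dim`, `latent_dim`, and the loss weighting `beta` are hypothetical names/values, not the paper's actual architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentSampler(nn.Module):
    """Maps encoder features to a Gaussian latent and samples from it."""
    def __init__(self, feat_dim=512, latent_dim=128):  # dimensions are illustrative
        super().__init__()
        self.to_mu = nn.Linear(feat_dim, latent_dim)
        self.to_logvar = nn.Linear(feat_dim, latent_dim)

    def forward(self, h):
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample z while keeping gradients wrt mu, logvar.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        # KL divergence between N(mu, sigma^2) and N(0, I), pushing the latent
        # toward a standard normal distribution.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, kl

# The total loss would combine the BEV semantic segmentation loss with the KL term, e.g.:
# loss = F.cross_entropy(pred_bev_logits, gt_bev) + beta * kl
```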
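To probe the interpretability of the latent space, one could run PCA over the posterior means collected from a validation set and decode perturbations along the leading components. A hedged sketch using scikit-learn; `latent_mu` is placeholder data and the decoder step is not shown:

```python
import numpy as np
from sklearn.decomposition import PCA

# Assume `latent_mu` holds the posterior means (one row per image),
# shape (num_samples, latent_dim). Random data stands in for real latents here.
latent_mu = np.random.randn(1000, 128)

pca = PCA(n_components=10)
coords = pca.fit_transform(latent_mu)        # project latents onto the top-10 components
print(pca.explained_variance_ratio_)         # how much variance each component explains

# Walking along a principal component and decoding the perturbed latent
# (decoder not shown) reveals what scene property that component controls.
direction = pca.components_[0]               # first PCA direction in latent space
z_walk = latent_mu.mean(0) + 3.0 * direction # perturb the mean latent along it
```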