This project was presented as a spotlight at CVPR 2018.
Humans have a strong ability to infer the appearance of the invisible and occluded parts of scenes. For example, when we look at the scene on the left, we can predict what is behind the coffee table, and can even complete the sofa based on its visible parts, the coffee table, and what we know in general about sofas and coffee tables and how they occlude each other.
SeGAN can learn to:
- Generate the appearance of the occluded parts of objects,
- Segment the invisible parts of objects,
- Reliably segment natural images, although it is trained on synthetic photo-realistic images,
- Infer depth layering by reasoning about occluder-occludee relations.
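To illustrate how occluder-occludee relations can yield a depth layering, here is a minimal sketch. It is not SeGAN's actual procedure: it assumes the pairwise relations are already available as hypothetical `(occluder, occludee)` pairs, and the function name `depth_layers` is invented for this example. Each object is placed one layer behind the deepest object that occludes it.

```python
def depth_layers(objects, occludes):
    """Assign each object a layer index; 0 is closest to the camera.

    occludes: iterable of (occluder, occludee) pairs (hypothetical input format).
    An object's layer is one more than the deepest layer among its occluders.
    """
    occluders_of = {}
    for front, back in occludes:
        occluders_of.setdefault(back, set()).add(front)

    layers = {}
    visiting = set()  # objects on the current recursion path, for cycle detection

    def layer(obj):
        if obj in layers:
            return layers[obj]
        if obj in visiting:
            raise ValueError(f"cyclic occlusion involving {obj!r}")
        visiting.add(obj)
        deepest = max((layer(o) for o in occluders_of.get(obj, ())), default=-1)
        visiting.discard(obj)
        layers[obj] = deepest + 1
        return layers[obj]

    for obj in objects:
        layer(obj)
    return layers


# Toy scene: the coffee table is in front of the sofa, which is in front of the wall.
relations = [("coffee table", "sofa"), ("sofa", "wall"), ("coffee table", "wall")]
layers = depth_layers(["wall", "sofa", "coffee table"], relations)
for name, index in sorted(layers.items(), key=lambda item: item[1]):
    print(index, name)
```

Mutual occlusion (two objects that each partly hide the other) produces a cycle, which this sketch reports as an error instead of guessing an order.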
If you find this project useful in your research, please consider citing:
@inproceedings{ehsani2018segan,
title={Segan: Segmenting and generating the invisible},
author={Ehsani, Kiana and Mottaghi, Roozbeh and Farhadi, Ali},
booktitle={CVPR},
year={2018}
}
- Torch 7 and dependencies from this repository
- Linux OS
- NVIDIA GPU + CUDA + CuDNN
- Clone the repository and enter it:

      git clone https://github.com/ehsanik/SeGAN
      cd SeGAN

- Download the dataset from here and extract it.

- Make a link to the dataset:

      ln -s /PATH/TO/DATASET dyce_data

- Download the pretrained weights from here and extract them.

- Make a link to the weights folder:

      ln -s /PATH/TO/WEIGHTS weights
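As a quick sanity check after the setup steps above, the following sketch verifies that the two links resolve to existing paths. It is not part of the repository; the helper name `missing_links` is hypothetical, and only the link names `dyce_data` and `weights` come from the commands above.

```python
import os

# Link names created by the `ln -s` commands in the setup steps.
REQUIRED_LINKS = ("dyce_data", "weights")


def missing_links(repo_root="."):
    """Return the required names that do not resolve to an existing path.

    os.path.exists follows symlinks, so a link whose target was moved or
    deleted is reported as missing too.
    """
    return [
        name
        for name in REQUIRED_LINKS
        if not os.path.exists(os.path.join(repo_root, name))
    ]


if __name__ == "__main__":
    missing = missing_links()
    if missing:
        print("Missing or broken links:", ", ".join(missing))
    else:
        print("Setup looks complete.")
```

Run it from the root of the cloned repository before training or testing.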
We introduce DYCE, a dataset of synthetic occluded objects. It contains photo-realistic images with natural configurations of objects in scenes. All images in the dataset are taken in indoor scenes. The annotations for each image contain the segmentation masks for the visible and invisible regions of objects. The images are obtained by taking snapshots from our 3D synthetic scenes.
We use 11 synthetic scenes: 7 for training and validation, and 4 for testing. Overall there are 5 living rooms and 6 kitchens, of which 2 living rooms and 2 kitchens are used for testing. On average, each scene contains 60 objects, and the number of visible objects per image is 17.5 (by visible we mean having at least 10 visible pixels). No object instance is shared between the train and test scenes.
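The visibility criterion and the relation between the visible and invisible masks can be sketched as follows. This is an illustration only: the annotation layout (a dict of per-object `"visible"` and `"invisible"` binary masks stored as nested lists) and the helper names are assumptions made for this example, not the DYCE file format. Only the 10-pixel threshold comes from the description above.

```python
MIN_VISIBLE_PIXELS = 10  # threshold from the dataset description


def pixel_count(mask):
    """Count the non-zero pixels of a binary mask given as a list of rows."""
    return sum(1 for row in mask for value in row if value)


def full_mask(visible, invisible):
    """Union of the visible and invisible regions of one object."""
    return [
        [int(bool(v) or bool(i)) for v, i in zip(v_row, i_row)]
        for v_row, i_row in zip(visible, invisible)
    ]


def visible_objects(annotations):
    """Names of objects with at least MIN_VISIBLE_PIXELS visible pixels."""
    return [
        name
        for name, masks in annotations.items()
        if pixel_count(masks["visible"]) >= MIN_VISIBLE_PIXELS
    ]


# Toy 4x4 annotations (hypothetical layout): the sofa shows 12 pixels, the mug 3.
annotations = {
    "sofa": {
        "visible": [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0]],
        "invisible": [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [1, 1, 1, 1]],
    },
    "mug": {
        "visible": [[1, 1, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]],
        "invisible": [[0, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]],
    },
}
print(visible_objects(annotations))
```

In this toy example only the sofa counts as visible, while the union of its two masks covers the whole 4x4 grid.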
The dataset can be downloaded from here.
To train your own model:
th main.lua -baseLR 1e-3 -end2end -istrain "train"
See data_settings.lua for additional command-line options.
To test using the pretrained model and reproduce the results in the paper:
Model | Segmentation: Visible ∪ Invisible | Segmentation: Visible | Segmentation: Invisible | Texture: L1 | Texture: L2 |
---|---|---|---|---|---|
Multipath | 47.51 | 48.58 | 6.01 | - | - |
SeGAN(ours) w/ SVpredicted | 68.78 | 64.76 | 15.59 | 0.070 | 0.023 |
SeGAN(ours) w/ SVgt | 75.71 | 68.05 | 23.26 | 0.026 | 0.008 |
th main.lua -weights_segmentation "weights/segment" -end2end -weights_texture "weights/texture" -istrain "test" -predictedSV
To test using the ground-truth visible mask as input instead of the predicted mask:
th main.lua -weights_segmentation "weights/segment_gt_sv" -end2end -weights_texture "weights/texture_gt_sv" -istrain "test"
The code for the GAN network borrows heavily from pix2pix.