Official implementation of: *Object-Centric Image to Video Generation with Language Guidance* by Angel Villar-Corrales, Gjergj Plepi and Sven Behnke. arXiv preprint, 2025.
[Paper] [Project Page] [BibTeX]
*(Figures: TextOCVP model overview and the Text-Conditioned Predictor.)*
- Clone the repository and install all required packages in our `conda` environment:

```shell
git clone git@github.com:angelvillar96/TextOCVP.git
cd TextOCVP
conda env create -f environment.yml
```
- Download and extract the pretrained models, including checkpoints for the SAVi decomposition, predictor, and behaviour modules:

```shell
chmod +x download_pretrained.sh
./download_pretrained.sh
```
- Download the datasets and place them in the `datasets` folder.
  - CATER: You can download the CATER dataset from the links provided in the original MAGE repository: CATER Easy, CATER Hard.
  - CLIPort: Contact the authors at villar@ais.uni-bonn.de to get access to this data.
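As a rough sketch, the `datasets` folder can be laid out as below. The subdirectory names here are assumptions for illustration; adjust them to match the paths expected by the data configuration in the repository:

```shell
# Illustrative dataset layout (directory names are assumed, not
# taken from the repository's configs):
mkdir -p datasets/CATER_easy
mkdir -p datasets/CATER_hard
mkdir -p datasets/CLIPort
```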
We refer to `docs/TRAIN.md` for detailed instructions for training TextOCVP. We include instructions for all training stages: training the object-centric video decomposition model as well as training the Text-Conditioned Predictor.
Additionally, we provide instructions on how to support your own dataset.
We provide bash scripts for evaluating our pretrained checkpoints and generating figures. Simply run the desired script:

```shell
./scripts/SCRIPT_NAME
```

Example:

```shell
./scripts/05_evaluate_TextOCVP_CATER.sh
./scripts/06_generate_figs_pred_CATER.sh
./scripts/05_evaluate_TextOCVP_CLIPORT.sh
./scripts/06_generate_figs_pred_CLIPORT.sh
```
Below we describe the individual evaluation and figure-generation scripts in more detail.
You can quantitatively and qualitatively evaluate an object-centric video decomposition model (i.e. SAVi or ExtendedDINOSAUR) using the `src/03_evaluate_decomp_model.py` and `src/06_generate_figs_decomp_model.py` scripts, respectively.
These scripts evaluate the model on the test set and generate figures for the results.
Example:

```shell
python src/03_evaluate_decomp_model.py \
    -d experiments/TextOCVP_CLIPort/ \
    --decomp_ckpt ExtendedDINOSAUR_CLIPort.pth \
    --results_name results_DecompModel

python src/06_generate_figs_decomp_model.py \
    -d experiments/TextOCVP_CLIPort/ \
    --decomp_ckpt ExtendedDINOSAUR_CLIPort.pth \
    --num_seqs 10
```
You can quantitatively evaluate TextOCVP for video prediction using the `src/05_evaluate_predictor.py` script.
This script takes pretrained object-centric decomposition and TextOCVP checkpoints and evaluates the visual quality of the predicted frames.
Example:

```shell
python src/05_evaluate_predictor.py \
    -d experiments/TextOCVP_CLIPort/ \
    --decomp_ckpt ExtendedDINOSAUR_CLIPort.pth \
    --name_pred_exp TextOCVP \
    --pred_ckpt TextOCVP_CLIPort.pth \
    --results_name results_TextOCVP_NumSeed=1_NumPreds=19 \
    --num_seed 1 \
    --num_preds 19 \
    --batch_size 8
```
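Before launching an evaluation, it can be useful to sanity-check that a downloaded checkpoint loads correctly. The sketch below assumes the checkpoints are ordinary `torch.save` files; the `"model_state_dict"` key is a common convention but hypothetical here, so the helper falls back to treating the whole file as a state dict:

```python
# Sketch: inspect a checkpoint file before evaluation. Assumes a
# standard `torch.save` file; the "model_state_dict" key is an
# assumption, not confirmed by the repository.
import os
import tempfile

import torch


def summarize_checkpoint(path):
    """Return the state dict and print its size and parameter count."""
    ckpt = torch.load(path, map_location="cpu")
    state = ckpt.get("model_state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
    num_params = sum(t.numel() for t in state.values() if torch.is_tensor(t))
    print(f"{path}: {len(state)} entries, {num_params} tensor parameters")
    return state


# Self-contained demo with a dummy checkpoint:
ckpt_path = os.path.join(tempfile.gettempdir(), "dummy_ckpt.pth")
torch.save({"model_state_dict": {"encoder.weight": torch.zeros(4, 4)}}, ckpt_path)
state = summarize_checkpoint(ckpt_path)
```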
Similarly, you can qualitatively evaluate the models using the `src/06_generate_figs_predictor.py` script.
Example:

```shell
python src/06_generate_figs_predictor.py \
    -d experiments/TextOCVP_CLIPort/ \
    --decomp_ckpt ExtendedDINOSAUR_CLIPort.pth \
    --name_pred_exp TextOCVP \
    --pred_ckpt TextOCVP_CLIPort.pth \
    --num_preds 29 \
    --num_seqs 10
```
Example outputs of `src/06_generate_figs_predictor.py`: generating figures with TextOCVP should produce animations like the following.
*(Animated example outputs are shown on the project page.)*
If you find our work interesting, you may also want to check out our related works:
- OCVP: Object-Centric Video Prediction via decoupling object dynamics and interactions.
- SOLD: Model-based reinforcement learning with object-centric representations.
- PlaySlot: Learning inverse dynamics for controllable object-centric video prediction and planning.
This repository is maintained by Angel Villar-Corrales and Gjergj Plepi.
Please consider citing our paper if you find our work or repository helpful.
```bibtex
@article{villar_TextOCVP_2025,
  title={Object-Centric Image to Video Generation with Language Guidance},
  author={Villar-Corrales, Angel and Plepi, Gjergj and Behnke, Sven},
  journal={arXiv preprint arXiv:2502.11655},
  year={2025}
}
```
In case of any questions or problems regarding the project or repository, do not hesitate to contact the authors at villar@ais.uni-bonn.de.