TextOCVP: Text-Conditioned Object-Centric Video Prediction

Official implementation of: Object-Centric Image to Video Generation with Language Guidance by Angel Villar-Corrales, Gjergj Plepi and Sven Behnke. arXiv preprint, 2025.

[Paper]    [Project Page]    [BibTeX]

[Figure: TextOCVP model overview and Text-Conditioned Predictor, animated with the example prompt: "the medium green rubber cone covers the gold snitch. the large purple rubber cone is picked up and placed to (-1, 3). the large yellow rubber cone is sliding to (2, 3). the small gold metal snitch is picked up and placed to (-3, 1). the medium green metal sphere is sliding to (2, 1). the small brown metal cube is picked up and placed to (-3, 1)."]

Installation and Dataset Preparation

  1. Clone the repository and install all required packages in our conda environment:
git clone git@github.com:angelvillar96/TextOCVP.git
cd TextOCVP
conda env create -f environment.yml
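After creating the environment, activate it before running any of the commands below. A minimal sketch, assuming the environment defined in environment.yml is named TextOCVP (check the name field at the top of that file):

conda activate TextOCVP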
  2. Download and extract the pretrained models, including checkpoints for the SAVi decomposition, predictor modules and behaviour modules:
chmod +x download_pretrained.sh
./download_pretrained.sh
  3. Download the datasets and place them in the datasets folder (a sketch of the expected layout follows this list).
  • CATER: You can download the CATER dataset from the links provided in the original MAGE repository: CATER Easy and CATER Hard.

  • CLIPort: Contact the authors at villar@ais.uni-bonn.de to get access to this data.
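As a rough sketch of the expected layout (the folder names here are hypothetical; match them to the dataset paths in your experiment configuration):

datasets/
├── CATER_Easy/
├── CATER_Hard/
└── CLIPort/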

Training

We refer to docs/TRAIN.md for detailed instructions on training TextOCVP. It covers all training stages: first training the Object-Centric Video Decomposition model, then training the Text-Conditioned Predictor.
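As an illustration of the two-stage pipeline only (the script names and flags below are hypothetical; docs/TRAIN.md is the authoritative reference), a training run might look like:

# Hypothetical entry points; see docs/TRAIN.md for the actual commands.
python src/01_train_decomp_model.py -d experiments/TextOCVP_CATER/
python src/02_train_predictor.py -d experiments/TextOCVP_CATER/ --decomp_ckpt SAVi_CATER.pth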

Additionally, we provide instructions on how to support your own dataset.

Evaluation and Figure Generation

We provide bash scripts for evaluating our pretrained checkpoints and generating figures with them.
Simply run the scripts as:

./scripts/SCRIPT_NAME

Example:

./scripts/05_evaluate_TextOCVP_CATER.sh 
./scripts/06_generate_figs_pred_CATER.sh

./scripts/05_evaluate_TextOCVP_CLIPORT.sh 
./scripts/06_generate_figs_pred_CLIPORT.sh
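To run the full evaluation and figure-generation suite in one pass, a minimal bash sketch (assuming each script is self-contained and run from the repository root):

for script in scripts/05_*.sh scripts/06_*.sh; do
    bash "$script"
done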

Below we describe the individual evaluation and figure-generation scripts in more detail.

Evaluate Object-Centric Model for Video Decomposition

You can quantitatively and qualitatively evaluate an object-centric video decomposition model, i.e. SAVi or ExtendedDINOSAUR, using the src/03_evaluate_decomp_model.py and src/06_generate_figs_decomp_model.py scripts, respectively.

These scripts evaluate the model on the test set and generate figures of the results.

Example:

python src/03_evaluate_decomp_model.py \
    -d experiments/TextOCVP_CLIPort/ \
    --decomp_ckpt ExtendedDINOSAUR_CLIPort.pth \
    --results_name results_DecompModel

python src/06_generate_figs_decomp_model.py \
    -d experiments/TextOCVP_CLIPort/ \
    --decomp_ckpt ExtendedDINOSAUR_CLIPort.pth \
    --num_seqs 10

Evaluate TextOCVP for Object-Centric Image-to-Video Generation

You can quantitatively evaluate TextOCVP for video prediction using the src/05_evaluate_predictor.py script. This script takes pretrained Object-Centric Decomposition and TextOCVP checkpoints and evaluates the visual quality of the predicted frames.

Example:

python src/05_evaluate_predictor.py \
    -d experiments/TextOCVP_CLIPort/ \
    --decomp_ckpt ExtendedDINOSAUR_CLIPort.pth \
    --name_pred_exp TextOCVP \
    --pred_ckpt TextOCVP_CLIPort.pth \
    --results_name results_TextOCVP_NumSeed=1_NumPreds=19 \
    --num_seed 1 \
    --num_preds 19 \
    --batch_size 8
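The same script also covers the CATER experiment (see scripts/05_evaluate_TextOCVP_CATER.sh). A hedged sketch with hypothetical experiment and checkpoint names; substitute the ones fetched by download_pretrained.sh:

# Hypothetical names below; use the checkpoints from download_pretrained.sh.
python src/05_evaluate_predictor.py \
    -d experiments/TextOCVP_CATER/ \
    --decomp_ckpt SAVi_CATER.pth \
    --name_pred_exp TextOCVP \
    --pred_ckpt TextOCVP_CATER.pth \
    --num_seed 1 \
    --num_preds 19 \
    --batch_size 8

Here --num_seed sets the number of seed frames and --num_preds the number of future frames to predict.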

Similarly, you can qualitatively evaluate the models using the src/06_generate_figs_predictor.py script.

Example:

python src/06_generate_figs_predictor.py \
    -d experiments/TextOCVP_CLIPort/ \
    --decomp_ckpt ExtendedDINOSAUR_CLIPort.pth \
    --name_pred_exp TextOCVP \
    --pred_ckpt TextOCVP_CLIPort.pth \
    --num_preds 29 \
    --num_seqs 10
[Example outputs omitted. Generating figures with `src/06_generate_figs_predictor.py` should produce animations of the predicted video sequences.]

Related Works

If you find our work interesting, you may also want to check out our related works:

  • OCVP: Object-Centric Video Prediction via decoupling object dynamics and interactions.
  • SOLD: Model-based reinforcement learning with object-centric representations.
  • PlaySlot: Learning inverse dynamics for controllable object-centric video prediction and planning.

Contact and Citation

This repository is maintained by Angel Villar-Corrales and Gjergj Plepi.

Please consider citing our paper if you find our work or repository helpful.

@article{villar_TextOCVP_2025,
  title={Object-Centric Image to Video Generation with Language Guidance},
  author={Villar-Corrales, Angel and Plepi, Gjergj and Behnke, Sven},
  journal={arXiv preprint arXiv:2502.11655},
  year={2025}
}

In case of any questions or problems regarding the project or repository, do not hesitate to contact the authors at villar@ais.uni-bonn.de.
