TextOCVP: Text-Conditioned Object-Centric Video Prediction

Official implementation of: Object-Centric Image to Video Generation with Language Guidance by Angel Villar-Corrales, Gjergj Plepi and Sven Behnke. arXiv preprint, 2025.

[Paper]    [Project Page]    [BibTeX]

[Figure: TextOCVP model overview and Text-Conditioned Predictor, animated with the example prompt: "the medium green rubber cone covers the gold snitch. the large purple rubber cone is picked up and placed to (-1, 3). the large yellow rubber cone is sliding to (2, 3). the small gold metal snitch is picked up and placed to (-3, 1). the medium green metal sphere is sliding to (2, 1). the small brown metal cube is picked up and placed to (-3, 1)."]

Installation and Dataset Preparation

  1. Clone the repository and install all required packages in our conda environment:
git clone git@github.com:angelvillar96/TextOCVP.git
cd TextOCVP
conda env create -f environment.yml
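After creating the environment, activate it before running any of the commands below. A minimal sketch, assuming the environment defined in environment.yml is named TextOCVP (check the name field at the top of that file):

conda activate TextOCVP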
  2. Download and extract the pretrained models, including checkpoints for the SAVi decomposition, predictor modules and behaviour modules:
chmod +x download_pretrained.sh
./download_pretrained.sh
  3. Download the datasets and place them in the datasets folder (a sketch of the expected layout follows this list).
  • CATER: You can download the CATER dataset from the links provided in the original MAGE repository: CATER Easy and CATER Hard.

  • CLIPort: Contact the authors at villar@ais.uni-bonn.de to get access to this data.
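As a rough sketch of the expected layout (the folder names here are hypothetical; match them to the dataset paths in your experiment configuration):

datasets/
├── CATER_Easy/
├── CATER_Hard/
└── CLIPort/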

Training

We refer to docs/TRAIN.md for detailed instructions on training TextOCVP. It covers all training stages: first training the Object-Centric Video Decomposition model, then training the Text-Conditioned Predictor.
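As an illustration of the two-stage pipeline only (the script names and flags below are hypothetical; docs/TRAIN.md is the authoritative reference), a training run might look like:

# Hypothetical entry points; see docs/TRAIN.md for the actual commands.
python src/01_train_decomp_model.py -d experiments/TextOCVP_CATER/
python src/02_train_predictor.py -d experiments/TextOCVP_CATER/ --decomp_ckpt SAVi_CATER.pth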

Additionally, we provide instructions on how to support your own dataset.

Evaluation and Figure Generation

We provide bash scripts for evaluating our pretrained checkpoints and generating figures with them.
Simply run the scripts as:

./scripts/SCRIPT_NAME

Example:

./scripts/05_evaluate_TextOCVP_CATER.sh 
./scripts/06_generate_figs_pred_CATER.sh

./scripts/05_evaluate_TextOCVP_CLIPORT.sh 
./scripts/06_generate_figs_pred_CLIPORT.sh
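To run the full evaluation and figure-generation suite in one pass, a minimal bash sketch (assuming each script is self-contained and run from the repository root):

for script in scripts/05_*.sh scripts/06_*.sh; do
    bash "$script"
done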

Below we describe the individual evaluation and figure-generation scripts in more detail.

Evaluate Object-Centric Model for Video Decomposition

You can quantitatively and qualitatively evaluate an object-centric video decomposition model, i.e. SAVi or ExtendedDINOSAUR, using the src/03_evaluate_decomp_model.py and src/06_generate_figs_decomp_model.py scripts, respectively.

These scripts evaluate the model on the test set and generate figures of the results.

Example:

python src/03_evaluate_decomp_model.py \
    -d experiments/TextOCVP_CLIPort/ \
    --decomp_ckpt ExtendedDINOSAUR_CLIPort.pth \
    --results_name results_DecompModel

python src/06_generate_figs_decomp_model.py \
    -d experiments/TextOCVP_CLIPort/ \
    --decomp_ckpt ExtendedDINOSAUR_CLIPort.pth \
    --num_seqs 10

Evaluate TextOCVP for Object-Centric Image-to-Video Generation

You can quantitatively evaluate TextOCVP for video prediction using the src/05_evaluate_predictor.py script. This script takes pretrained Object-Centric Decomposition and TextOCVP checkpoints and evaluates the visual quality of the predicted frames.

Example:

python src/05_evaluate_predictor.py \
    -d experiments/TextOCVP_CLIPort/ \
    --decomp_ckpt ExtendedDINOSAUR_CLIPort.pth \
    --name_pred_exp TextOCVP \
    --pred_ckpt TextOCVP_CLIPort.pth \
    --results_name results_TextOCVP_NumSeed=1_NumPreds=19 \
    --num_seed 1 \
    --num_preds 19 \
    --batch_size 8
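The same script also covers the CATER experiment (see scripts/05_evaluate_TextOCVP_CATER.sh). A hedged sketch with hypothetical experiment and checkpoint names; substitute the ones fetched by download_pretrained.sh:

# Hypothetical names below; use the checkpoints from download_pretrained.sh.
python src/05_evaluate_predictor.py \
    -d experiments/TextOCVP_CATER/ \
    --decomp_ckpt SAVi_CATER.pth \
    --name_pred_exp TextOCVP \
    --pred_ckpt TextOCVP_CATER.pth \
    --num_seed 1 \
    --num_preds 19 \
    --batch_size 8

Here --num_seed sets the number of seed frames and --num_preds the number of future frames to predict.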

Similarly, you can qualitatively evaluate the models using the src/06_generate_figs_predictor.py script.

Example:

python src/06_generate_figs_predictor.py \
    -d experiments/TextOCVP_CLIPort/ \
    --decomp_ckpt ExtendedDINOSAUR_CLIPort.pth \
    --name_pred_exp TextOCVP \
    --pred_ckpt TextOCVP_CLIPort.pth \
    --num_preds 29 \
    --num_seqs 10
[Example outputs omitted. Generating figures with `src/06_generate_figs_predictor.py` should produce animations of the predicted video sequences.]

Related Works

If you find our work interesting, you may also want to check out our related works:

  • OCVP: Object-Centric Video Prediction via decoupling object dynamics and interactions.
  • SOLD: Model-based reinforcement learning with object-centric representations.
  • PlaySlot: Learning inverse dynamics for controllable object-centric video prediction and planning.

Contact and Citation

This repository is maintained by Angel Villar-Corrales and Gjergj Plepi.

Please consider citing our paper if you find our work or repository helpful.

@article{villar_TextOCVP_2025,
  title={Object-Centric Image to Video Generation with Language Guidance},
  author={Villar-Corrales, Angel and Plepi, Gjergj and Behnke, Sven},
  journal={arXiv preprint arXiv:2502.11655},
  year={2025}
}

In case of any questions or problems regarding the project or repository, do not hesitate to contact the authors at villar@ais.uni-bonn.de.
