ROCKET is a multi-layer representation alignment framework that injects 3D spatial reasoning from a strong vision foundation model (VGGT) into 2D-pretrained VLA models. It features:
- Multi-layer alignment — leverages spatial cues across multiple depths instead of a single layer
- Shared projector — a single projector shared across layers to reduce gradient interference
- Matryoshka-style sparse activation — progressively increases projector capacity from shallow to deep layers
ROCKET achieves 98.5% average success rate on LIBERO with only ~4% of the compute budget of prior SOTA methods.
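As a rough illustration of the last three points, here is a minimal NumPy sketch of one projector shared across layer pairs with Matryoshka-style width growth. All names, shapes, and the linear growth schedule are assumptions for illustration only, not ROCKET's actual implementation:

```python
import numpy as np

def shared_projector(h, W1, W2, level, num_levels=10):
    """One two-layer MLP (W1, W2) shared by all aligned layer pairs.

    Matryoshka-style sparse activation: a shallow level activates only a
    prefix of the hidden width; deeper levels use progressively more of it.
    Illustrative sketch only -- not the repo's actual code.
    """
    hidden = W1.shape[1]
    width = hidden * (level + 1) // num_levels  # width grows shallow -> deep
    z = np.maximum(h @ W1, 0.0)                 # ReLU
    z[..., width:] = 0.0                        # zero out the inactive suffix
    return z @ W2                               # project into teacher feature space

rng = np.random.default_rng(0)
vla_dim, hidden_dim, vggt_dim = 64, 32, 48      # toy dimensions
W1 = rng.standard_normal((vla_dim, hidden_dim))
W2 = rng.standard_normal((hidden_dim, vggt_dim))
h = rng.standard_normal((4, vla_dim))           # 4 toy VLA tokens

out_shallow = shared_projector(h, W1, W2, level=0)  # uses ~1/10 of the width
out_deep = shared_projector(h, W1, W2, level=9)     # uses the full width
```

Sharing one set of weights across depths is what reduces per-layer gradient interference; the width schedule lets shallow layers align through a lower-capacity slice of the same projector.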
```
ROCKET-VLA/
├── openvla-ROCKET/          # OpenVLA-7B backbone (simulation: LIBERO, LIBERO-Plus)
│   ├── vla-scripts/         # Core Python scripts (training, profiling, deploy)
│   ├── ROCKET-VLA_scripts/  # Bash scripts for training & analysis
│   │   ├── training_scripts/  # Training: ROCKET, ablations, baseline
│   │   └── profile_scripts/   # CKA, gradient, projector similarity, layer importance
│   ├── prismatic/           # Model architecture & training utilities
│   └── vggt/                # VGGT teacher model
└── openpi-ROCKET/           # PI0 / PI0.5 backbone (real-world: RoboTwin 2.0, ALOHA)
```
```bash
cd openvla-ROCKET

# Initialize git submodules (LIBERO + transformers-openvla-oft)
git submodule update --init --recursive

conda create -n rocket python=3.10.16 -y
conda activate rocket

pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0
pip install -e .

# Flash Attention 2 (required for training)
pip install packaging ninja
pip install "flash-attn==2.5.5" --no-build-isolation
```

Install the LIBERO benchmark (pulled as a git submodule), then download datasets (~10 GB):
```bash
pip install -e LIBERO
pip install -r experiments/robot/libero/libero_requirements.txt

# Download LIBERO datasets (Spatial, Object, Goal, 10)
git clone git@hf.co:datasets/openvla/modified_libero_rlds ./data/libero
```

Download pretrained models:
```bash
mkdir -p ckpts
# OpenVLA-7B: https://huggingface.co/openvla/openvla-7b
# VGGT-1B: https://huggingface.co/facebook/VGGT-1B/blob/main/model.pt
```

Expected directory structure:
```
openvla-ROCKET/
├── ckpts/
│   ├── openvla-7b/          # OpenVLA-7B weights
│   └── VGGT-1B/model.pt     # VGGT-1B checkpoint
├── data/
│   └── libero/
│       ├── libero_spatial_no_noops/
│       ├── libero_object_no_noops/
│       ├── libero_goal_no_noops/
│       └── libero_10_no_noops/
```
Full ROCKET (shared projector + Matryoshka, 10 layer pairs):

```bash
bash ROCKET-VLA_scripts/training_scripts/run_align10_rocket.sh
```

We provide pre-configured scripts for each ablation variant:
| Script | Method | Paper Reference |
|---|---|---|
| `training_scripts/run_align10_rocket.sh` | Full ROCKET (shared + Matryoshka) | Table 8 "+Matryoshka" |
| `training_scripts/run_align10_shared.sh` | Shared projector only | Table 8 "+Shared" |
| `training_scripts/run_align10_naive_multi_layers.sh` | Independent projectors | Table 8 "+Multi-layer" |
| `training_scripts/run_align1_spatial_forcing.sh` | Single-layer alignment | Spatial Forcing reproduction |
| `training_scripts/run_align0_baseline.sh` | Baseline (no alignment) | Table 8 "Baseline" |
All scripts are in ROCKET-VLA_scripts/ and call the same underlying vla-scripts/finetune_rocket.py with different configurations.
Key differences between scripts (all other parameters are shared):
| Script | `--align_loss_coeff` | `--share_projector` | `--use_matryoshka` | `--vla/vggt_layers_align` |
|---|---|---|---|---|
| `run_align10_rocket.sh` | 0.5 | True | True | 10 pairs |
| `run_align10_shared.sh` | 0.5 | True | False | 10 pairs |
| `run_align10_naive_multi_layers.sh` | 0.5 | False | False | 10 pairs |
| `run_align1_spatial_forcing.sh` | 0.5 | False | False | "24" / "-1" |
| `run_align0_baseline.sh` | 0 | False | False | 10 pairs |
- `--use_matryoshka True`: Matryoshka-style width allocation (shallow layers use fewer params); `--projector_shallow_to_deep_increase` controls the direction.
- `--use_matryoshka False`: all layers use the full `hidden_dim`. Combined with `--share_projector True`, this is the "shared baseline" (Table 8 "+Shared").
- `--ensemble_size n`: initializes `n` projectors per layer pair and averages their losses. Default `1` for all paper experiments.
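The ensemble averaging behind `--ensemble_size` could be sketched as follows. This is a hypothetical toy (linear projectors, MSE alignment loss, made-up shapes), not the repo's actual implementation:

```python
import numpy as np

def ensemble_align_loss(h_vla, h_vggt, projectors):
    """Sketch of --ensemble_size n: n independently initialized projectors
    each produce an alignment loss, and the losses are averaged
    (n=1 in all paper experiments). Illustrative only."""
    losses = [np.mean((h_vla @ W - h_vggt) ** 2) for W in projectors]
    return float(np.mean(losses))

rng = np.random.default_rng(0)
h_vla = rng.standard_normal((4, 16))                       # toy VLA features
h_vggt = rng.standard_normal((4, 8))                       # toy VGGT targets
projs = [rng.standard_normal((16, 8)) for _ in range(3)]   # ensemble of 3
loss = ensemble_align_loss(h_vla, h_vggt, projs)
```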
All profiling scripts are in ROCKET-VLA_scripts/profile_scripts/:
CKA Similarity Analysis (Fig. 8, Appendix F):
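In miniature, the two steps below compute a pairwise linear-CKA matrix between VLA and VGGT layer features, then pick a maximizing one-to-one layer matching. The sketch uses toy features and exhaustive search over a 3x3 matrix (the repo's script uses the Hungarian algorithm on the full 33x24 matrix); all names and shapes here are illustrative:

```python
from itertools import permutations
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between feature matrices X (n, d1) and Y (n, d2)."""
    X = X - X.mean(0)  # center features over samples
    Y = Y - Y.mean(0)
    return np.linalg.norm(Y.T @ X) ** 2 / (
        np.linalg.norm(X.T @ X) * np.linalg.norm(Y.T @ Y))

rng = np.random.default_rng(0)
vla_feats = [rng.standard_normal((100, 16)) for _ in range(3)]   # 3 toy VLA layers
vggt_feats = [rng.standard_normal((100, 8)) for _ in range(3)]   # 3 toy VGGT layers

# Step 1 analogue: pairwise CKA similarity matrix
sim = np.array([[linear_cka(a, b) for b in vggt_feats] for a in vla_feats])

# Step 2 analogue: one-to-one matching that maximizes total CKA
best = max(permutations(range(3)),
           key=lambda p: sum(sim[i, p[i]] for i in range(3)))
```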
```bash
# Step 1: Compute pairwise CKA similarity between all VLA (0-32) and VGGT (0-23) layers
bash ROCKET-VLA_scripts/profile_scripts/cka_profiling/run_cka_profiling.sh

# Step 2: Find optimal layer alignment via Hungarian algorithm
python vla-scripts/analyze_profiling_results.py \
    profiling_results/<run_dir>/profiling_results.json \
    --output_dir profiling_results/analysis/
```

Gradient Conflict Profiling (Fig. 1, 13, 14):
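Gradient conflict here is simply the cosine between the task-loss and alignment-loss gradient vectors. A toy example with closed-form gradients (quadratic losses chosen purely for illustration, not the actual ROCKET losses):

```python
import numpy as np

def grad_cosine(g1, g2):
    """Cosine similarity between two flattened gradient vectors;
    values near -1 indicate strong gradient conflict."""
    return float(g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2)))

# Toy quadratic losses with closed-form gradients (illustration only):
#   task loss  L_t(x) = ||x - a||^2  ->  grad_t = 2 (x - a)
#   align loss L_a(x) = ||x - b||^2  ->  grad_a = 2 (x - b)
x = np.array([0.0, 0.0])
a = np.array([1.0, 0.0])   # task optimum
b = np.array([-1.0, 1.0])  # alignment optimum
g_task = 2 * (x - a)
g_align = 2 * (x - b)
print(round(grad_cosine(g_task, g_align), 3))  # -0.707: the two losses conflict
```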
```bash
# Profile cosine similarity between alignment and task loss gradients
bash ROCKET-VLA_scripts/profile_scripts/grad_conflict/run_gradient_profiling.sh
```

Projector Similarity Analysis (Fig. 7, Appendix E):
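A toy version of the quantity being compared: mean pairwise cosine similarity between the outputs of per-layer projectors. The data, shapes, and function name below are illustrative only; they approximate the idea, not the repo's analysis code:

```python
import numpy as np

def pairwise_cosine(outputs):
    """Mean pairwise cosine similarity across a list of projector outputs."""
    flat = [o.ravel() / np.linalg.norm(o) for o in outputs]
    n = len(flat)
    sims = [flat[i] @ flat[j] for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(sims))

rng = np.random.default_rng(0)
base = rng.standard_normal((16, 8))
# A shared projector tends to produce correlated per-layer outputs...
shared = [base + 0.1 * rng.standard_normal((16, 8)) for _ in range(5)]
# ...while naive independent projectors produce uncorrelated ones.
naive = [rng.standard_normal((16, 8)) for _ in range(5)]

print(pairwise_cosine(shared), pairwise_cosine(naive))
```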
```bash
# Compare projector output similarity across training for ROCKET / naive / shared
bash ROCKET-VLA_scripts/profile_scripts/projector_similarity/run_align10_rocket.sh
bash ROCKET-VLA_scripts/profile_scripts/projector_similarity/run_align10_naive10.sh
bash ROCKET-VLA_scripts/profile_scripts/projector_similarity/run_align10_share10.sh

# Visualize results (heatmaps + comparison curves)
python ROCKET-VLA_scripts/profile_scripts/projector_similarity/visualize_projector_similarity.py
```

Layer Importance (layer selection utilities):
| Script | Description |
|---|---|
| `layer_importance/run_layer_importance.sh` | Measure per-layer cosine similarity scores |
| `layer_importance/rank_layer_importance.py` | Rank layers by importance, select top/bottom-k |
| `layer_importance/compute_layer_selection.py` | Balanced layer selection algorithm (uniform spacing) |
| `layer_importance/parse_similarity_log.py` | Parse profiling logs to CSV |
Evaluate on LIBERO:

```bash
python experiments/robot/libero/run_libero_eval.py \
    --pretrained_checkpoint <ckpt_dir> \
    --task_suite_name libero_spatial
```

Change `--task_suite_name` to `libero_spatial`, `libero_object`, `libero_goal`, or `libero_10`.
We use `uv` to manage dependencies:

```bash
cd openpi-ROCKET
GIT_LFS_SKIP_SMUDGE=1 uv sync
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e .
cp -r ./src/openpi/models_pytorch/transformers_replace/* .venv/lib/python3.11/site-packages/transformers/
source .venv/bin/activate
```

General usage pattern:
```bash
# Single GPU
uv run scripts/<train_script> <config_name> --exp_name <run_name>

# Multi-GPU (DDP, PyTorch only)
uv run torchrun --standalone --nnodes=1 --nproc_per_node=4 \
    scripts/<train_script> <config_name> --exp_name <run_name>

# Resume from latest checkpoint
uv run torchrun --standalone --nnodes=1 --nproc_per_node=4 \
    scripts/<train_script> <config_name> --exp_name <run_name> --resume
```

All configs are defined in `src/openpi/training/config.py`. See `openpi-ROCKET/README.md` for full details.
LIBERO (PI0.5, full fine-tuning, PyTorch):
| Config | Method | Script | Notes |
|---|---|---|---|
| `pi05_libero_align10_rocket_64bsz` | ROCKET | `train_pytorch_ROCKET.py` | 10 layer pairs, shared projector |
| `pi05_libero_align1_spatial_forcing_64bsz` | Spatial Forcing | `train_pytorch_ROCKET.py` | Single layer (VLA=12, VGGT=-1) |
| `pi05_libero_align0_baseline_64bsz` | Baseline | `train_pytorch.py` | No VGGT, no alignment |
```bash
# Example: ROCKET on LIBERO
uv run torchrun --standalone --nnodes=1 --nproc_per_node=4 \
    scripts/train_pytorch_ROCKET.py pi05_libero_align10_rocket_64bsz \
    --exp_name rocket_libero

# Example: Baseline on LIBERO (no VGGT needed)
uv run torchrun --standalone --nnodes=1 --nproc_per_node=4 \
    scripts/train_pytorch.py pi05_libero_align0_baseline_64bsz \
    --exp_name baseline_libero
```

RoboTwin 2.0 (PI0, LoRA, JAX):
| Config | Method | Script | Notes |
|---|---|---|---|
| `pi0_base_aloha_robotwin_lora_rocket_MPA` | ROCKET | `train_jax_ROCKET.py` | `align_loss_coeff=0.125` |
| `pi0_base_aloha_robotwin_lora_spatial_forcing_MPA` | Spatial Forcing | `train_jax_ROCKET.py` | Single layer |
| `pi0_base_aloha_robotwin_lora_baseline_MPA` | Baseline | `train.py` | No VGGT |
```bash
# Example: ROCKET on RoboTwin
uv run scripts/train_jax_ROCKET.py pi0_base_aloha_robotwin_lora_rocket_MPA \
    --exp_name rocket_robotwin

# Example: Baseline on RoboTwin (no VGGT needed)
uv run scripts/train.py pi0_base_aloha_robotwin_lora_baseline_MPA \
    --exp_name baseline_robotwin
```

Real-Robot (PI0.5, full fine-tuning, JAX):
| Config | Method | Script |
|---|---|---|
| `pi05_0312_250_3_15_74_resize_chunksize10_rocket` | ROCKET | `train_jax_ROCKET.py` |
| `pi05_0312_250_3_15_74_resize_chunksize10_spatial_forcing` | Spatial Forcing | `train_jax_ROCKET.py` |
| `pi05_0312_250_3_15_74_resize_chunksize10_baseline` | Baseline | `train.py` |
```bash
uv run scripts/train_jax_ROCKET.py pi05_0312_250_3_15_74_resize_chunksize10_rocket \
    --exp_name rocket_real
```

Real-robot configs reference a local dataset. Replace `repo_id` in `config.py` with your own.
```bash
# Start model server
uv run scripts/serve_policy.py policy:checkpoint \
    --policy.config=<config_name> \
    --policy.dir=checkpoints/<config_name>/<run_name>/<step>

# Run client
uv run examples/simple_client/main.py --env ALOHA
```

| Method | Spatial | Object | Goal | Long | Avg. |
|---|---|---|---|---|---|
| Spatial Forcing | 99.4 | 99.6 | 98.8 | 96.6 | 98.5 |
| ROCKET (Ours) | 98.2 | 99.8 | 98.8 | 97.0 | 98.5 |
ROCKET matches SOTA with ~4% of the compute budget:
| Method | Avg. | Cost | Details |
|---|---|---|---|
| OpenVLA-OFT | 97.1 | 25.6x | 4 x 64 x 150k |
| Spatial Forcing | 98.5 | 24.0x | 4 x 64 x 150k |
| GeoVLA | 97.7 | 3.2x | 1 x 256 x 20k |
| ROCKET | 98.5 | 1.0x | 1 x 32 x 50k |
| Method | Spatial | Object | Goal | Long | Avg. |
|---|---|---|---|---|---|
| Baseline | 96.9 | 98.7 | 97.6 | 94.8 | 96.4 |
| + Multi-layer (naive) | 93.6 | 99.2 | 42.2 | 85.0 | 80.0 |
| + Shared projector | 99.0 | 99.8 | 97.0 | 96.8 | 98.2 |
| + Matryoshka (full ROCKET) | 98.2 | 99.8 | 98.8 | 97.0 | 98.5 |
| Method | Spatial | Object | Goal | Long | Avg. |
|---|---|---|---|---|---|
| Baseline | 96.4 | 98.2 | 95.0 | 82.2 | 93.0 |
| Spatial Forcing | 97.8 | 97.8 | 94.4 | 85.8 | 94.0 |
| ROCKET | 96.4 | 98.8 | 96.6 | 89.2 | 95.3 |
ROCKET achieves the best average success rate (81.7%) across 7 perturbation types, with the strongest gains under Robot and Layout shifts (spatial geometry perturbations).
Evaluated on 5 bimanual tasks using ALOHA assets under Easy and Hard settings. ROCKET achieves a clear advantage in Easy and competitive performance in Hard.
```bibtex
@article{sun2026rocket,
  title={ROCKET: Residual-Oriented Multi-Layer Alignment for Spatially-Aware Vision-Language-Action Models},
  author={Sun, Guoheng and Du, Tingting and Feng, Kaixi and Luo, Chenxiang and Ding, Xingguo and Shen, Zheyu and Wang, Ziyao and He, Yexiao and Li, Ang},
  journal={arXiv preprint arXiv:2602.17951},
  year={2026}
}
```

This codebase builds upon Spatial Forcing, OpenVLA, OpenVLA-OFT, and OpenPI. We thank the authors for their excellent work.
