XR-1: Towards Versatile Vision-Language-Action Models via Learning Unified Vision-Motion Representations
A versatile and scalable vision-language-action framework: XR-1 supports robust multi-task learning across diverse robot embodiments and environments.
Shichao Fan1,*, Kun Wu1,*, Zhengping Che1,*,†, Xinhua Wang1, Di Wu1,4, Fei Liao1, Ning Liu1, Yixue Zhang1, Zhen Zhao1, Zhiyuan Xu1, Meng Li1, Qingjie Liu3, Shanghang Zhang4, Min Wan2, Jian Tang1,✉
1Beijing Innovation Center of Humanoid Robotics, 2School of Mechanical Engineering and Automation, Beihang University, 3State Key Laboratory of Virtual Reality Technology and Systems, SCSE, Beihang University, 4State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
*Co-first authors, †Project leader, ✉Corresponding author
[📖 Document] [🚀 Quick Start] [🤗 Models] [🤖 Deployment] [✅ Performance] [🙋 FAQs]
- Release pre-training / fine-tuning code for the XR-1 series.
- Release pre-trained models and a heterogeneous dataset sample of XR-1 on both HuggingFace and ModelScope.
- Release a real-world deployment sample of XR-1.
This repository is built upon a fork of LeRobot. Because LeRobot evolves rapidly, our implementation is pinned to the LeRobot dataset v2.1 format. We have preserved the original directory structure to facilitate further development and integration by the community.
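If you are unsure whether an existing dataset already matches this version, a quick metadata check can help. The snippet below is a minimal sketch that assumes the standard LeRobot v2.x on-disk layout, where meta/info.json records a codebase_version field; the dataset path is a placeholder.

import json
from pathlib import Path

# Placeholder path: point this at the root of your LeRobot-format dataset.
dataset_root = Path("~/.cache/huggingface/lerobot/your-org/your_dataset").expanduser()

# Assumption: LeRobot v2.x datasets store their format version in meta/info.json.
info = json.loads((dataset_root / "meta" / "info.json").read_text())
version = info.get("codebase_version")
print(f"codebase_version: {version}")
if version != "v2.1":
    print("Warning: XR-1 expects LeRobot dataset v2.1; consider converting (e.g., with any4lerobot).")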
Download our source code:
git clone https://github.com/Open-X-Humanoid/XR-1.git
cd XR-1
Create a virtual environment with Python 3.10 and activate it (e.g., with miniconda), then install the dependencies:
conda create -y -n xr1 python=3.10
conda activate xr1
pip install -e ".[xr1]"
- Format Compatibility: Since our environment relies on LeRobot Dataset v2.1, we recommend using any4lerobot to convert your data to this standard.
- Sample Data: We provide a heterogeneous dataset sample (including EGO4D and robot data such as TienKung2/UR/Franka), available at X-Humanoid/XR-1-Dataset-Sample. You can download it with scripts/hf_xr1_dataset_sample_download.sh or scripts/modelscope_xr1_dataset_sample_download.sh.
- Unified Dataloader: We have designed a powerful dataloader that unifies heterogeneous data sources and embodiments, making pre-training extremely simple. The implementation is in examples/xr1_cross_dataset_and_embodiment_dataloader.py (a minimal usage sketch follows this list). Key enhancements over the original LeRobot dataloader:
- Unified Data Loading: Seamlessly reads data from diverse sources and embodiments.
- Multi-Task Support: Compatible with heterogeneous multi-task learning.
- Few-Shot Capabilities: Supports training with small sample sizes.
- Extensibility: Easily adaptable to new formats (e.g., non-LeRobot formats like Ego4D) with minimal development.
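As a rough illustration of the idea (not the repository's actual dataloader), the sketch below concatenates several LeRobot-format datasets into a single PyTorch loader. It assumes a lerobot install aligned with dataset v2.1, and the repo ids are placeholders; for the full cross-embodiment handling, see examples/xr1_cross_dataset_and_embodiment_dataloader.py.

# Minimal sketch, not the repository's dataloader: mix several LeRobot v2.1
# datasets into one PyTorch DataLoader. The repo ids below are placeholders.
from torch.utils.data import ConcatDataset, DataLoader
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

repo_ids = ["your-org/franka_pick_place", "your-org/franka_stack_cups"]  # placeholders

# Load each dataset and concatenate them so one loader mixes their samples.
# The default collate only works when the datasets share the same feature keys;
# unifying truly heterogeneous schemas and embodiments is what
# examples/xr1_cross_dataset_and_embodiment_dataloader.py adds on top.
datasets = [LeRobotDataset(repo_id) for repo_id in repo_ids]
loader = DataLoader(ConcatDataset(datasets), batch_size=32, shuffle=True, num_workers=4)

for batch in loader:
    # Each batch is a dict of tensors (camera frames, robot states, actions, ...).
    break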
To set up the model environment, first download the foundation models (e.g., SigLIP, PaliGemma) by running:
# Huggingface
bash scripts/hf_download.sh
Then, to obtain the XR-1-Stage1-UVMC and XR-1-Stage2-Pretrain models for fine-tuning, run:
# Huggingface
bash scripts/hf_xr1_pretrain_model_download.sh
# Or ModelScope
bash modelscope_xr1_pretrain_model_download.sh
We provide three training paths depending on your data and performance requirements:
If you need to quickly adapt the model to a new task or robot, you can fine-tune only Stage 3. This is the fastest way to obtain a deployable model:
# Debug Mode (For testing configurations):
bash scripts/xr1_stage3_finetune.sh --debug
# Standard Training (Default):
bash scripts/xr1_stage3_finetune.sh
For custom datasets where you aim for optimal performance, we strongly recommend fine-tuning all three stages (Stage 1, 2, and 3) sequentially to better align the representations with your specific data:
# Full fine-tuning: Stage 1, Stage 2 & Stage 3
bash scripts/xr1_stage1_finetune.sh
bash scripts/xr1_stage2_finetune.sh
bash scripts/xr1_stage3_finetune.sh # optional
Our framework fully supports pre-training if you have access to large-scale, heterogeneous datasets across diverse embodiments and environments:
# Pre-training Stage1 & Stage2
bash scripts/xr1_stage1_pretrain.sh
bash scripts/xr1_stage2_pretrain.sh
We provide a streamlined workflow to deploy and verify XR-1 on various robotic platforms, including Franka, UR, and AgileX. The following example demonstrates the process using a dual-arm Franka robot:
# 1. Perform Fast Fine-tuning to train a specific Stage 3 model
# Franka
bash scripts/xr1_stage3_finetune.sh --debug --dataset XR_1_DATASET_DUAL_ARM_FRANKA
# Or Tienkung2
bash scripts/xr1_stage3_finetune.sh --debug --dataset XR_1_DATASET_DUAL_ARM_TIEN_KUNG2
# 2. Execute the deployment script
python deploy/real_robot/xr1_deploy.py
For deployment on TienKung 2.0, we recommend referring to the x-humanoid-training-toolchain for specialized instructions.
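For intuition about what the deployment script does at run time, here is a conceptual sketch of a closed-loop rollout. It is not the code in deploy/real_robot/xr1_deploy.py; the robot I/O and policy functions are hypothetical stubs you would replace with your platform's interfaces.

import time

CONTROL_HZ = 10  # assumed control rate; adjust for your platform


def get_observation():
    """Hypothetical stub: return camera images + proprioceptive state from the robot."""
    raise NotImplementedError


def send_action(action):
    """Hypothetical stub: send one action (e.g., joint targets) to the robot."""
    raise NotImplementedError


def predict_action_chunk(observation, instruction):
    """Hypothetical stub: run the fine-tuned Stage 3 policy and return a short action chunk."""
    raise NotImplementedError


def rollout(instruction, max_steps=500):
    steps = 0
    while steps < max_steps:
        obs = get_observation()                          # 1. observe
        chunk = predict_action_chunk(obs, instruction)   # 2. infer an action chunk
        for action in chunk:                             # 3. execute at the control rate
            send_action(action)
            time.sleep(1.0 / CONTROL_HZ)
            steps += 1
            if steps >= max_steps:
                return


if __name__ == "__main__":
    rollout("pick up the red cube and place it in the box")  # example instruction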
Supported robot platforms: Dual-Arm UR-5e | Tien Kung 2.0 | Tien Kung 1.0 | Dual-Arm Franka | AgileX Cobot Magic V2.0 | Single-Arm UR-5e
If you encounter any issues, feel free to open an issue on GitHub or reach out through discussions. We appreciate your feedback and contributions! 🚀
This project is released under the Apache License. Parts of this project contain code and models from other sources, which are subject to their respective licenses.
If you find this project useful in your research, please consider citing:
@article{fan2025xr,
title={XR-1: Towards Versatile Vision-Language-Action Models via Learning Unified Vision-Motion Representations},
author={Fan, Shichao and Wu, Kun and Che, Zhengping and Wang, Xinhua and Wu, Di and Liao, Fei and Liu, Ning and Zhang, Yixue and Zhao, Zhen and Xu, Zhiyuan and others},
journal={arXiv preprint arXiv:2511.02776},
year={2025}
}
XR-1 is built with reference to the code of the following projects: LeRobot, Moto, QueST, and Pi0. Thanks for their awesome work!
If you're interested in XR-1, you are welcome to join our WeChat group for discussions.