XR-1: Towards Versatile Vision-Language-Action Models via Learning Unified Vision-Motion Representations
A versatile and scalable vision-language-action framework: XR-1 supports robust multi-task learning across diverse robot embodiments and environments.
Shichao Fan1,*, Kun Wu1,*, Zhengping Che1,*,†, Xinhua Wang1, Di Wu1,4, Fei Liao1, Ning Liu1, Yixue Zhang1, Zhen Zhao1, Zhiyuan Xu1, Meng Li1, Qingjie Liu3, Shanghang Zhang4, Min Wan2, Jian Tang1,✉
1Beijing Innovation Center of Humanoid Robotics, 2School of Mechanical Engineering and Automation, Beihang University, 3State Key Laboratory of Virtual Reality Technology and Systems, SCSE, Beihang University, 4State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
*Co-first authors, †Project leader, ✉Corresponding author
[📖 Document] [🚀 Quick Start] [🤗 Models] [🤖 Deployment] [✅ Performance] [🙋 FAQs]
- Release pre-training / fine-tuning code for the XR-1 series.
- Release pre-trained models and a heterogeneous dataset sample of XR-1 on both HuggingFace and ModelScope.
- Release a real-world deployment sample of XR-1.
This repository is built upon a fork of LeRobot. Because LeRobot evolves rapidly, our implementation is pinned to the LeRobot dataset v2.1 format. We have preserved the original directory structure to facilitate further development and integration by the community.
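If you are unsure whether an existing dataset already matches this version, a quick metadata check can help. The snippet below is a minimal sketch that assumes the standard LeRobot v2.x on-disk layout, where meta/info.json records a codebase_version field; the dataset path is a placeholder.

import json
from pathlib import Path

# Placeholder path: point this at the root of your LeRobot-format dataset.
dataset_root = Path("~/.cache/huggingface/lerobot/your-org/your_dataset").expanduser()

# Assumption: LeRobot v2.x datasets store their format version in meta/info.json.
info = json.loads((dataset_root / "meta" / "info.json").read_text())
version = info.get("codebase_version")
print(f"codebase_version: {version}")
if version != "v2.1":
    print("Warning: XR-1 expects LeRobot dataset v2.1; consider converting (e.g., with any4lerobot).")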
Download our source code:
git clone https://github.com/Open-X-Humanoid/XR-1.git
cd XR-1
Create a virtual environment with Python 3.10 and activate it (e.g., with miniconda), then install the dependencies:
conda create -y -n xr1 python=3.10
conda activate xr1
pip install -e ".[xr1]"
- Format Compatibility: Since our environment relies on LeRobot Dataset v2.1, we recommend using any4lerobot to convert your data to this standard.
- Sample Data: We provide a heterogeneous dataset sample (including EGO4D and robot data such as TienKung2/UR/Franka), available at X-Humanoid/XR-1-Dataset-Sample. You can download it with scripts/hf_xr1_dataset_sample_download.sh or scripts/modelscope_xr1_dataset_sample_download.sh.
- Unified Dataloader: We have designed a powerful dataloader that unifies heterogeneous data sources and embodiments, making pre-training extremely simple. The implementation is in examples/xr1_cross_dataset_and_embodiment_dataloader.py (a minimal usage sketch follows this list). Key enhancements over the original LeRobot dataloader:
- Unified Data Loading: Seamlessly reads data from diverse sources and embodiments.
- Multi-Task Support: Compatible with heterogeneous multi-task learning.
- Few-Shot Capabilities: Supports training with small sample sizes.
- Extensibility: Easily adaptable to new formats (e.g., non-LeRobot formats like Ego4D) with minimal development.
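As a rough illustration of the idea (not the repository's actual dataloader), the sketch below concatenates several LeRobot-format datasets into a single PyTorch loader. It assumes a lerobot install aligned with dataset v2.1, and the repo ids are placeholders; for the full cross-embodiment handling, see examples/xr1_cross_dataset_and_embodiment_dataloader.py.

# Minimal sketch, not the repository's dataloader: mix several LeRobot v2.1
# datasets into one PyTorch DataLoader. The repo ids below are placeholders.
from torch.utils.data import ConcatDataset, DataLoader
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

repo_ids = ["your-org/franka_pick_place", "your-org/franka_stack_cups"]  # placeholders

# Load each dataset and concatenate them so one loader mixes their samples.
# The default collate only works when the datasets share the same feature keys;
# unifying truly heterogeneous schemas and embodiments is what
# examples/xr1_cross_dataset_and_embodiment_dataloader.py adds on top.
datasets = [LeRobotDataset(repo_id) for repo_id in repo_ids]
loader = DataLoader(ConcatDataset(datasets), batch_size=32, shuffle=True, num_workers=4)

for batch in loader:
    # Each batch is a dict of tensors (camera frames, robot states, actions, ...).
    break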
To set up the model environment, first download the foundation models (e.g., SigLIP, PaliGemma) by running:
# Huggingface
bash scripts/hf_download.sh
Then, to obtain the XR-1-Stage1-UVMC and XR-1-Stage2-Pretrain models for fine-tuning, run:
# Huggingface
bash scripts/hf_xr1_pretrain_model_download.sh
# Or ModelScope
bash modelscope_xr1_pretrain_model_download.sh
We provide three training paths depending on your data and performance requirements:
If you need to quickly adapt the model to a new task or robot, you can fine-tune only Stage 3. This is the fastest way to obtain a deployable model:
# Debug Mode (For testing configurations):
bash scripts/xr1_stage3_finetune.sh --debug
# Standard Training (Default):
bash scripts/xr1_stage3_finetune.sh
For custom datasets where you aim for optimal performance, we strongly recommend fine-tuning all three stages (Stage 1, 2, and 3) sequentially to better align the representations with your specific data:
# Full fine-tuning: Stage 1, Stage 2 & Stage 3
bash scripts/xr1_stage1_finetune.sh
bash scripts/xr1_stage2_finetune.sh
bash scripts/xr1_stage3_finetune.sh # optional
Our framework fully supports pre-training if you have access to large-scale, heterogeneous datasets across diverse embodiments and environments:
# Pre-training Stage1 & Stage2
bash scripts/xr1_stage1_pretrain.sh
bash scripts/xr1_stage2_pretrain.sh
We provide a streamlined workflow to deploy and verify XR-1 on various robotic platforms, including Franka, UR, and AgileX. The following example demonstrates the process using a dual-arm Franka robot:
# 1. Perform Fast Fine-tuning to train a specific Stage 3 model
# Franka
bash scripts/xr1_stage3_finetune.sh --debug --dataset XR_1_DATASET_DUAL_ARM_FRANKA
# Or Tienkung2
bash scripts/xr1_stage3_finetune.sh --debug --dataset XR_1_DATASET_DUAL_ARM_TIEN_KUNG2
# 2. Execute the deployment script
python deploy/real_robot/xr1_deploy.py
For deployment on TienKung 2.0, we recommend referring to the x-humanoid-training-toolchain for specialized instructions.
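For intuition about what the deployment script does at run time, here is a conceptual sketch of a closed-loop rollout. It is not the code in deploy/real_robot/xr1_deploy.py; the robot I/O and policy functions are hypothetical stubs you would replace with your platform's interfaces.

import time

CONTROL_HZ = 10  # assumed control rate; adjust for your platform


def get_observation():
    """Hypothetical stub: return camera images + proprioceptive state from the robot."""
    raise NotImplementedError


def send_action(action):
    """Hypothetical stub: send one action (e.g., joint targets) to the robot."""
    raise NotImplementedError


def predict_action_chunk(observation, instruction):
    """Hypothetical stub: run the fine-tuned Stage 3 policy and return a short action chunk."""
    raise NotImplementedError


def rollout(instruction, max_steps=500):
    steps = 0
    while steps < max_steps:
        obs = get_observation()                          # 1. observe
        chunk = predict_action_chunk(obs, instruction)   # 2. infer an action chunk
        for action in chunk:                             # 3. execute at the control rate
            send_action(action)
            time.sleep(1.0 / CONTROL_HZ)
            steps += 1
            if steps >= max_steps:
                return


if __name__ == "__main__":
    rollout("pick up the red cube and place it in the box")  # example instruction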
Supported robot platforms: Dual-Arm UR-5e | Tien Kung 2.0 | Tien Kung 1.0 | Dual-Arm Franka | AgileX Cobot Magic V2.0 | Single-Arm UR-5e
If you encounter any issues, feel free to open an issue on GitHub or reach out through discussions. We appreciate your feedback and contributions! 🚀
This project is released under the Apache License. Parts of this project contain code and models from other sources, which are subject to their respective licenses.
If you find this project useful in your research, please consider citing:
@article{fan2025xr,
title={XR-1: Towards Versatile Vision-Language-Action Models via Learning Unified Vision-Motion Representations},
author={Fan, Shichao and Wu, Kun and Che, Zhengping and Wang, Xinhua and Wu, Di and Liao, Fei and Liu, Ning and Zhang, Yixue and Zhao, Zhen and Xu, Zhiyuan and others},
journal={arXiv preprint arXiv:2511.02776},
year={2025}
}
XR-1 is built with reference to the code of the following projects: LeRobot, Moto, QueST, and Pi0. Thanks for their awesome work!
If you're interested in XR-1, you are welcome to join our WeChat group for discussions.