Emergence of Human to Robot Transfer in Vision-Language-Action Models

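01 Motivation

Training robots requires large amounts of specialized teleoperation data, which is expensive and hard to scale. The internet holds a vast supply of videos of humans manipulating objects, which could in principle greatly increase training-data diversity. But humans and robots differ substantially in embodiment, viewpoint, and motion, so using this data directly is not straightforward.

Prior work has tried to bridge this gap with explicit cross-embodiment alignment (for example video prediction or embodiment-embedding alignment), with unstable results. This paper takes a different view: is this transfer ability an emergent property of diverse, large-scale VLA pretraining, rather than something that requires a purpose-built bridging mechanism?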

"Human-to-robot transfer is an emergent property of diverse VLA pretraining."

Scaling and transfer results across pretraining diversity. **Figure 2 (original):** "Per-task improvement from human data: We plot the difference in performance between policies fine-tuned with robot + human data versus robot-only data, isolating the lift from human supervision. Gains are largest when pre-training spans diverse tasks, scenes, and embodiments, suggesting that broad pre-training improves transfer from human videos." As pretraining diversity increases from 0% to 100%, the gain from human data rises from near zero to clearly positive, which supports the emergence hypothesis.

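+39% Spice Rack task success rate, absolute gain (32% → 71%)

+25% Dresser task success rate, absolute gain (25% → 50%)

+21% Sort Eggs accuracy, absolute gain (57% → 78%)

14 h Total duration of human demonstration video (4 tasks)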

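02 Method

Starting from the π0.5 (pi-0.5) VLA, the paper fine-tunes on a 50-50 mix of human demonstrations and robot data, predicting both high-level subtasks and low-level continuous actions, without any explicit cross-embodiment alignment mechanism.

Architecture: co-training pipeline. **System architecture:** A human operator wears a head-mounted camera (and optional wrist cameras). 3D hand keypoints are extracted from the captured egocentric video and converted into relative 6-DoF end-effector trajectories, which are used together with robot data to fine-tune the π0.5 VLA (π0.5+ego).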

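Data collection: human demonstrations

Operators wear a head-mounted camera (mimicking the robot's main camera viewpoint) and left and right wrist cameras, collecting about 14 hours of human demonstrations across four tasks:

Bussing (clearing tableware): 3 hours, mobile manipulation task
Spice Rack (organizing spices): 3 hours, tabletop manipulation task
Dresser (organizing a dresser): 3 hours, tabletop manipulation task
Sort Eggs (sorting eggs): 5 hours, tabletop manipulation task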

Data collection setup and task overview. **Figure 3 (original):** "Training mixture and benchmark. Our fine-tuning mix is evenly split between human data for generalization tasks and robot data for the nearest neighbor task." The left side shows a human wearing the capture rig; the right side shows the robot executing the four tasks. Human and robot data each make up 50% of the fine-tuning data.

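Action space alignment

Human 3D hand keypoints are converted into relative 6-DoF end-effector trajectories with the same dimensionality as the robot action space (18 dimensions: 6 per arm plus 6 for base motion). The human gripper state is not estimated; gripper actions are learned from robot data only.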
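To make the idea of a "relative 6-DoF trajectory" concrete, here is a minimal sketch. It assumes that an upstream step (not shown, and not described in this summary) has already turned hand keypoints into a per-frame wrist pose, given as a position and a rotation matrix. Expressing each step relative to the previous pose is also an assumption; the summary does not say which reference frame the paper uses. The function names are illustrative.

```python
import numpy as np

def rotation_to_rotvec(R):
    """Convert a 3x3 rotation matrix to an axis-angle rotation vector.

    Not robust for rotations close to 180 degrees; adequate for the small
    frame-to-frame motions this sketch is meant for.
    """
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-8:
        return np.zeros(3)
    axis = np.array([R[2, 1] - R[1, 2],
                     R[0, 2] - R[2, 0],
                     R[1, 0] - R[0, 1]]) / (2.0 * np.sin(theta))
    return axis * theta

def relative_6dof(positions, rotations):
    """Turn absolute poses into per-step relative 6-DoF deltas.

    positions: (T, 3) array, rotations: (T, 3, 3) array (assumed inputs).
    Returns a (T-1, 6) array: [dx, dy, dz, rx, ry, rz], each delta
    expressed in the frame of the previous pose.
    """
    deltas = []
    for t in range(len(positions) - 1):
        R_prev = rotations[t]
        dp = R_prev.T @ (positions[t + 1] - positions[t])  # translation in previous frame
        dR = R_prev.T @ rotations[t + 1]                   # rotation from previous to next
        deltas.append(np.concatenate([dp, rotation_to_rotvec(dR)]))
    return np.array(deltas)

# Toy trajectory: move along x while turning 90 degrees about z, then move forward.
Rz90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
positions = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.1, 0.1, 0.0]])
rotations = np.stack([np.eye(3), Rz90, Rz90])
print(relative_6dof(positions, rotations))
```

Because the deltas are expressed relative to the hand or end-effector itself, the same representation can describe a human wrist and a robot gripper without reference to either body's global frame.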

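Co-training strategy

On top of π0.5, the model jointly predicts the high-level subtask (a language description of the subtask) and the low-level continuous action, with the mixing ratio fixed at robot:human = 50:50. The study specifically examines how pretraining diversity (0% to 100%) affects transfer, in order to test the emergence hypothesis.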
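One simple way to realize a fixed 50:50 mixture is to draw half of every batch from each source, regardless of how large the two datasets are. The sketch below illustrates that idea only; whether the paper balances per batch or over the whole fine-tuning set is not stated in this summary, and `sample_cotraining_batch` and its parameters are hypothetical names.

```python
import random

def sample_cotraining_batch(robot_episodes, human_episodes, batch_size,
                            human_fraction=0.5, rng=None):
    """Draw one batch with a fixed share of human samples (assumed scheme).

    Sampling is with replacement, so the ratio holds even when the two
    datasets have very different sizes.
    """
    rng = rng or random.Random()
    n_human = round(batch_size * human_fraction)
    n_robot = batch_size - n_human
    batch = [("human", rng.choice(human_episodes)) for _ in range(n_human)]
    batch += [("robot", rng.choice(robot_episodes)) for _ in range(n_robot)]
    rng.shuffle(batch)  # avoid a fixed human-then-robot ordering inside the batch
    return batch

# Toy usage with placeholder episode IDs.
robot = [f"robot_ep_{i}" for i in range(1000)]
human = [f"human_ep_{i}" for i in range(40)]
batch = sample_cotraining_batch(robot, human, batch_size=8, rng=random.Random(0))
print(sum(1 for source, _ in batch if source == "human"), "human samples of", len(batch))
```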

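03 Experiments

Co-training (Robot + Human) is compared against fine-tuning on robot data alone (Robot only) on 4 manipulation tasks, with a systematic analysis of key factors including pretraining diversity, wrist cameras, and the high-level/low-level split.

Main performance comparison

Task	Robot Only baseline	Robot + Human (this paper)	Improvement (absolute)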
Spice Rack	32%	71%	+39%
Dresser	25%	50%	+25%
Bussing	53%	63%	+10%
Sort Eggs	57%	78%	+21%

All numbers are taken verbatim from the original paper.
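The improvement column is an absolute difference in percentage points, not a relative change. The short sketch below recomputes it from the four table rows and also shows the relative ratio for comparison; the `gains` helper is an illustrative name, and no numbers beyond those in the table are used.

```python
# (robot-only %, robot + human %) for each task, copied from the table above.
results = {
    "Spice Rack": (32, 71),
    "Dresser": (25, 50),
    "Bussing": (53, 63),
    "Sort Eggs": (57, 78),
}

def gains(robot_only, robot_plus_human):
    """Return (absolute gain in percentage points, ratio of new to old score)."""
    absolute = robot_plus_human - robot_only
    ratio = robot_plus_human / robot_only
    return absolute, ratio

for task, (base, cotrained) in results.items():
    absolute, ratio = gains(base, cotrained)
    print(f"{task}: +{absolute} points, {ratio:.2f}x the robot-only score")
```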

Task performance results. **Figure 7 (original):** "Human to robot transfer finetuning π0.5. We evaluate performance across a suite of static and mobile tasks, each testing either scene, object, or task level generalization present only in the human data. We see clear human to robot transfer resulting in nearly double the score on the target tasks."

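The emergent effect of pretraining diversity

The experiments systematically vary the task/scene/embodiment diversity of the pretraining dataset (0% to 100%) and find:

At 0% to 25% diversity, co-training with human data yields almost no benefit, or even a slight drop
At 75% to 100% diversity, the improvement from human data jumps markedly
The effect is strongest once cross-embodiment robot data is added to pretraining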

"Performance of Robot Finetuning on Sort Eggs plateaus, even as the pretraining diversity improves. In contrast, Human+Robot Finetuning performance scales sharply with pretraining, suggesting that broader pretraining enables more effective transfer from human data."

Representation analysis (TSNE visualization)

TSNE representation analysis. **Figure 5 (original):** "VLA representation of human and robot data. We plot the latent embeddings of our VLA by performing a TSNE analysis on mean-pooled tokens from the final layer of the VLM backbone. With no pre-training, it is clear that the model has disjoint representations between human and robot data. But as pretraining becomes more diverse, latent overlap increases, which correlates with performance on our generalization tasks." The more diverse the pretraining, the more the human and robot latent embeddings merge, which is consistent with pretraining diversity being the source of the emergent transfer ability.
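The caption describes mean-pooling final-layer tokens and then inspecting overlap with TSNE. As a rough numeric companion to that picture, the sketch below mean-pools token embeddings and then, instead of TSNE, computes a simple k-nearest-neighbour mixing score. This swapped-in score is not from the paper; it is just one plausible way to quantify "latent overlap". The embeddings here are synthetic random data, not model outputs.

```python
import numpy as np

def mean_pool(token_embeddings):
    """Average over the token axis: (N, T, D) -> (N, D)."""
    return token_embeddings.mean(axis=1)

def knn_mixing_score(human, robot, k=5):
    """Average fraction of each point's k nearest neighbours from the other domain.

    Near 0 means disjoint clusters; near 0.5 means well mixed when the two
    sets have equal size. Illustrative metric, not the paper's analysis.
    """
    X = np.concatenate([human, robot], axis=0)
    labels = np.array([0] * len(human) + [1] * len(robot))
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbour
    nearest = np.argsort(dists, axis=1)[:, :k]
    other_domain = labels[nearest] != labels[:, None]
    return float(other_domain.mean())

# Synthetic stand-ins for final-layer tokens: 50 samples, 16 tokens, 8 dims.
rng = np.random.default_rng(0)
human_tokens = rng.normal(size=(50, 16, 8))
robot_tokens_disjoint = rng.normal(size=(50, 16, 8)) + 10.0  # shifted far away
robot_tokens_overlap = rng.normal(size=(50, 16, 8))          # same distribution

human_emb = mean_pool(human_tokens)
print("disjoint:", knn_mixing_score(human_emb, mean_pool(robot_tokens_disjoint)))
print("overlap: ", knn_mixing_score(human_emb, mean_pool(robot_tokens_overlap)))
```

A score like this could be tracked across pretraining-diversity levels alongside the TSNE plots, since a 2D projection alone can be hard to compare between runs.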

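Ablations

Ablation: target robot data vs. human data. On Sort Eggs and Dresser, human data performs close to, or on par with, target-task robot data; on Bussing, target robot data (65%) clearly outperforms human data (25%).

High-level vs. low-level division of labor

For the mobile manipulation tasks (Bussing, Dresser), both high-level subtask prediction and low-level action prediction are necessary; only co-training on both makes full use of the human demonstrations.

The role of wrist cameras

Wrist cameras clearly help on the fine-manipulation tasks (Bussing, Dresser) and have limited effect on the others (Spice, Eggs), which matches intuition: "some (but not all) tasks will benefit from the added observability of wrist cameras."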

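04 Limitations

Note: some of the limitations below are stated explicitly by the authors in the paper (stated); others are inferred from the method design (inferred).

Dependence on large-scale, diverse robot pretraining data (stated)

The emergence of human-to-robot transfer presupposes "vast datasets of robot teleoperation data in pretraining". In settings without a large-scale pretraining base, the benefit of this approach is expected to shrink substantially, which limits its generality.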

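Human gripper state is not estimated (stated)

The current implementation cannot estimate gripper open/close state directly from human hand keypoints, so gripper actions are learned entirely from robot data. This is a clear information gap that may hurt transfer on tasks requiring precise grasping.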
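One common way to train on samples that lack some action dimensions is to mask those dimensions out of the loss. The sketch below shows that pattern for the gripper case. It is an assumption about how such supervision could be wired up, not the paper's implementation: the two extra gripper dimensions, the action layout, and the plain squared-error loss (standing in for whatever objective the model actually uses) are all hypothetical.

```python
import numpy as np

MOTION_DIMS = 18   # from the summary: 6 per arm plus 6 for the base
GRIPPER_DIMS = 2   # assumption: one open/close value per arm, appended at the end

def masked_mse(pred, target, is_human):
    """Squared-error loss that ignores gripper dimensions on human samples.

    pred, target: (B, MOTION_DIMS + GRIPPER_DIMS) arrays.
    is_human: (B,) boolean flags marking samples from human video.
    """
    is_human = np.asarray(is_human, dtype=bool)
    mask = np.ones_like(target, dtype=float)
    mask[is_human, MOTION_DIMS:] = 0.0          # no gripper labels for human data
    squared_error = (pred - target) ** 2 * mask
    return squared_error.sum() / mask.sum()     # average over supervised entries only

# Toy batch: one human sample and one robot sample, errors only on gripper dims.
pred = np.zeros((2, MOTION_DIMS + GRIPPER_DIMS))
target = np.zeros((2, MOTION_DIMS + GRIPPER_DIMS))
target[:, MOTION_DIMS:] = 1.0
print(masked_mse(pred, target, is_human=[True, False]))
```

With this masking, gripper errors on the human sample contribute nothing, so gripper behaviour is shaped by robot samples alone, matching the limitation described above.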

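Human data scale is still limited (stated)

The current experiments use about 14 hours of episodic human demonstrations. The authors anticipate scaling to much larger human video collections, including passive recordings of everyday activity, and expect that continued scaling of models will unlock further emergent capabilities.

Limited benefit from human data on some tasks (inferred from results)

On Bussing, human data (25% success) falls far short of target robot data (65%), suggesting that cross-embodiment transfer has an inherent bottleneck on tasks with high motion complexity and large embodiment differences.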