SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

01 Motivation

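Robot manipulation is fundamentally a problem of 3D spatial perception and action planning. Yet mainstream VLA models (OpenVLA, Octo, RT-2-X, and others) learn actions only from 2D image tokens and do not explicitly model object position, depth, or spatial layout. On tasks that demand fine-grained spatial reasoning, such as stacking and precise placement, they fall clearly behind specialized methods.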

"Spatial understanding is the key to robot manipulation … we propose SpatialVLA, a spatial visual-language-action model that focuses on exploring spatial representations for robot manipulation."

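**Figure 1: SpatialVLA overview.** Given an image observation *o_t* and a task instruction *L*, the model processes the image with Ego3D Position Encoding, autoregressively predicts spatial action tokens, and de-tokenizes them into continuous actions *A_t* for control.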

78.1%: LIBERO average success rate (4 task suites, ranked first)

81.0%: SimplerEnv Google Robot zero-shot Visual Matching

34.4%: SimplerEnv WidowX zero-shot average success rate

21 Hz: inference frequency (real-time control)

02 Method

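SpatialVLA consists of two core modules: (1) Ego3D Position Encoding, which adds an encoding of depth-estimated 3D coordinates to the SigLIP visual tokens; and (2) Adaptive Action Grids, which discretize continuous 7D actions adaptively according to the action distribution of the training set and support transfer across robots.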

Ego3D Position Encoding

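Given an RGB image, ZoeDepth first estimates a depth map, and the camera intrinsics then back-project each pixel to a 3D coordinate P in the camera's own (egocentric) frame, so no extrinsic calibration is needed. The 3D positions are encoded with a sinusoidal function γ(·), mapped through an MLP, and added to the 2D semantic features X extracted by SigLIP: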

O_3d = X + MLP(γ(P))

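This design injects spatial awareness into the visual tokens in a plug-and-play manner. Because it needs no extrinsic camera calibration, it applies to arbitrary robot platforms.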
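The formula O_3d = X + MLP(γ(P)) can be illustrated per token with a minimal sketch. Everything beyond the formula is an assumption made for illustration: the function names, the pinhole back-projection with intrinsics `fx, fy, cx, cy`, the choice of frequencies 2^k inside γ, and the two-layer ReLU MLP with hand-written weights. The paper's actual layer sizes and frequency schedule may differ.

```python
import math

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a camera-frame 3D point (pinhole model)."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)

def gamma(point, num_freqs=4):
    """Sinusoidal encoding: sin and cos of each coordinate at frequencies 2^k (assumed)."""
    enc = []
    for coord in point:
        for k in range(num_freqs):
            enc.append(math.sin((2.0 ** k) * coord))
            enc.append(math.cos((2.0 ** k) * coord))
    return enc  # length 3 * 2 * num_freqs

def linear(x, weight, bias):
    """Dense layer y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def ego3d_token(feature, point, w1, b1, w2, b2, num_freqs=4):
    """O_3d = X + MLP(gamma(P)) for a single visual token."""
    hidden = [max(h, 0.0) for h in linear(gamma(point, num_freqs), w1, b1)]  # ReLU (assumed)
    offset = linear(hidden, w2, b2)
    return [x + o for x, o in zip(feature, offset)]

# Toy usage: one token at pixel (2, 1), 1.5 m away, with a 2-D feature and a tiny MLP.
point = backproject(2, 1, 1.5, fx=1.0, fy=1.0, cx=1.0, cy=1.0)  # (1.5, 0.0, 1.5)
w1 = [[0.01] * 24 for _ in range(4)]  # 24 inputs = 3 coords * 2 * 4 frequencies
w2 = [[0.1] * 4 for _ in range(2)]
fused = ego3d_token([0.5, -0.5], point, w1, [0.0] * 4, w2, [0.0] * 2)
```

Because the encoding is added to the features rather than concatenated, the token shape is unchanged, which is what lets the module slot into an existing vision-language backbone.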

Adaptive Action Grids

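Continuous 7D actions (translation x, y, z; rotation roll, pitch, yaw; gripper) are discretized into learnable tokens. The key innovation is adaptive binning: the translation is first converted to polar coordinates (φ, θ, r) to decouple direction from distance; a Gaussian is then fitted to each dimension and split into M equal-probability intervals, so that every bin covers the same share of training actions. This avoids the bins that conventional uniform (linear) binning wastes on the sparse tails of a long-tailed distribution.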
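A minimal sketch of the binning idea for a single action dimension, using only the Python standard library. The helper names, the spherical-coordinate convention (azimuth φ, polar angle θ), and the use of the population standard deviation are assumptions for illustration; the paper's normalization and its per-dimension bin counts are not reproduced here.

```python
import math
from bisect import bisect_right
from statistics import NormalDist, fmean, pstdev

def to_spherical(x, y, z):
    """Translation (x, y, z) -> (phi, theta, r): azimuth, polar angle, distance."""
    r = math.sqrt(x * x + y * y + z * z)
    phi = math.atan2(y, x)
    theta = math.acos(max(-1.0, min(1.0, z / r))) if r > 0 else 0.0
    return phi, theta, r

def equal_probability_edges(samples, num_bins):
    """Fit a Gaussian to the samples and return the num_bins - 1 interior edges
    that split its probability mass into equal parts."""
    dist = NormalDist(fmean(samples), pstdev(samples))
    return [dist.inv_cdf(i / num_bins) for i in range(1, num_bins)]

def to_bin(value, edges):
    """Index of the bin that contains value, in the range [0, len(edges)]."""
    return bisect_right(edges, value)

# Toy usage on a single action dimension.
samples = [-2.0, -1.0, 0.0, 1.0, 2.0]
edges = equal_probability_edges(samples, num_bins=4)
tokens = [to_bin(v, edges) for v in (-5.0, -0.5, 0.5, 5.0)]  # [0, 1, 2, 3]
```

The edges crowd together near the mean, where most actions fall, and spread out in the tails, so each token index is used about equally often under the fitted Gaussian.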

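Cross-robot transfer: when fine-tuning on a new robot, the Gaussians are refitted on the target dataset, and the pre-trained token embeddings are aligned to the new grid by trilinear interpolation. This preserves the spatial prior while adapting quickly to the new action distribution (Spatial Embedding Adaptation).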
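The interpolation step can be sketched as follows, assuming each pre-trained embedding sits at the centre of a cell in a 3D grid and the new grid's cell centres are expressed in the same coordinates. The function names, the nested-list layout `old_emb[i][j][k]`, and the clamping of points that fall outside the old grid are illustrative assumptions, not the authors' implementation.

```python
from bisect import bisect_right

def bracket(centers, value):
    """Return (lower index, weight of the upper neighbour) for value on a sorted axis,
    clamping values outside the old grid to its boundary."""
    if value <= centers[0]:
        return 0, 0.0
    if value >= centers[-1]:
        return len(centers) - 2, 1.0
    i = bisect_right(centers, value) - 1
    return i, (value - centers[i]) / (centers[i + 1] - centers[i])

def trilinear(old_axes, old_emb, point):
    """Interpolate old_emb[i][j][k] (one vector per old grid cell) at a new grid point."""
    (i, wx), (j, wy), (k, wz) = (bracket(a, p) for a, p in zip(old_axes, point))
    dim = len(old_emb[0][0][0])
    out = [0.0] * dim
    for di, ax in ((0, 1 - wx), (1, wx)):
        for dj, ay in ((0, 1 - wy), (1, wy)):
            for dk, az in ((0, 1 - wz), (1, wz)):
                corner = old_emb[i + di][j + dj][k + dk]
                for d in range(dim):
                    out[d] += ax * ay * az * corner[d]  # weight of this corner
    return out

def adapt_embeddings(old_axes, old_emb, new_axes):
    """Initialise an embedding for every cell centre of the new grid."""
    return [[[trilinear(old_axes, old_emb, (x, y, z)) for z in new_axes[2]]
             for y in new_axes[1]] for x in new_axes[0]]

# Toy usage: a 2x2x2 old grid whose 1-D "embedding" is x + 2y + 3z at each cell centre.
old_axes = ([0.0, 1.0], [0.0, 1.0], [0.0, 1.0])
old_emb = [[[[x + 2 * y + 3 * z] for z in old_axes[2]]
            for y in old_axes[1]] for x in old_axes[0]]
new_axes = ([0.25, 0.5, 0.75], [0.5], [0.0, 1.0])
new_emb = adapt_embeddings(old_axes, old_emb, new_axes)
```

Initialising the new tokens this way means that nearby actions on the new robot start from embeddings of nearby actions in pre-training, rather than from random vectors.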

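**Figure 7: Visualization of spatial grid cross-section features.** Spatial Embedding Adaptation aligns the pre-trained spatial grid features with those of the fine-tuned model, which improves initialization and speeds up convergence. Left: without adaptation; right: with adaptation, the feature distributions are more consistent.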

Pre-training setup

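The model builds on the PaliGemma 2 vision-language backbone, with SigLIP as the vision encoder. The pre-training data is a mixture of 1.1M real-robot episodes from Open X-Embodiment (OXE), spanning multiple robots (Google Fractal, BridgeV2, and others). The default action grid resolution is 8194 tokens, covering the translation, rotation, and gripper dimensions.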

03 Experiments

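The evaluation covers three dimensions: zero-shot control (SimplerEnv), adaptation to new robots (fine-tuning on Franka and WidowX), and spatial understanding (spatial layout tasks). In simulation, SimplerEnv includes the Google Robot and WidowX platforms, and LIBERO provides 4 task suites; the real-robot experiments cover 7 task suites and 16 tasks.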

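**Figure 3: Experimental setup.** The evaluation spans 7 robot learning scenarios, 16 real-robot tasks, and 48 simulation setups, and focuses on three core questions: zero-shot control, adaptability to new setups, and spatial understanding.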

SimplerEnv: Google Robot (Table I)

Method	Visual Matching	Variant Aggregation
RT-2-X	60.7%	—
OpenVLA	16.3%	46.2%
Octo-Base	17.0%	4.2%
RoboVLM (zero-shot)	72.7%	66.3%
π₀* (BF16 uniform)	88.0%	80.3%
SpatialVLA (zero-shot)	81.0%	69.6%
SpatialVLA (fine-tuning)	86.0%	77.9%

SimplerEnv: WidowX (Table II)

Method	Average success rate
RT-1-X	1.1%
Octo-Small	30.0%
OpenVLA	1.0%
RoboVLM (zero-shot)	13.5%
RoboVLM (fine-tuning)	31.3%
SpatialVLA (zero-shot)	34.4%
SpatialVLA (fine-tuning)	42.7%

LIBERO simulation benchmark (Table III)

Method	LIBERO-Spatial	LIBERO-Object	LIBERO-Goal	LIBERO-Long	Average
Diffusion Policy	78.3±1.1%	92.5±0.7%	68.3±1.2%	50.5±1.3%	72.4±0.7%
Octo fine-tuned	78.9±1.0%	85.7±0.9%	84.6±0.9%	51.1±1.3%	75.1±0.6%
OpenVLA fine-tuned	84.7±0.9%	88.4±0.8%	79.2±1.0%	53.7±1.3%	76.5±0.6%
TraceVLA fine-tuned	84.6±0.2%	85.2±0.4%	75.1±0.3%	54.1±1.0%	74.8±0.5%
SpatialVLA fine-tuned	88.2±0.5%	89.9±0.7%	78.6±0.6%	55.5±1.0%	78.1±0.7%
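As a quick sanity check on Table III, the Average column appears to be the unweighted mean of the four suite scores. The short script below recomputes it from the reported point estimates (standard errors omitted); the variable and function names are illustrative only.

```python
# Reported LIBERO success rates (%) from Table III: four suites, then the reported average.
table_iii = {
    "Diffusion Policy": ([78.3, 92.5, 68.3, 50.5], 72.4),
    "Octo fine-tuned": ([78.9, 85.7, 84.6, 51.1], 75.1),
    "OpenVLA fine-tuned": ([84.7, 88.4, 79.2, 53.7], 76.5),
    "TraceVLA fine-tuned": ([84.6, 85.2, 75.1, 54.1], 74.8),
    "SpatialVLA fine-tuned": ([88.2, 89.9, 78.6, 55.5], 78.1),
}

def suite_mean(rates):
    """Unweighted mean over the four LIBERO suites."""
    return sum(rates) / len(rates)

def consistent(rates, reported, tol=0.051):
    """True if the reported average matches the mean up to one-decimal rounding."""
    return abs(suite_mean(rates) - reported) <= tol

all_consistent = all(consistent(r, avg) for r, avg in table_iii.values())

# Gap between SpatialVLA and the best baseline on the reported averages.
best_baseline = max(avg for name, (_, avg) in table_iii.items()
                    if name != "SpatialVLA fine-tuned")
gap = table_iii["SpatialVLA fine-tuned"][1] - best_baseline
```

On the reported averages, SpatialVLA leads the strongest baseline (OpenVLA fine-tuned, 76.5%) by about 1.6 percentage points.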

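**Figure 4: Zero-shot control evaluation on the WidowX robot.** Across 7 task suites, the evaluation probes language grounding, semantic understanding, and motion perception, with variations in background, object pose, and moving distractors.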

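**Figure 6: Evaluation of spatial understanding.** Thanks to Ego3D Position Encoding, SpatialVLA clearly outperforms the baselines on tasks that require understanding spatial prompts (such as "left" and "in front") and complex spatial layouts.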

Ablation studies

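The pre-training ablation (Table IV) shows that replacing Adaptive Grids with linear 256-bin discretization lowers the Variant Aggregation metric by about 36.5%, and that removing the Ego3D encoding lowers Google Robot zero-shot performance by 12.7% to 15.2%. Raising the action grid resolution from 1026 to 8194 brings consistent gains.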

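The fine-tuning ablation (Table V) shows that, on the small LIBERO datasets, LoRA with Spatial Embedding Adaptation outperforms full-parameter fine-tuning. Spatial Embedding Adaptation alone contributes +4.6 percentage points on LIBERO-Spatial (83.6% → 88.2%).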

04 Limitations

Note: all of the limitations below are stated explicitly by the authors in the paper's Discussion & Limitations section.

Gaussian distribution modeling is suboptimal

"Is modeling data distributions as Gaussian optimal? We argue that Gaussian modeling is suboptimal, as it can lead to grid clustering on specific coordinate axes in extreme robot operation scenarios, such as single-axis motion, resulting in lost motion capabilities on other axes." In other words, in extreme cases such as single-axis motion, the Gaussian fit can make the grid cluster on certain axes and degrade motion capability on the others.

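Reliance on the current frame limits long-horizon performance

"As the model relies solely on current frame observations and history tokens for action prediction, it faces challenges in long-horizon tasks." The authors note that future work needs an efficient mechanism for perceiving historical information to strengthen long-sequence modeling.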

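Inference is slower than diffusion-based methods

"SpatialVLA achieves 21Hz inference speed, it is slower than diffusion decoding." Autoregressive token prediction costs more at inference time than diffusion-based policy networks, which becomes a bottleneck in scenarios with very strict real-time requirements.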

Uneven quality of pre-training data

"The variable quality of OXE data can hinder training. Therefore, future work exploring optimal data composition and distilling high-quality subsets from the heterogeneous robot data collections is vital for boosting model efficiency and generalizability."