Home /Research /Tenma: Robust Cross-Embodiment Robot Manipulation with Diffusion Transformer
MANIPULATION

Tenma: Robust Cross-Embodiment Robot Manipulation with Diffusion Transformer

Travis Davies, Yiqi Huang, Yunxin Liu, Xiang Chen, Huxian Liu, Luhui Hu

Year
2025
Access
Open access

Abstract

Scaling Transformer policies and diffusion models has advanced robotic manipulation, yet combining these techniques in lightweight, cross-embodiment learning settings remains challenging. We study design choices that most affect stability and performance for diffusion-transformer policies trained on heterogeneous, multimodal robot data, and introduce Tenma, a lightweight diffusion-transformer for bi-manual arm control. Tenma integrates multiview RGB, proprioception, and language via a cross-embodiment normalizer that maps disparate state/action spaces into a shared latent space; a Joint State-Time encoder for temporally aligned observation learning with inference speed boosts; and a diffusion action decoder optimized for training stability and learning capacity. Across benchmarks and under matched compute, Tenma achieves an average success rate of 88.95% in-distribution and maintains strong performance under object and scene shifts, substantially exceeding baseline policies whose best in-distribution average is 18.12%. Despite using moderate data scale, Tenma delivers robust manipulation and generalization, indicating the great potential for multimodal and cross-embodiment learning strategies for further augmenting the capacity of transformer-based imitation learning policies.

Keywords

cs.ROcs.AI

Related papers

Browse all MANIPULATION papers