PRISM: : Planning and Reasoning with Intent in Simulated Embodied Environments
Yunn Kang Lim, Pengzhan Sun, Ziyi Bai, Xun Xu, Angela Yao, Xulei Yang, Shijie Li
- Year
- 2026
- Access
- Open access
Abstract
When an LLM-based embodied agent fails at a household task, the culprit could be misidentified objects, forgotten sub-goals, or poor action sequencing -- yet existing benchmarks report only a single success rate, making it impossible to tell which cognitive module is responsible. We present PRISM, a diagnostic benchmark that reframes this problem: rather than asking only \textit{did the agent succeed?}, PRISM asks \textit{which capability is most likely responsible for failure?} Built on five photorealistic multi-room apartments (4--8 rooms each), PRISM structures 300 human-verified tasks into three capability tiers -- \textit{Basic Ability}, \textit{Reasoning Ability}, and \textit{Long-horizon Ability} -- that isolate perception-to-action grounding, implicit intent resolution, and sustained multi-step coordination respectively. PRISM exposes an agent-agnostic executable action API that allows arbitrary agents: LLM agents, VLM agents, symbolic planners, RL policies, and hybrid systems, to be evaluated end-to-end under the same benchmark protocol. To support deeper diagnosis, optional probes for perception, memory, and planning can be adopted, replaced, or bypassed entirely, enabling controlled component-level analysis when desired. Experiments on seven contemporary LLMs establish a clear hierarchy: explicit spatial grounding is not the dominant failure source under oracle perception, implicit intent resolution is a significant bottleneck for all model families, and long-horizon coordination exposes a stark capability cliff -- lightweight models collapse to as low as 20.0\% success while simultaneously consuming more tokens than their frontier counterparts, a signature of compensatory over-reasoning rather than genuine planning capability. Project page: \href{https://sj-li.com/PROJ/PRISM}{link}.
Keywords
Related papers
How to Relieve Distribution Shifts in Semantic Segmentation for Off-Road Environments
Ji-Hoon Hwang, Daeyoung Kim, Hyung-Suk Yoon +2 more
2026
Uncertainty-guided evolvable recognition framework for industrial robots via prototype-based fuzzy inference and evidence fusion
Yanrun Zhou, Zihao Lei, Guangrui Wen +4 more
Robotics and Computer-Integrated Manufacturing · 2026
Point cloud registration for non-destructive, high-resolution coating thickness measurement from 3D scans
Simon Duenser, Ivo Aschwanden, Raamadaas Krishnadas +2 more
Robotics and Computer-Integrated Manufacturing · 2026
Toward the intelligent robotics era: Multimodal flexible haptic sensors for advanced perception systems
Sili Ding, Feng Xu, Jie Chen +3 more
Progress in Materials Science · 2026