Home /Research /Butter-Bench: Evaluating LLM Controlled Robots for Practical Intelligence

OTHER

Butter-Bench: Evaluating LLM Controlled Robots for Practical Intelligence

Callum Sharrock, Lukas Petersson, Hanna Petersson, Axel Backlund, Axel Wennström, Kristoffer Nordström, Elias Aronsson

Year: 2025
Access: Open access

Abstract

We present Butter-Bench, a benchmark evaluating large language model (LLM) controlled robots for practical intelligence, defined as the ability to navigate the messiness of the physical world. Current state-of-the-art robotic systems use a hierarchical architecture with LLMs in charge of high-level reasoning, and a Vision Language Action (VLA) model for low-level control. Butter-Bench evaluates the LLM part in isolation from the VLA. Although LLMs have repeatedly surpassed humans in evaluations requiring analytical intelligence, we find humans still outperform LLMs on Butter-Bench. The best LLMs score 40% on Butter-Bench, while the mean human score is 95%. LLMs struggled the most with multi-step spatial planning and social understanding. We also evaluate LLMs that are fine-tuned for embodied reasoning and conclude that this training does not improve their score on Butter-Bench.

Keywords

cs.ROcs.AI

Butter-Bench: Evaluating LLM Controlled Robots for Practical Intelligence

Abstract

Keywords

Related papers

Statistical Learning Theory

Fractional Differential Equations

Applied Nonlinear Control

Genetic Programming: On the Programming of Computers by Means of Natural Selection