MANIPULATION

Search-Based Credit Assignment for Offline Preference-Based Reinforcement Learning

Xiancheng Gao, Yufeng Shi, Wengang Zhou, Houqiang Li

Publication year
2025
Access
Open access

Abstract

Offline reinforcement learning refers to the process of learning policies from fixed datasets, without requiring additional environment interaction. However, it often relies on well-defined reward functions, which are difficult and expensive to design. Human feedback is an appealing alternative, but its two common forms, expert demonstrations and preferences, have complementary limitations. Demonstrations provide stepwise supervision, but they are costly to collect and often reflect limited expert behavior modes. In contrast, preferences are easier to collect, but it is unclear which parts of a trajectory segment contribute most to the preference, leaving credit assignment unresolved. In this paper, we introduce a Search-Based Preference Weighting (SPW) scheme to unify these two feedback sources. For each transition in a preference-labeled trajectory, SPW searches for the most similar state-action pairs from expert demonstrations and directly derives stepwise importance weights based on their similarity scores. These weights are then used to guide standard preference learning, enabling more accurate credit assignment that traditional approaches struggle to achieve. We demonstrate that SPW enables effective joint learning from preferences and demonstrations, outperforming prior methods that leverage both feedback types on challenging robot manipulation tasks.
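The weighting idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the similarity measure, the weight normalization, or the loss, so the choices below (Euclidean nearest-neighbor search, an exponential similarity with a hypothetical `temperature` parameter, weights normalized to sum to 1, and a Bradley-Terry-style loss of the kind commonly used in preference-based RL) are assumptions made for illustration. All function names are hypothetical.

```python
import math

def nearest_expert_distance(pair, expert_pairs):
    """Euclidean distance from one state-action vector to its closest expert vector."""
    return min(math.dist(pair, e) for e in expert_pairs)

def stepwise_weights(segment, expert_pairs, temperature=1.0):
    """Assign each transition a weight that grows with its similarity to expert data.

    Assumption: similarity = exp(-distance / temperature), normalized to sum to 1.
    """
    sims = [
        math.exp(-nearest_expert_distance(p, expert_pairs) / temperature)
        for p in segment
    ]
    total = sum(sims)
    return [s / total for s in sims]

def weighted_preference_loss(rewards_a, weights_a, rewards_b, weights_b, a_preferred):
    """Bradley-Terry negative log-likelihood computed on weighted segment returns."""
    ret_a = sum(w * r for w, r in zip(weights_a, rewards_a))
    ret_b = sum(w * r for w, r in zip(weights_b, rewards_b))
    margin = ret_a - ret_b if a_preferred else ret_b - ret_a
    # -log(sigmoid(margin)) in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Toy data: each tuple is a concatenated (state, action) vector.
expert = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
segment_a = [(0.0, 0.0, 1.0), (5.0, 5.0, 0.0)]   # first step matches an expert pair
segment_b = [(4.0, 4.0, 0.0), (5.0, 5.0, 0.0)]   # both steps are far from expert data

w_a = stepwise_weights(segment_a, expert)
w_b = stepwise_weights(segment_b, expert)

# Predicted per-step rewards would come from a learned reward model; fixed here.
loss = weighted_preference_loss([1.0, 0.2], w_a, [0.3, 0.1], w_b, a_preferred=True)
print(w_a, w_b, loss)
```

In this sketch the transition that coincides with an expert pair receives almost all of the weight in `segment_a`, so its predicted reward dominates the weighted return used in the preference loss. A uniform weighting would instead spread credit evenly across the segment.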

Keywords

cs.AI, cs.LG
