Hindsight Trust Region Policy Optimization

Hanbo Zhang, Xuguang Lan, David Hsu, Nanning Zheng

发表年份: 2021
引用次数: 5
访问权限: 开放获取

摘要

Reinforcement Learning (RL) with sparse rewards is a major challenge. We pro- pose Hindsight Trust Region Policy Optimization (HTRPO), a new RL algorithm that extends the highly successful TRPO algorithm with hindsight to tackle the challenge of sparse rewards. Hindsight refers to the algorithm’s ability to learn from information across goals, including past goals not intended for the current task. We derive the hindsight form of TRPO, together with QKL, a quadratic approximation to the KL divergence constraint on the trust region. QKL reduces variance in KL divergence estimation and improves stability in policy updates. We show that HTRPO has similar convergence property as TRPO. We also present Hindsight Goal Filtering (HGF), which further improves the learning performance for suitable tasks. HTRPO has been evaluated on various sparse-reward tasks, including Atari games and simulated robot control. Experimental results show that HTRPO consistently outperforms TRPO, as well as HPG, a state-of-the-art policy 14 gradient algorithm for RL with sparse rewards.

关键词

Hindsight biasDivergence (linguistics)Reinforcement learningComputer scienceTask (project management)Constraint (computer-aided design)Mathematical optimizationTrust regionConvergence (economics)Variance (accounting)

Hindsight Trust Region Policy Optimization

摘要

关键词

相关论文

Statistical Learning Theory

Artificial intelligence: a modern approach

Applied Nonlinear Control

A new optimizer using particle swarm theory