

Learning Self-Correctable Policies and Value Functions from Demonstrations with Negative Sampling

Yuping Luo, Huazhe Xu, Tengyu Ma

Publication year: 2019
Citations: 4
Access: Open access

Abstract

Imitation learning followed by reinforcement learning is a promising paradigm for solving complex control tasks sample-efficiently. However, learning from demonstrations often suffers from the covariate shift problem, which results in cascading errors of the learned policy. We introduce a notion of conservatively-extrapolated value functions, which provably lead to policies with self-correction. We design an algorithm, Value Iteration with Negative Sampling (VINS), that practically learns such value functions with conservative extrapolation. We show that VINS can correct mistakes of the behavioral cloning policy on simulated robotics benchmark tasks. We also propose using VINS to initialize a reinforcement learning algorithm, which is shown to significantly outperform prior work in sample efficiency.
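The abstract does not spell out the VINS objective, so the following is only an illustrative sketch of one way "conservative extrapolation with negative sampling" could be read: states near a demonstration are sampled by adding noise, and the value function is pushed toward a target that is lower than the demonstration state's value by an amount proportional to the distance from it. The names `perturb`, `conservative_target`, and `negative_sampling_loss`, the Gaussian noise model, and the parameters `sigma` and `penalty` are all assumptions made for illustration, not details taken from this page.

```python
import math
import random


def perturb(state, sigma, rng):
    """Hypothetical negative sampler: add Gaussian noise to a demonstration state."""
    return [x + rng.gauss(0.0, sigma) for x in state]


def conservative_target(demo_value, demo_state, perturbed_state, penalty):
    """Assumed target for an off-demonstration state: the demonstration
    state's value minus a penalty proportional to the Euclidean distance."""
    dist = math.dist(demo_state, perturbed_state)
    return demo_value - penalty * dist


def negative_sampling_loss(value_fn, demo_states, demo_values, sigma, penalty, rng):
    """Mean squared error between value_fn at perturbed states and the
    conservative targets (one perturbed sample per demonstration state)."""
    total = 0.0
    for state, value in zip(demo_states, demo_values):
        negative = perturb(state, sigma, rng)
        target = conservative_target(value, state, negative, penalty)
        total += (value_fn(negative) - target) ** 2
    return total / len(demo_states)


if __name__ == "__main__":
    rng = random.Random(0)
    demos = [[0.0, 0.0], [1.0, 0.0]]
    values = [1.0, 2.0]
    # A constant value function ignores the distance penalty, so its loss is positive.
    print(negative_sampling_loss(lambda s: 1.5, demos, values, 0.1, 0.5, rng))
```

Under this reading, a value function that minimizes the loss decreases as the state moves away from the demonstrations, so a policy that greedily increases value is steered back toward demonstrated states.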

Keywords

Reinforcement learning, Computer science, Artificial intelligence, Benchmark (surveying), Extrapolation, Sample (material), Value (mathematics), Sampling (signal processing), Sample complexity, Covariate

