Supervised Policy Update.

Quan Vuong, Yiming Zhang, Keith W. Ross

发表年份: 2018
引用次数: 3

摘要

We propose a new sample-efficient methodology, called Supervised Policy Update (SPU), for deep reinforcement learning. Starting with data generated by the current policy, SPU optimizes over the proximal policy space to find a non-parameterized policy. It then solves a supervised regression problem to convert the non-parameterized policy to a parameterized policy, from which it draws new samples. There is significant flexibility in setting the labels in the supervised regression problem, with different settings corresponding to different underlying optimization problems. We develop a methodology for finding an optimal policy in the non-parameterized policy space, and show how Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) can be addressed by this methodology. In terms of sample efficiency, our experiments show SPU can outperform PPO for simulated robotic locomotion tasks.

关键词

Parameterized complexityFlexibility (engineering)Reinforcement learningComputer scienceSample (material)Artificial intelligenceRegressionSupervised learningSpace (punctuation)Policy learning

Supervised Policy Update.

摘要

关键词

相关论文

Statistical Learning Theory

Artificial intelligence: a modern approach

Applied Nonlinear Control

A new optimizer using particle swarm theory