
Supervised Policy Update

Quan Vuong, Yiming Zhang, Keith W. Ross

Publication year
2018
Citations
3

Abstract

We propose a new sample-efficient methodology, called Supervised Policy Update (SPU), for deep reinforcement learning. Starting with data generated by the current policy, SPU optimizes over the proximal policy space to find a non-parameterized policy. It then solves a supervised regression problem to convert the non-parameterized policy to a parameterized policy, from which it draws new samples. There is significant flexibility in setting the labels in the supervised regression problem, with different settings corresponding to different underlying optimization problems. We develop a methodology for finding an optimal policy in the non-parameterized policy space, and show how Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) can be addressed by this methodology. In terms of sample efficiency, our experiments show SPU can outperform PPO for simulated robotic locomotion tasks.
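The two-stage idea in the abstract (first find a non-parameterized target policy near the current one, then fit a parameterized policy to it by supervised learning) can be illustrated with a small tabular sketch. This is not the paper's algorithm: it assumes a textbook KL-penalized objective, max over π of E_π[A] − λ·KL(π ‖ π_old), whose per-state maximizer has the closed form π*(a|s) ∝ π_old(a|s)·exp(A(s,a)/λ). The function names, the random advantages, and the values of `lam`, `lr`, and `steps` are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def nonparametric_target(pi_old, advantages, lam=1.0):
    """Stage 1 (assumed objective, not necessarily the paper's):
    per-state maximizer of E_pi[A] - lam * KL(pi || pi_old),
    which is pi_old * exp(A / lam), renormalized."""
    return softmax(np.log(pi_old) + advantages / lam)

def fit_policy(theta, target, lr=1.0, steps=2000):
    """Stage 2: supervised regression of a softmax policy onto the target
    by gradient descent on the cross-entropy loss."""
    for _ in range(steps):
        pi = softmax(theta)
        theta = theta - lr * (pi - target)  # cross-entropy gradient w.r.t. logits
    return theta

# Toy problem: 4 states, 3 actions, made-up advantage estimates.
rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
theta = np.zeros((n_states, n_actions))      # current policy parameters (logits)
pi_old = softmax(theta)                      # current policy: uniform
advantages = rng.normal(size=(n_states, n_actions))
# Center so that the expected advantage under pi_old is zero in every state.
advantages -= (pi_old * advantages).sum(axis=1, keepdims=True)

target = nonparametric_target(pi_old, advantages, lam=1.0)
theta_new = fit_policy(theta.copy(), target)
pi_new = softmax(theta_new)

print("target policy:\n", np.round(target, 3))
print("fitted policy:\n", np.round(pi_new, 3))
```

In this tabular setting the softmax policy can represent the target exactly, so the regression step recovers it up to optimization error. With a neural network policy and sampled states the fit would only be approximate, and the choice of labels in the regression step is where, according to the abstract, different underlying optimization problems come in.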

Keywords

Parameterized complexity, Flexibility (engineering), Reinforcement learning, Computer science, Sample (material), Artificial intelligence, Regression, Supervised learning, Space (punctuation), Policy learning
