Proximal Policy Optimization with Future rewards

Chengcheng Yu, Lijun Zhang, Dawei Yin, Dezhong Peng, Haixiao Huang

Year of publication: 2021
Citations: 3

Abstract

Among current reinforcement learning algorithms, the Policy Gradient (PG) algorithm [7] is one of the most traditional and widely used, but it suffers from unstable gradient estimation. The more recent Proximal Policy Optimization (PPO) algorithm [8] solves this stability problem, but its policy updates are slow and it tends to overfit when trained for too many iterations. This article proposes a new method: following Asynchronous Advantage Actor-Critic (A3C) [9], the basic PPO algorithm is trained in parallel, and future rewards are taken into account by folding them into the current reward. Experiments on the OpenAI Gym platform and on a robotic-arm task of grasping at arbitrary positions show that the method trains faster while also avoiding overfitting during long training runs.
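The abstract does not give the formula used to fold future rewards into the current one, so the following is only a minimal sketch under stated assumptions: it uses the standard discounted return G_t = r_t + gamma * G_{t+1} together with the standard PPO clipped surrogate objective, which may differ from the authors' exact formulation. The function names and the values of `gamma` and `eps` are hypothetical choices made for illustration.

```python
import math


def discounted_returns(rewards, gamma=0.9):
    """Fold later rewards into each step: G_t = r_t + gamma * G_{t+1}.

    Assumption: a plain discounted sum stands in for the paper's
    "future rewards"; the paper's own weighting is not stated in the abstract.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk the episode backwards so each step sees the already-summed future.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def clipped_surrogate(new_logp, old_logp, advantages, eps=0.2):
    """Mean of the standard PPO clipped objective (a quantity to maximize)."""
    total = 0.0
    for nl, ol, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(nl - ol)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        # Taking the smaller term removes the incentive for large policy steps.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

For example, `discounted_returns([1.0, 1.0, 1.0], gamma=0.5)` yields `[1.75, 1.5, 1.0]`: the first step is credited with its own reward plus the discounted rewards that follow it. In a parallel setup of the kind the abstract attributes to A3C, each worker could compute such returns on its own trajectories before the shared policy is updated.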

Keywords

Overfitting; Reinforcement learning; Computer science; Stability (learning theory); Asynchronous communication; Mathematical optimization; Position (finance); Disadvantage; Artificial intelligence; Trust region

Related papers

Statistical Learning Theory

Artificial intelligence: a modern approach

Applied Nonlinear Control

A new optimizer using particle swarm theory