Automation of noise sampling in deep reinforcement learning

Kunal Karda, Namit Dubey, Abhas Kanungo, Varun Gupta

Year of publication: 2022
Citations: 8

Abstract

Actor-critic models are generally prone to overestimated Q-values and sub-optimal policies. Our proposed approach builds on a value-based deep reinforcement learning algorithm, the twin delayed deep deterministic policy gradient algorithm (TD3). The approach is used to solve complex reinforcement learning problems in which a half-humanoid robot, an ant, or a half-cheetah must cover a path. Such problems can only be solved by an algorithm that works in continuous action spaces and does not add much delay when results are propagated during model inference. The proposed model has been adapted to converge faster to the optimal Q-values. TD3 uses two deep neural networks to learn two Q-values, Q1 and Q2. In the proposed approach, the average of these Q-values is taken as the input for the final Q-value, unlike other reinforcement learning algorithms such as DDPG, which is prone to overestimating Q-values. The proposed approach also introduces a self-adjusting noise-clipping function, which makes it harder for the policy to exploit Q-function errors and further improves performance.
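The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `smoothed_target_action` and `td_target`, the scalar inputs, and all numeric values are assumptions made for illustration. The abstract does not describe how the noise clip adjusts itself, so the clip bound is left as a plain parameter. The `"min"` mode shows the clipped double-Q target of standard TD3 for comparison with the averaged target described above.

```python
import random


def smoothed_target_action(mu_next, sigma, clip, max_action, rng=random):
    """Target policy smoothing: add clipped Gaussian noise to the target action.

    `clip` is a fixed bound here; the paper's self-adjusting rule is not
    specified in the abstract, so none is assumed.
    """
    noise = rng.gauss(0.0, sigma)
    noise = max(-clip, min(clip, noise))          # clip the noise itself
    action = mu_next + noise
    return max(-max_action, min(max_action, action))  # keep action in range


def td_target(reward, done, q1_next, q2_next, gamma=0.99, mode="mean"):
    """Bootstrapped target built from the two critic estimates.

    mode="mean": average of Q1 and Q2, as described in the abstract.
    mode="min":  minimum of Q1 and Q2, as in standard TD3.
    """
    if mode == "mean":
        q_next = 0.5 * (q1_next + q2_next)
    elif mode == "min":
        q_next = min(q1_next, q2_next)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return reward + gamma * (1.0 - done) * q_next


# Illustrative values only (hypothetical, not from the paper).
y_mean = td_target(1.0, 0.0, 10.0, 12.0, gamma=0.5, mode="mean")
y_min = td_target(1.0, 0.0, 10.0, 12.0, gamma=0.5, mode="min")
```

With these inputs the averaged target is 1 + 0.5 × 11 = 6.5 and the minimum-based target is 1 + 0.5 × 10 = 6.0, so the average is never lower than the minimum whenever the two critics disagree.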

Keywords

Reinforcement learning; Q-learning; Computer science; Noise (video); Bellman equation; Artificial neural network; Artificial intelligence; Function (biology); Mathematical optimization; Algorithm
