Customised Pearlmutter Propagation: A Hardware Architecture for Trust Region Policy Optimisation

Shengjia Shao, Wayne Luk

Year: 2017
Citations: 15

Abstract

Reinforcement Learning (RL) is an area of machine learning in which an agent interacts with an environment by making sequential decisions. The agent receives rewards from the environment and uses them to find an optimal policy that maximises the reward. Trust Region Policy Optimisation (TRPO) is a recent policy optimisation algorithm that achieves superior results on various RL benchmarks, but it is computationally expensive. This paper proposes Customised Pearlmutter Propagation (CPP), a novel hardware architecture that accelerates TRPO on an FPGA. We use the Pearlmutter Algorithm to address the key computational bottleneck of TRPO in a hardware-efficient manner, avoiding symbolic differentiation with change of variables. Experimental evaluation using robotic locomotion benchmarks demonstrates that the proposed CPP architecture, implemented on a Stratix-V FPGA, can achieve a speed-up of up to 20 times over the Keras deep learning library with a Theano backend running with 6 threads on a Core i7-5930K CPU.
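The abstract does not spell out the Pearlmutter Algorithm, so the following is only a minimal sketch of its general idea, not the CPP architecture itself. Pearlmutter's technique computes a Hessian-vector product Hv as the directional derivative of the gradient along v, without ever forming H. In TRPO implementations this kind of product (with the Fisher information matrix in place of H) is typically evaluated repeatedly inside a conjugate gradient solver, which is presumably the bottleneck the abstract refers to. The toy function f(x, y) = x^2 y + y^3, the `Dual` class, and the helper `hessian_vector_product` are all assumptions introduced for illustration.

```python
class Dual:
    """Number a + b*eps with eps**2 == 0; `tan` carries the directional derivative."""

    def __init__(self, val, tan=0.0):
        self.val = val
        self.tan = tan

    @staticmethod
    def lift(x):
        # Promote plain numbers to duals with zero tangent.
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.val + other.val, self.tan + other.tan)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule for the tangent part.
        return Dual(self.val * other.val,
                    self.val * other.tan + self.tan * other.val)

    __rmul__ = __mul__


def grad_f(x, y):
    # Hand-derived gradient of the toy function f(x, y) = x^2 * y + y^3.
    return [2 * x * y, x * x + 3 * y * y]


def hessian_vector_product(grad_fn, theta, v):
    # Evaluate the gradient at theta + eps * v; the eps-coefficient of each
    # component is exactly (H v)_i, so H is never built explicitly.
    duals = [Dual(t, d) for t, d in zip(theta, v)]
    return [Dual.lift(g).tan for g in grad_fn(*duals)]


hv = hessian_vector_product(grad_f, [1.0, 2.0], [3.0, 4.0])
print(hv)  # [20.0, 54.0]
```

For this toy function the explicit Hessian is [[2y, 2x], [2x, 6y]], which at (1, 2) multiplied by (3, 4) gives (20, 54), matching the result above. The cost is that of one extra gradient-like pass rather than a full Hessian, which is what makes the approach attractive for a fixed-function hardware pipeline.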

Keywords

Stratix, Bottleneck, Reinforcement learning, Field-programmable gate array, Computer science, Architecture, Key (lock), Hardware architecture, Speedup, Computer architecture
