How to Efficiently Train Your AI Agent? Characterizing and Evaluating Deep Reinforcement Learning on Heterogeneous Platforms

Yuan Meng, Yang Yang, Sanmukh R. Kuppannagari, Rajgopal Kannan, Viktor K. Prasanna

Publication year
2020
Citations
5

Abstract

Deep Reinforcement Learning (Deep RL) is a key technology in several domains such as self-driving cars, robotics, and surveillance. In Deep RL, an agent uses a Deep Neural Network (DNN) model to learn how to interact with the environment to achieve a certain goal. The efficiency of running a Deep RL algorithm on a hardware architecture depends on several factors, including (1) the suitability of the hardware architecture for the kernels and computation patterns that are fundamental to Deep RL; (2) the capability of the hardware architecture's memory hierarchy to minimize data-communication latency; and (3) the ability of the hardware architecture to hide overheads introduced by the deeply nested, highly irregular computation characteristics of Deep RL algorithms. GPUs have been popular for accelerating RL algorithms; however, they fail to optimally satisfy the above-mentioned requirements. A few recent works have developed highly customized accelerators for specific Deep RL algorithms. However, they cannot be generalized easily to the plethora of Deep RL algorithms and DNN model choices that are available. In this paper, we explore the possibility of developing a unified framework that can accelerate a wide range of Deep RL algorithms, including variations in training methods or DNN model structures. We take one step towards this goal by defining a domain-specific high-level abstraction for a widely used, broad class of Deep RL algorithms: on-policy Deep RL. Furthermore, we provide a systematic analysis of the performance of state-of-the-art on-policy Deep RL algorithms on CPU-GPU and CPU-FPGA platforms. We target two representative algorithms, PPO and A2C, in the application areas of robotics and games. We show that an FPGA-based custom accelerator achieves up to 24× (PPO) and 8× (A2C) speedups on training tasks, and 17× (PPO) and 2.1× (A2C) improvements on overall throughput, respectively.

Keywords

Computer science, Reinforcement learning, Deep learning, Artificial intelligence, Deep neural networks, Field-programmable gate array, Memory hierarchy, Computation, Abstraction, Robotics
