Minimizing GPU Kernel Launch Overhead in Deep Learning Inference on Mobile GPUs

Sumin Kim, Seunghwan Oh, Youngmin Yi

Year
2021
Citations
19

Abstract

The need for on-device, real-time deep learning inference is increasing as deep learning on edge devices such as smartphones and robots becomes popular. Although hardware acceleration on NPUs is attracting more attention, recent mobile GPUs are fast enough to offer the potential for real-time inference of many CNNs. In this paper, we first analyze the inference time of widely used CNNs on recent mobile GPUs and reveal that GPU kernel launches incur significant overhead. We then identify various factors that cause the kernel launch overhead, from which we formulate a performance model that predicts the kernel flush period that minimizes this overhead. Our experimental results show speedups of up to 64% and 31% in the inference of various CNNs with TensorFlow Lite and ARM Compute Library on the Adreno 650 GPU and the Mali G76 GPU.

Keywords

Computer science, Kernel (algebra), Inference, Overhead (engineering), Deep learning, Artificial intelligence, General-purpose computing on graphics processing units, Mobile device, CUDA, Parallel computing
