CPU-GPU Layer-Switched Low Latency CNN Inference
| Publication date | 2022 |
|---|---|
| Book title | 2022 25th Euromicro Conference on Digital System Design |
| Book subtitle | DSD 2022: 31 August-2 September 2022, Maspalomas, Spain: proceedings |
| Event | 2022 25th Euromicro Conference on Digital System Design |
| Pages (from-to) | 324-331 |
| Publisher | Piscataway, NJ: IEEE Computer Society |
| Abstract | Convolutional Neural Network (CNN) inference on Heterogeneous Multi-Processor System-on-Chips (HMPSoCs) in edge devices represents cutting-edge embedded machine learning. The embedded CPU and GPU within an HMPSoC can both perform CNN inference. However, the common practice is to run a CNN entirely on whichever HMPSoC component (CPU or GPU) provides the best performance (lowest latency) for that CNN. CNNs are not monolithic; they are composed of several layers of different types. Some of these layers have lower latency on the CPU, while others execute faster on the GPU. In this work, we investigate the reason behind this observation. We also propose a CNN execution that switches between the CPU and the GPU at layer granularity, wherein each CNN layer executes on the component that provides it with the lowest latency. Switching back and forth between the CPU and the GPU mid-inference introduces additional overhead (delay). Despite this overhead, we show that CPU-GPU layer-switched execution results in, on average, 4.72% lower CNN inference latency on the Khadas VIM 3 board with its Amlogic A311D HMPSoC. |
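The layer-switched execution described in the abstract can be sketched as a small dynamic program: given measured per-layer latencies on the CPU and the GPU plus a fixed CPU-GPU switch overhead, choose the per-layer assignment that minimizes total inference latency. This is an illustrative reconstruction of the idea only; the function name and all numbers are assumptions, not the paper's actual implementation or measurements.

```python
def best_layer_schedule(cpu_ms, gpu_ms, switch_ms):
    """Pick a CPU (0) / GPU (1) component per layer to minimize total latency.

    Hypothetical sketch: dp[c] holds the minimal latency of executing all
    layers so far with the current layer mapped to component c; switching
    components between consecutive layers costs switch_ms extra.
    """
    dp = [cpu_ms[0], gpu_ms[0]]  # first layer: no prior switch possible
    choice = []                  # back-pointers to recover the schedule
    for lc, lg in zip(cpu_ms[1:], gpu_ms[1:]):
        cost = (lc, lg)
        new_dp, back = [0.0, 0.0], [0, 0]
        for c in (0, 1):
            stay = dp[c] + cost[c]                 # previous layer on same component
            move = dp[1 - c] + cost[c] + switch_ms  # previous layer on the other one
            if stay <= move:
                new_dp[c], back[c] = stay, c
            else:
                new_dp[c], back[c] = move, 1 - c
        dp = new_dp
        choice.append(back)
    # Walk the back-pointers from the cheaper final component.
    end = 0 if dp[0] <= dp[1] else 1
    schedule = [end]
    for back in reversed(choice):
        schedule.append(back[schedule[-1]])
    schedule.reverse()
    return min(dp), schedule

# Toy example (made-up latencies in ms): the middle layer is far cheaper on
# the GPU, so two switches (overhead 1 ms each) still pay off.
total, schedule = best_layer_schedule([2, 10, 2], [8, 1, 8], 1.0)
# total = 2 + 1 + 1 + 1 + 2 = 7.0, schedule = [0, 1, 0] (CPU, GPU, CPU)
```

This mirrors the trade-off the paper highlights: a switch is only worthwhile when the per-layer latency gap on the other component exceeds the switching overhead.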
| Document type | Conference contribution |
| Language | English |
| Published at | https://doi.org/10.1109/DSD57027.2022.00051 |
| Downloads | CPU-GPU_Layer-Switched_Low_Latency_CNN_Inference (Final published version) |
