2.5 A 16nm 5.7TOPS CNN Processor Supporting Bi-Directional FPN for Small-Object Detection on High-Resolution Videos
Yu-Chun Ding, Chia-Yu Chang, Chun-Yeh Lin, Hui-Yun Tsai, Hao-Jiun Tu, Yu-Ching Su, Tsung-Han Hsieh, Wen-Ching Chen, Nian-Shyang Chang, Chun-Pin Lin, Chi-Shi Chen, Chao-Tsung Huang
- Year: 2025
Abstract
Object detection is vital in intelligent systems such as autonomous vehicles, UAVs, VR/AR, and smart robots. Detecting small objects is particularly crucial and can be life-saving for ADAS, as it helps maintain awareness of distant objects to ensure safe following distances. As illustrated in Fig. 2.5.1, distant pedestrians may occupy fewer than one thousand pixels in a 2M-resolution image, making them hard to detect with the low-resolution inputs and shallow networks supported in prior works. EfficientDet-D3 [6] significantly improves detection precision for small objects by using a high-resolution 896×896 input with its deep 77-layer backbone and an advanced multi-layer stacked bidirectional feature pyramid network (Bi-FPN). Its mean average precision for small objects (mAPs) reaches 28.7% for object areas under 32×32 pixels in the COCO dataset [7]. However, this precision comes with a substantial increase in memory cost on existing accelerators, limiting the wide deployment of accurate small-object detection. Additionally, the deep backbone involves diverse operations whose varied behaviors reduce hardware efficiency. More computing power is also required for inference with deeper networks (more layers) and higher resolution. In this work, we present a memory-efficient and energy-efficient CNN processor to support deep-layer backbone inference with Bi-FPN on high-resolution inputs for high-precision small-object detection.
This chip features: 1) a flow-model co-optimized Bi-FPN implementation with orientation-interleaved causally-processed (OICP) modelling to reduce the memory cost for feature maps (FMs); 2) a bandwidth-optimized backbone scheduling with FM re-accessing and re-computing (RARC) to reduce external memory access (EMA); 3) a reconfigurable tensor engine (RTE) to improve compute utilization for diverse operations; and 4) a low-toggle sign-magnitude-two's-complement (SMTC) processing element (PE) design to reduce power consumption for MACs.
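The abstract does not describe how the SMTC PE works internally, so the following is only a minimal sketch of the general motivation behind sign-magnitude encoding, not the chip's design. It counts bit toggles on an 8-bit bus for a stream of small values that alternate in sign, once in two's complement and once in sign-magnitude. The function names, the 8-bit width, and the sample stream are all assumptions chosen for illustration.

```python
def twos_complement(value: int, bits: int = 8) -> int:
    """Encode a signed integer as a two's-complement bit pattern."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not lo <= value <= hi:
        raise ValueError(f"{value} does not fit in {bits} bits")
    return value & ((1 << bits) - 1)


def sign_magnitude(value: int, bits: int = 8) -> int:
    """Encode a signed integer as a sign bit followed by its magnitude."""
    limit = (1 << (bits - 1)) - 1
    if abs(value) > limit:
        raise ValueError(f"{value} does not fit in {bits} bits")
    sign = 1 if value < 0 else 0
    return (sign << (bits - 1)) | abs(value)


def toggles(codes: list[int]) -> int:
    """Count bit flips between consecutive bit patterns (Hamming distance)."""
    return sum(bin(a ^ b).count("1") for a, b in zip(codes, codes[1:]))


# Hypothetical operand stream: small magnitudes with alternating sign.
samples = [1, -1, 2, -2, 1, -1]

tc_toggles = toggles([twos_complement(v) for v in samples])
sm_toggles = toggles([sign_magnitude(v) for v in samples])

print("two's complement toggles:", tc_toggles)  # 35
print("sign-magnitude toggles:  ", sm_toggles)  # 9
```

In two's complement, a sign change on a small value flips nearly every upper bit because of sign extension, whereas in sign-magnitude only the sign bit and a few low-order bits change. Fewer toggles generally mean less dynamic switching power, which is the kind of saving a low-toggle PE would target.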