TY - JOUR
T1 - A High-Performance Accelerator for Super-Resolution Processing on Embedded GPU
AU - Zhao, Wenqian
AU - Bai, Yang
AU - Sun, Qi
AU - Li, Wenbo
AU - Zheng, Haisheng
AU - Jiang, Nianjuan
AU - Lu, Jiangbo
AU - Yu, Bei
AU - Wong, Martin D.F.
N1 - Funding Information:
This work was supported in part by the Research Grants Council of Hong Kong, SAR, under Grant CUHK24209017; in part by the Innovation and Technology Fund under Grant PRP/065/20FX; and in part by SmartMore. A preliminary version was presented at the IEEE/ACM International Conference on Computer-Aided Design (ICCAD) in 2021 [1] [DOI: 10.1109/ICCAD51958.2021.9643472]. This article was recommended by Associate Editor T. Mitra.
Publisher Copyright:
© 2023 IEEE.
PY - 2023/10
Y1 - 2023/10
N2 - Over the past few years, super-resolution (SR) processing has achieved astonishing progress along with the development of deep learning. Nevertheless, the rigorous requirement for real-time inference, especially for video tasks, poses a harsh challenge for both the model architecture design and the hardware-level implementation. In this article, we propose a hardware-aware acceleration on embedded GPU devices as a full-stack SR deployment framework. The most critical stage with dictionary learning applied in the SR flow is analyzed in detail and optimized with a tailored dictionary slimming strategy. Moreover, we delve into the programming architecture of the hardware while analyzing the model structure, optimizing the computation kernels to reduce inference latency and maximize throughput given restricted computing power. In addition, we further accelerate the model with 8-bit integer inference by quantizing the weights in the compressed model. An adaptive 8-bit quantization flow for the SR task enables the quantized model to achieve results comparable to the full-precision baselines. With the help of our approaches, the computation and communication bottlenecks in deep dictionary learning-based SR models can be overcome effectively. Experiments on both the edge embedded device NVIDIA NX and the 2080Ti show that our framework significantly exceeds the performance of the state-of-the-art NVIDIA TensorRT and can achieve real-time performance.
AB - Over the past few years, super-resolution (SR) processing has achieved astonishing progress along with the development of deep learning. Nevertheless, the rigorous requirement for real-time inference, especially for video tasks, poses a harsh challenge for both the model architecture design and the hardware-level implementation. In this article, we propose a hardware-aware acceleration on embedded GPU devices as a full-stack SR deployment framework. The most critical stage with dictionary learning applied in the SR flow is analyzed in detail and optimized with a tailored dictionary slimming strategy. Moreover, we delve into the programming architecture of the hardware while analyzing the model structure, optimizing the computation kernels to reduce inference latency and maximize throughput given restricted computing power. In addition, we further accelerate the model with 8-bit integer inference by quantizing the weights in the compressed model. An adaptive 8-bit quantization flow for the SR task enables the quantized model to achieve results comparable to the full-precision baselines. With the help of our approaches, the computation and communication bottlenecks in deep dictionary learning-based SR models can be overcome effectively. Experiments on both the edge embedded device NVIDIA NX and the 2080Ti show that our framework significantly exceeds the performance of the state-of-the-art NVIDIA TensorRT and can achieve real-time performance.
KW - Edge computing
KW - neural network compression
KW - super-resolution (SR)
UR - http://www.scopus.com/inward/record.url?scp=85148433748&partnerID=8YFLogxK
U2 - 10.1109/TCAD.2023.3241110
DO - 10.1109/TCAD.2023.3241110
M3 - Journal article
AN - SCOPUS:85148433748
SN - 0278-0070
VL - 42
SP - 3210
EP - 3223
JO - IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
JF - IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
IS - 10
ER -