FADEC: FPGA-based Acceleration of Video Depth Estimation by HW/SW Co-design

From arXiv. 9 pages, 8 figures, 3 tables. FPT 2022 (full paper). Program: https://fpt22.hkust.edu.hk/program#tools, GitHub: https://github.com/casys-utokyo/fadec, Slides: https://speakerdeck.com/hashi0203/sw-co-design-fpt-2022-8082a83d-3167-461c-8560-60f77959a3d5, Movie: https://youtu.be/NFULXQeu6Vw, Profile: https://n-hassy.info

3D reconstruction from videos has become increasingly popular for various applications, including navigation for autonomous driving of robots and drones, augmented reality (AR), and 3D modeling. This task often combines traditional image/video processing algorithms and deep neural networks (DNNs). Although recent developments in deep learning have improved the accuracy of the task, the large number of calculations involved results in low computation speed and high power consumption. Although there are various domain-specific hardware accelerators for DNNs, it is not easy to accelerate the entire process of applications that alternate between traditional image/video processing algorithms and DNNs. Thus, FPGA-based end-to-end acceleration is required for such complicated applications in low-power embedded environments. This paper proposes a novel FPGA-based accelerator for DeepVideoMVS, a DNN-based depth estimation method for 3D reconstruction. We employ HW/SW co-design to appropriately utilize heterogeneous components in modern SoC FPGAs, such as programmable logic (PL) and CPU, according to the inherent characteristics of the method. As some operations are unsuitable for hardware implementation, we determine the operations to be implemented in software through analyzing the number of times each operation is performed and its memory access pattern, and then considering comprehensive aspects: the ease of hardware implementation and degree of expected acceleration by hardware. The hardware and software implementations are executed in parallel on the PL and CPU to hide their execution latencies. The proposed accelerator was developed on a Xilinx ZCU104 board by using NNgen, an open-source high-level synthesis (HLS) tool. Experiments showed that the proposed accelerator operates 60.2 times faster than the software-only implementation on the same FPGA board with minimal accuracy degradation.
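
The partitioning idea in the abstract (count how often each operation runs, look at its memory access pattern, then weigh ease of hardware implementation against expected speedup) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Op` fields, the `min_calls` threshold, the example operation names and numbers, and the idealized `overlapped_latency` model are hypothetical and are not taken from the paper or from NNgen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """Hypothetical profile of one operation in the pipeline."""
    name: str
    calls_per_frame: int   # how many times the op runs per frame
    regular_access: bool   # True if memory access is sequential/predictable
    hw_friendly: bool      # True if it is easy to implement in hardware


def assign_target(op: Op, min_calls: int = 10) -> str:
    """Return "PL" or "CPU" for one op using an assumed, simplistic rule."""
    if op.hw_friendly and op.regular_access and op.calls_per_frame >= min_calls:
        return "PL"
    return "CPU"


def partition(ops):
    """Group op names by assigned target, preserving input order."""
    plan = {"PL": [], "CPU": []}
    for op in ops:
        plan[assign_target(op)].append(op.name)
    return plan


def overlapped_latency(hw_ms: float, sw_ms: float) -> float:
    """Idealized per-frame latency when PL and CPU work run fully in parallel.

    Assumes no data dependencies or synchronization overhead, so the
    slower side alone determines the latency.
    """
    return max(hw_ms, sw_ms)


# Made-up example profile; these are NOT measurements from the paper.
EXAMPLE_OPS = [
    Op("dense_convolution", calls_per_frame=200, regular_access=True, hw_friendly=True),
    Op("irregular_warp", calls_per_frame=64, regular_access=False, hw_friendly=False),
    Op("rare_normalization", calls_per_frame=4, regular_access=True, hw_friendly=False),
]

if __name__ == "__main__":
    print(partition(EXAMPLE_OPS))
    print(overlapped_latency(30.0, 20.0))
```

In this toy model, frequently executed operations with regular memory access go to the programmable logic, and everything else stays on the CPU. Running the two sides in parallel costs `max(hw_ms, sw_ms)` rather than their sum, which is the sense in which parallel execution hides latency.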
