In this paper, we investigate how field-programmable gate arrays (FPGAs) can serve as hardware accelerators for real-time semantic segmentation tasks relevant to autonomous driving. Considering compressed versions of the ENet convolutional neural network architecture, we demonstrate a fully on-chip deployment with a latency of 4.9 ms per image, using less than 30% of the available resources on a Xilinx ZCU102 evaluation board. The latency is reduced to 3 ms per image when the batch size is increased to ten, corresponding to the use case where the autonomous vehicle receives inputs from multiple cameras simultaneously. We show that aggressive filter reduction, heterogeneous quantization-aware training, and an optimized implementation of convolutional layers significantly reduce power consumption and resource utilization while maintaining accuracy on the Cityscapes dataset.
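To put the quoted latencies in perspective, the following minimal sketch converts them into approximate throughput figures. It is an illustration only, not the authors' measurement procedure: the helper names `throughput_fps` and `batch_time_ms` are hypothetical, and the calculation assumes that images are processed back to back and that the 3 ms figure is a per-image latency amortized over a batch of ten.

```python
def throughput_fps(latency_ms_per_image: float) -> float:
    """Images per second implied by a per-image latency in milliseconds.

    Assumption: images are processed back to back, with no extra overlap.
    """
    return 1000.0 / latency_ms_per_image


def batch_time_ms(latency_ms_per_image: float, batch_size: int) -> float:
    """Total time for one batch, assuming the per-image latency is amortized."""
    return latency_ms_per_image * batch_size


# Latency figures quoted in the abstract.
single_image_fps = throughput_fps(4.9)   # batch size 1
batched_fps = throughput_fps(3.0)        # batch size 10
batch_total_ms = batch_time_ms(3.0, 10)  # time for a full batch of ten

print(f"batch size 1:  {single_image_fps:.1f} images/s")
print(f"batch size 10: {batched_fps:.1f} images/s")
print(f"one batch of ten images: {batch_total_ms:.1f} ms")
```

Under these assumptions, a batch of ten camera frames would complete in roughly 30 ms, which gives a rough sense of how the batched configuration relates to a multi-camera frame rate.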