As scene segmentation systems reach visually accurate results, many recent papers focus on making these network architectures faster, smaller and more efficient. In particular, studies often aim at designing 'real-time' systems. Achieving this goal is particularly relevant in the context of real-time video understanding for autonomous vehicles and robots. In this paper, we argue that the commonly used performance metric of mean Intersection over Union (mIoU) does not fully capture the information required to estimate the true performance of these networks when they operate in 'real-time'. We propose a change of objective in the segmentation task, together with an associated metric, that encapsulates this missing information in the following way: we propose to predict the future output segmentation map that will match the future input frame at the time when the network finishes processing. We introduce the associated latency-aware metric, from which a ranking of networks can be determined. We perform latency timing experiments of several recent networks on different hardware and assess their performance on our proposed task. Finally, we propose improvements to scene segmentation networks to better perform on this task, by using multi-frame input and by increasing capacity in the initial convolutional layers.
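To make the proposed evaluation concrete, below is a minimal sketch of how a latency-aware mIoU could be computed: each prediction derived from frame t is scored against the ground truth of the frame that is current once processing completes, i.e. frame t + delay. The function names (`miou`, `latency_aware_miou`), the rounding of the latency to a whole number of frames, and the clamping at the end of the sequence are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def miou(pred, gt, num_classes):
    """Mean Intersection over Union between two integer label maps."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0

def latency_aware_miou(preds, gts, latency_s, fps, num_classes):
    """Score each prediction (computed from frame t) against the ground
    truth of the frame current when processing finishes (frame t + delay).

    preds, gts : sequences of label maps aligned by frame index.
    latency_s  : measured end-to-end network latency in seconds (assumed
                 constant here for simplicity).
    fps        : frame rate of the input video stream.
    """
    delay = int(round(latency_s * fps))  # latency expressed in frames
    scores = []
    for t, pred in enumerate(preds):
        target = min(t + delay, len(gts) - 1)  # clamp at sequence end
        scores.append(miou(pred, gts[target], num_classes))
    return float(np.mean(scores))
```

Under this scheme, a network with perfect per-frame accuracy but high latency is penalized, since by the time its output is available the scene has already changed; a delay of zero recovers the standard mIoU.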