Semantic segmentation is a well-studied topic in the computer vision literature, but designing fast and accurate video processing networks remains challenging. Moreover, to run on embedded hardware, computer vision models often have to trade accuracy for speed, so a latency/accuracy trade-off usually lies at the heart of real-time system design. In the specific case of video, models can additionally reuse computations from previous frames to mitigate the accuracy loss while remaining real-time. In this work, we tackle the task of fast future video segmentation prediction through convolutional layers with time-dependent channel masking. This technique updates only a chosen subset of the feature maps at each time-step, simultaneously reducing computation and latency while allowing the network to leverage previously computed features. We apply this technique to several fast architectures and experimentally confirm its benefits for the future prediction subtask.
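The core idea can be sketched as follows. This is a minimal, framework-free illustration under assumed details: channels are partitioned into groups by a hypothetical round-robin schedule, and at time-step `t` only the channels in group `t mod n_groups` are recomputed, while the remaining channels carry over their previously computed values. The actual masking schedule and per-channel computation in the paper may differ.

```python
def masked_update(state, compute_channel, t, n_groups):
    """Recompute only the channels whose group index matches t mod n_groups;
    all other channels keep their previously computed value (the cached state)."""
    active = t % n_groups
    return [
        compute_channel(c, t) if c % n_groups == active else prev
        for c, prev in enumerate(state)
    ]

# Toy stand-in for a per-channel convolution: channel c at time t yields 10*c + t.
compute = lambda c, t: 10 * c + t

state = [0.0] * 4                      # four channels, split into two groups
for t in range(2):
    state = masked_update(state, compute, t, n_groups=2)
# After t=0 channels {0, 2} were refreshed; after t=1 channels {1, 3} were,
# so each step costs only half the full per-frame computation.
```

With `n_groups` groups, each step computes roughly `1/n_groups` of the channels, which is the source of the latency reduction; the stale channels act as the reused features from earlier frames.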