Dense semantic forecasting anticipates future events in video by inferring pixel-level semantics of an unobserved future image. We present a novel approach that is applicable to various single-frame architectures and tasks. Our approach consists of two modules. The feature-to-motion (F2M) module forecasts a dense deformation field that warps past features into their future positions. The feature-to-feature (F2F) module regresses the future features directly and is therefore able to account for emergent scenery. The compound F2MF model decouples the effects of motion from the effects of novelty in a task-agnostic manner. We aim to apply F2MF forecasting to the most subsampled and most abstract representation of the desired single-frame model. Our design takes advantage of deformable convolutions and spatial correlation coefficients across neighbouring time instants. We perform experiments on three dense prediction tasks: semantic segmentation, instance-level segmentation, and panoptic segmentation. The results reveal state-of-the-art forecasting accuracy across all three tasks.
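The following is a minimal sketch of the compound F2MF idea in PyTorch, assuming features from two past frames. All module names, channel sizes, the plain bilinear warp (standing in for the paper's deformable convolutions), and the learned sigmoid gate are illustrative assumptions; the spatial correlation features are omitted for brevity.

```python
# Illustrative F2MF sketch, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

def warp(feat, flow):
    """Bilinearly warp features by a dense 2-channel displacement field."""
    n, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(feat.device)  # (2, H, W)
    coords = grid.unsqueeze(0) + flow          # sampling positions in the past frame
    # normalize coordinates to [-1, 1] as required by grid_sample
    cx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    cy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    norm_grid = torch.stack((cx, cy), dim=-1)  # (N, H, W, 2)
    return F.grid_sample(feat, norm_grid, align_corners=True)

class F2MF(nn.Module):
    def __init__(self, c):
        super().__init__()
        # shared trunk over concatenated past features
        self.trunk = nn.Sequential(
            nn.Conv2d(2 * c, c, 3, padding=1), nn.ReLU(inplace=True))
        self.f2m = nn.Conv2d(c, 2, 3, padding=1)   # dense deformation field
        self.f2f = nn.Conv2d(c, c, 3, padding=1)   # direct feature regression
        self.gate = nn.Conv2d(c, 1, 3, padding=1)  # per-pixel blend weight

    def forward(self, feat_prev, feat_curr):
        x = self.trunk(torch.cat((feat_prev, feat_curr), dim=1))
        warped = warp(feat_curr, self.f2m(x))      # motion-based forecast (F2M)
        regressed = self.f2f(x)                    # covers emergent scenery (F2F)
        w = torch.sigmoid(self.gate(x))
        return w * warped + (1.0 - w) * regressed  # compound F2MF forecast

# Usage on heavily subsampled single-frame features (e.g. 1/32 resolution):
feats_t1 = torch.randn(1, 128, 32, 64)
feats_t2 = torch.randn(1, 128, 32, 64)
future = F2MF(128)(feats_t1, feats_t2)  # forecast of the unobserved frame's features
```

The per-pixel gate makes the decoupling explicit: where the scene merely moves, the warped features dominate; where new content appears, the directly regressed features take over.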