Despite receiving significant attention from the research community, the task of segmenting and tracking objects in monocular videos still has much room for improvement. Existing works have independently justified the efficacy of dilated and deformable convolutions for various image-level segmentation tasks. This gives reason to believe that 3D extensions of such convolutions should also yield performance improvements for video-level segmentation tasks. However, this aspect has not yet been explored thoroughly in the existing literature. In this paper, we propose Dynamic Dilated Convolutions (D^2Conv3D): a novel type of convolution which draws inspiration from dilated and deformable convolutions and extends them to the 3D (spatio-temporal) domain. We experimentally show that D^2Conv3D improves the performance of multiple 3D CNN architectures across multiple video segmentation benchmarks when employed simply as a drop-in replacement for standard convolutions. We further show that D^2Conv3D outperforms trivial extensions of existing dilated and deformable convolutions to 3D. Lastly, we set a new state-of-the-art on the DAVIS 2016 Unsupervised Video Object Segmentation benchmark. Code is made publicly available at https://github.com/Schmiddo/d2conv3d.
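To make the idea concrete, below is a minimal PyTorch sketch of a deformable 3D convolution with a dilated sampling grid, in the spirit of D^2Conv3D. This is an illustrative reimplementation under simplifying assumptions (stride 1, odd cubic kernel, no modulation term, zero-initialized offsets); the class and parameter names are ours, and the official implementation in the linked repository may differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeformableDilatedConv3d(nn.Module):
    """Minimal sketch of a deformable 3D convolution on a dilated grid.

    Illustrative only; not the official D^2Conv3D implementation.
    Assumes stride 1 and an odd cubic kernel.
    """

    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=2):
        super().__init__()
        self.k = kernel_size
        self.dilation = dilation
        n_taps = kernel_size ** 3
        # A regular (dilated) conv predicts a (dz, dy, dx) offset per kernel tap.
        self.offset_conv = nn.Conv3d(
            in_ch, 3 * n_taps, kernel_size,
            padding=dilation * (kernel_size - 1) // 2, dilation=dilation)
        # Zero-init so the module starts as a plain dilated convolution.
        nn.init.zeros_(self.offset_conv.weight)
        nn.init.zeros_(self.offset_conv.bias)
        # A pointwise conv over the stacked taps plays the role of the kernel weights.
        self.weight = nn.Conv3d(in_ch * n_taps, out_ch, 1)

    def forward(self, x):
        n, c, d, h, w = x.shape
        offsets = self.offset_conv(x)  # (N, 3*K, D, H, W)
        # Base grid of output voxel centres.
        zs, ys, xs = torch.meshgrid(
            torch.arange(d, device=x.device, dtype=x.dtype),
            torch.arange(h, device=x.device, dtype=x.dtype),
            torch.arange(w, device=x.device, dtype=x.dtype),
            indexing="ij")
        taps, idx, half = [], 0, (self.k - 1) // 2
        for pz in range(-half, half + 1):
            for py in range(-half, half + 1):
                for px in range(-half, half + 1):
                    off = offsets[:, 3 * idx:3 * idx + 3]  # (N, 3, D, H, W)
                    # Dilated base position plus learned offset, per axis.
                    z = zs + self.dilation * pz + off[:, 0]
                    y = ys + self.dilation * py + off[:, 1]
                    xx = xs + self.dilation * px + off[:, 2]
                    # Normalise to [-1, 1]; grid_sample expects (x, y, z) order.
                    grid = torch.stack(
                        (2 * xx / (w - 1) - 1,
                         2 * y / (h - 1) - 1,
                         2 * z / (d - 1) - 1), dim=-1)
                    # Trilinear sampling at the deformed locations.
                    taps.append(F.grid_sample(
                        x, grid, align_corners=True, padding_mode="zeros"))
                    idx += 1
        return self.weight(torch.cat(taps, dim=1))
```

Because the module preserves the input/output interface of nn.Conv3d for matching channel counts (shape (N, C, D, H, W) in and out at stride 1), it can in principle replace standard convolutions in an existing 3D CNN without further architectural changes, which is the drop-in usage the abstract describes.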