Video Object Segmentation (VOS) is fundamental to video understanding. Transformer-based methods have shown significant performance improvements on semi-supervised VOS. However, existing work still struggles to segment visually similar objects in close proximity to each other. In this paper, we propose a novel Bilateral Attention Transformer in Motion-Appearance Neighboring space (BATMAN) for semi-supervised VOS. It captures object motion in the video via a novel optical flow calibration module that fuses the segmentation mask with the optical flow estimate to improve within-object optical flow smoothness and reduce noise at object boundaries. This calibrated optical flow is then employed in our novel bilateral attention, which computes the correspondence between the query and reference frames in the neighboring bilateral space, considering both motion and appearance. Extensive experiments validate the effectiveness of the BATMAN architecture, which outperforms all existing state-of-the-art methods on all four popular VOS benchmarks: YouTube-VOS 2019 (85.0%), YouTube-VOS 2018 (85.3%), DAVIS 2017 Val/Test-dev (86.2%/82.2%), and DAVIS 2016 (92.5%).
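To make the idea of attention restricted to a motion-appearance neighborhood more concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a single row of four pixels, a scalar per-pixel motion value standing in for the calibrated optical flow, and hard thresholds (`spatial_radius`, `motion_radius`) defining the neighborhood. The function name `bilateral_attention` and all parameters are hypothetical and chosen only for illustration.

```python
import math


def bilateral_attention(q_feats, r_feats, r_values, q_motion, r_motion,
                        positions, spatial_radius=1, motion_radius=0.5):
    """Toy masked attention (illustrative assumption, not the paper's code).

    Each query pixel attends only to reference pixels that are close both
    in image space and in a scalar motion encoding.
    """
    outputs = []
    for i, q in enumerate(q_feats):
        scores, values = [], []
        for j, k in enumerate(r_feats):
            dy = abs(positions[i][0] - positions[j][0])
            dx = abs(positions[i][1] - positions[j][1])
            near_in_space = max(dy, dx) <= spatial_radius
            near_in_motion = abs(q_motion[i] - r_motion[j]) <= motion_radius
            if near_in_space and near_in_motion:
                # Scaled dot-product appearance similarity.
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(len(q)))
                values.append(r_values[j])
        if not scores:
            # No reference pixel falls inside the neighborhood.
            outputs.append([0.0] * len(r_values[0]))
            continue
        # Numerically stable softmax over the allowed neighbors only.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        dim = len(values[0])
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)
        ])
    return outputs


# Two visually identical objects side by side: A covers pixels 0-1, B covers 2-3.
positions = [(0, 0), (0, 1), (0, 2), (0, 3)]
feats = [[1.0, 0.0]] * 4                     # identical appearance everywhere
motion = [0.0, 0.0, 2.0, 2.0]                # the two objects move differently
labels = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # one-hot A / B

labels_bilateral = bilateral_attention(
    feats, feats, labels, motion, motion, positions)
labels_spatial_only = bilateral_attention(
    feats, feats, labels, motion, motion, positions,
    motion_radius=float("inf"))
```

In this toy setup, the pixel at the boundary of object A receives a mixed label when only spatial proximity is used, because its identical-looking neighbor in object B is also attended to. Adding the motion constraint excludes that neighbor, which is the intuition behind using motion to separate similar objects that sit close together.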