Transformers have become prevalent in computer vision due to their performance and flexibility in modelling complex operations. Of particular significance is the 'cross-attention' operation, which allows a vector representation (e.g. of an object in an image) to be learned by attending to an arbitrarily sized set of input features. Recently, "Masked Attention" was proposed in which a given object representation only attends to those image pixel features for which the segmentation mask of that object is active. This specialization of attention proved beneficial for various image and video segmentation tasks. In this paper, we propose another specialization of attention which enables attending over 'soft-masks' (those with continuous mask probabilities instead of binary values), and is also differentiable through these mask probabilities, thus allowing the mask used for attention to be learned within the network without requiring direct loss supervision. This can be useful for several applications. Specifically, we employ our "Differentiable Soft-Masked Attention" for the task of Weakly-Supervised Video Object Segmentation (VOS), where we develop a transformer-based network for VOS which only requires a single annotated image frame for training, but can also benefit from cycle consistency training on a video with just one annotated frame. Although there is no loss for masks in unlabeled frames, the network is still able to segment objects in those frames due to our novel attention formulation.
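To make the core idea concrete, the following is a minimal NumPy sketch of attention over a soft mask, assuming the continuous mask probabilities are folded into the attention logits in log space, which keeps the operation differentiable with respect to the mask. The function name, shapes, and exact parameterization are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_masked_attention(query, keys, values, mask_probs, eps=1e-6):
    """Illustrative soft-masked cross-attention (hypothetical formulation).

    query:      (d,)    object query vector
    keys:       (N, d)  per-pixel key features
    values:     (N, d)  per-pixel value features
    mask_probs: (N,)    continuous mask probabilities in (0, 1)

    The soft mask enters the attention logits additively in log space, so
    pixels with low mask probability are down-weighted rather than hard-masked,
    and the output remains differentiable w.r.t. mask_probs.
    """
    d = query.shape[-1]
    logits = keys @ query / np.sqrt(d)           # (N,) scaled dot-product scores
    logits = logits + np.log(mask_probs + eps)   # soft masking in log space
    weights = softmax(logits)                    # attention distribution over pixels
    return weights @ values                      # (d,) attended object feature

# Example usage with random features and a random soft mask.
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K = rng.standard_normal((100, 64))
V = rng.standard_normal((100, 64))
m = rng.uniform(size=100)                        # soft mask probabilities
out = soft_masked_attention(q, K, V, m)          # shape (64,)
```

In this sketch, setting a pixel's mask probability near zero drives its attention weight toward zero (recovering hard masked attention in the limit), while intermediate probabilities yield a graded contribution, which is what allows gradients to flow back into the mask estimates even for frames without direct mask supervision.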