We propose a Vision-Language Transformer (VLT) framework for referring segmentation that facilitates deep interactions among multi-modal information and enhances the holistic understanding of vision-language features. The emphasis of a language expression can be understood in different ways, especially once the expression interacts with the image. However, the learned queries in existing transformer works are fixed after training and therefore cannot cope with the randomness and huge diversity of language expressions. To address this issue, we propose a Query Generation Module, which dynamically produces multiple sets of input-specific queries to represent diverse comprehensions of the language expression. To find the best among these comprehensions and thereby generate a better mask, we propose a Query Balance Module that selectively fuses the corresponding responses of the set of queries. Furthermore, to enhance the model's ability to deal with diverse language expressions, we consider inter-sample learning to explicitly endow the model with knowledge of how different language expressions can refer to the same object. We introduce masked contrastive learning to pull together the features of different expressions for the same target object while distinguishing the features of different objects. The proposed approach is lightweight and consistently achieves new state-of-the-art referring segmentation results on five datasets.
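To make the inter-sample objective more concrete, the following is a minimal sketch of a generic supervised, InfoNCE-style contrastive loss over expression features. It is an illustration under stated assumptions, not the formulation used in VLT: the function name `contrastive_loss`, the `temperature` value, and the toy 2-D features are all hypothetical, and the masking step of masked contrastive learning is not modeled here.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def contrastive_loss(features, object_ids, temperature=0.1):
    """Generic supervised contrastive loss (an assumption, not the paper's exact loss).

    features:   one feature vector per language expression
    object_ids: the target object each expression refers to
    Expressions with the same object id are treated as positives,
    all other expressions as negatives.
    """
    feats = [normalize(f) for f in features]
    n = len(feats)
    total, count = 0.0, 0
    for i in range(n):
        # Temperature-scaled similarity of sample i to every other sample.
        logits = {j: dot(feats[i], feats[j]) / temperature
                  for j in range(n) if j != i}
        positives = [j for j in logits if object_ids[j] == object_ids[i]]
        if not positives:
            continue  # no other expression refers to this object
        # Numerically stable log-sum-exp over all other samples.
        max_logit = max(logits.values())
        log_denom = max_logit + math.log(
            sum(math.exp(v - max_logit) for v in logits.values()))
        for j in positives:
            total += log_denom - logits[j]  # -log softmax probability of the positive
            count += 1
    return total / count if count else 0.0


# Toy features: two expressions per object, grouped by direction.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
matched = contrastive_loss(features, [0, 0, 1, 1])     # ids agree with feature clusters
mismatched = contrastive_loss(features, [0, 1, 0, 1])  # ids contradict feature clusters
```

In this toy setup the loss is small when expressions for the same object already have similar features and large when they do not, which is the behavior the abstract describes as pulling together expressions for the same target while separating different objects.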