Associating Objects with Scalable Transformers for Video Object Segmentation

This paper investigates how to learn better and more efficient embeddings for semi-supervised video object segmentation in challenging multi-object scenarios. State-of-the-art methods learn to decode features with a single positive object and therefore have to match and segment each target separately in multi-object scenarios, multiplying the computation required. To solve this problem, we propose Associating Objects with Transformers (AOT), an approach that matches and decodes multiple objects jointly and collaboratively. In detail, AOT employs an identification mechanism to associate multiple targets in the same high-dimensional embedding space. We can thus process the matching and segmentation decoding of multiple objects simultaneously, as efficiently as processing a single object. To sufficiently model multi-object association, a Long Short-Term Transformer (LSTT) is devised to construct hierarchical matching and propagation. Based on AOT, we further propose a more flexible and robust framework, Associating Objects with Scalable Transformers (AOST), in which a scalable version of LSTT enables run-time adaptation of the accuracy-efficiency trade-off. In addition, AOST introduces a better layer-wise manner of coupling identification and vision embeddings. We conduct extensive experiments on multi-object and single-object benchmarks to examine the AOT series of frameworks. Compared with state-of-the-art competitors, our methods are several times more efficient at run time while delivering superior performance. Notably, we achieve new state-of-the-art performance on three popular benchmarks, i.e., YouTube-VOS (86.5%), DAVIS 2017 Val/Test (87.0%/84.7%), and DAVIS 2016 (93.0%). Project page: https://github.com/z-x-yang/AOT.
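
As a rough illustration of the identification idea in the abstract, the sketch below maps a multi-object label map to a single per-pixel embedding, so the cost of one pass does not grow with the number of objects. It is not the authors' implementation: the names `make_id_bank` and `embed_masks`, the random slot assignment, and the zero vector for background are all assumptions made for the example.

```python
import random


def make_id_bank(max_objects, dim, seed=0):
    """Hypothetical identity bank: one random vector per identity slot."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(max_objects)]


def embed_masks(label_map, id_bank, num_objects, seed=0):
    """Turn a label map (0 = background, k = object k) into one embedding map.

    Every object is assigned a distinct identity slot (assumption: chosen at
    random), and each pixel receives the identity vector of its object.
    Background pixels receive a zero vector (also an assumption).
    """
    rng = random.Random(seed)
    slots = rng.sample(range(len(id_bank)), num_objects)
    zero = [0.0] * len(id_bank[0])
    embedding = [
        [id_bank[slots[k - 1]] if k > 0 else zero for k in row]
        for row in label_map
    ]
    return embedding, slots


# Two objects in a 2x3 frame, embedded in a single shared space.
label_map = [[0, 1, 1],
             [2, 2, 0]]
bank = make_id_bank(max_objects=10, dim=4)
emb, slots = embed_masks(label_map, bank, num_objects=2)
```

The resulting `emb` has the same shape whether the frame holds one object or several, which is the property that would let a downstream matching and decoding network handle all targets in one pass.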
