This paper investigates how to achieve better and more efficient embedding learning to tackle semi-supervised video object segmentation under challenging multi-object scenarios. State-of-the-art methods learn to decode features with a single positive object and thus must match and segment each target separately under multi-object scenarios, consuming multiple times the computing resources. To solve the problem, we propose an Associating Objects with Transformers (AOT) approach to match and decode multiple objects uniformly. In detail, AOT employs an identification mechanism to associate multiple targets into the same high-dimensional embedding space. Thus, we can simultaneously process the matching and segmentation decoding of multiple objects as efficiently as processing a single object. To sufficiently model multi-object associations, a Long Short-Term Transformer is designed to construct hierarchical matching and propagation. We conduct extensive experiments on both multi-object and single-object benchmarks to examine AOT variant networks of different complexities. In particular, our AOT-L outperforms all state-of-the-art competitors on three popular benchmarks, i.e., YouTube-VOS (83.7% J&F), DAVIS 2017 (83.0%), and DAVIS 2016 (91.0%), while maintaining a more than 3X faster multi-object runtime. Meanwhile, our AOT-T maintains real-time multi-object speed on the above benchmarks. We ranked 1st in the 3rd Large-scale Video Object Segmentation Challenge. The code will be publicly available at https://github.com/z-x-yang/AOT.
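To make the identification mechanism concrete, below is a minimal PyTorch sketch of one plausible realization: each object ID is assigned a learnable identity vector, so the masks of all objects can be embedded into a single shared feature map and matched/decoded in one pass. The class name IdentityBank, the dimensions, and the fusion step are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class IdentityBank(nn.Module):
    """Sketch of an identification mechanism: one learnable identity
    vector per object ID, projecting all object masks into a single
    shared high-dimensional embedding space."""

    def __init__(self, max_objects=10, embed_dim=256):
        super().__init__()
        # One learnable identity vector per possible object ID.
        self.id_bank = nn.Parameter(torch.randn(max_objects, embed_dim))

    def forward(self, one_hot_masks):
        # one_hot_masks: (batch, max_objects, H, W), one channel per object.
        # Map every pixel's object ID to its identity vector, producing a
        # single (batch, embed_dim, H, W) embedding covering all objects.
        return torch.einsum('bnhw,nc->bchw', one_hot_masks, self.id_bank)

# Usage: embed a 2-object reference mask and fuse it with image features,
# so matching/decoding handles both objects as cheaply as a single one.
masks = torch.zeros(1, 10, 64, 64)
masks[0, 0, :32] = 1.0   # object 1 occupies the top half
masks[0, 1, 32:] = 1.0   # object 2 occupies the bottom half
id_embedding = IdentityBank()(masks)           # (1, 256, 64, 64)
image_features = torch.randn(1, 256, 64, 64)   # e.g., backbone output
fused = image_features + id_embedding          # propagated by the transformer
```

Because the identity embedding collapses all object channels into one tensor, the downstream transformer's matching cost is independent of the number of objects, which is the efficiency argument made in the abstract.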