This paper presents a novel cost aggregation network, called Volumetric Aggregation with Transformers (VAT), for few-shot segmentation. The use of transformers can benefit correlation map aggregation through self-attention over a global receptive field. However, the tokenization of a correlation map for transformer processing can be detrimental, because the discontinuity at token boundaries reduces the local context available near the token edges and decreases inductive bias. To address this problem, we propose a 4D Convolutional Swin Transformer, where a high-dimensional Swin Transformer is preceded by a series of small-kernel convolutions that impart local context to all pixels and introduce convolutional inductive bias. We additionally boost aggregation performance by applying transformers within a pyramidal structure, where aggregation at a coarser level guides aggregation at a finer level. Noise in the transformer output is then filtered in the subsequent decoder with the help of the query's appearance embedding. With this model, a new state-of-the-art is set for all the standard benchmarks in few-shot segmentation. It is shown that VAT attains state-of-the-art performance for semantic correspondence as well, where cost aggregation also plays a central role.
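The core design above (small-kernel convolutions injecting local context and convolutional inductive bias before windowed self-attention over the correlation map) can be sketched as follows. This is a minimal, hypothetical 2D analogue written for illustration, not the paper's actual 4D implementation; the class name, dimensions, and window size are all assumptions.

```python
import torch
import torch.nn as nn

class ConvSwinBlock2D(nn.Module):
    """Illustrative 2D analogue of VAT's 4D Convolutional Swin Transformer
    block: a small-kernel convolution smooths token boundaries and adds
    convolutional inductive bias, then non-overlapping window self-attention
    aggregates the correlation features (hypothetical simplification)."""
    def __init__(self, dim=16, window=4, heads=2):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)  # local context
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.window = window

    def forward(self, x):                      # x: (B, C, H, W) correlation features
        x = self.conv(x)                       # impart local context before tokenization
        B, C, H, W = x.shape
        w = self.window
        # partition into non-overlapping w x w windows -> (B * num_windows, w*w, C)
        t = x.view(B, C, H // w, w, W // w, w).permute(0, 2, 4, 3, 5, 1)
        t = t.reshape(-1, w * w, C)
        t = self.norm(t)
        t = t + self.attn(t, t, t, need_weights=False)[0]  # windowed self-attention
        # reverse the window partition back to (B, C, H, W)
        t = t.view(B, H // w, W // w, w, w, C).permute(0, 5, 1, 3, 2, 4)
        return t.reshape(B, C, H, W)

# Usage sketch: spatial sides must be divisible by the window size.
block = ConvSwinBlock2D(dim=16, window=4, heads=2)
corr = torch.randn(2, 16, 8, 8)
out = block(corr)  # same shape as the input correlation map
```

In the paper this operates on 4D correlation volumes within a pyramid, with coarse-level aggregation guiding finer levels; the sketch keeps only the conv-then-windowed-attention ordering that motivates the design.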