Vision transformers have emerged as powerful tools for many computer vision tasks. It has been shown that their features and class tokens can be used for salient object segmentation. However, the properties of segmentation transformers remain largely unstudied. In this work, we conduct an in-depth study of the spatial attentions in different backbone layers of semantic segmentation transformers and uncover interesting properties. The spatial attentions of a patch that intersects an object tend to concentrate within that object, whereas the attentions of larger, more uniform image areas instead spread out diffusively. In other words, vision transformers trained to segment a fixed set of object classes generalize to objects well beyond this set. We exploit this by extracting heatmaps that can be used to segment unknown objects within diverse backgrounds, such as obstacles in traffic scenes. Our method is training-free, and its computational overhead is negligible. We use off-the-shelf transformers trained for street-scene segmentation to process other scene types.
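As an illustration of how concentrated versus diffusive attention could be turned into a heatmap, the following is a minimal sketch that assumes Shannon entropy as the concentration measure. The abstract does not state which measure is used, so this choice, the function names `attention_entropy` and `entropy_heatmap`, and the layout of the attention matrix (one row per patch, each row holding that patch's attention weights over all patches) are assumptions made for illustration only.

```python
import math


def attention_entropy(attn_row):
    """Shannon entropy (in nats) of one patch's attention weights.

    The row is renormalized to sum to 1 before the entropy is computed.
    Assumption: entropy stands in for "how diffusive" the attention is.
    """
    total = sum(attn_row)
    if total <= 0.0:
        raise ValueError("attention row must have a positive sum")
    probs = [a / total for a in attn_row]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_heatmap(attn, grid_h, grid_w):
    """Build a grid_h x grid_w heatmap of normalized attention entropy.

    attn: list of N rows, each a list of N non-negative weights, where
          row i describes how patch i attends to all N patches and
          N == grid_h * grid_w (hypothetical layout, row-major patches).
    Returns values in [0, 1]: low = concentrated, high = diffusive.
    """
    n = grid_h * grid_w
    if len(attn) != n or any(len(row) != n for row in attn):
        raise ValueError("attn must be an N x N matrix with N = grid_h * grid_w")
    # A uniform distribution over N patches has the maximum entropy log(N).
    max_entropy = math.log(n) if n > 1 else 1.0
    normalized = [attention_entropy(row) / max_entropy for row in attn]
    return [normalized[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]
```

Under this assumption, patches with low values would be candidates for belonging to an object, and thresholding the heatmap would give a rough training-free segmentation mask. In practice the attention matrix would come from a chosen backbone layer of a pretrained segmentation transformer, possibly averaged over heads.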