General perception systems such as Perceivers can process arbitrary modalities in any combination and are able to handle up to a few hundred thousand inputs. They achieve this generality by exclusively using global attention operations. This, however, hinders them from scaling up to the input sizes required to process raw high-resolution images or video. In this paper, we show that some degree of locality can be introduced back into these models, greatly improving their efficiency while preserving their generality. To scale them further, we introduce a self-supervised approach that enables learning dense low-dimensional positional embeddings for very large signals. We call the resulting model a Hierarchical Perceiver (HiP). HiP retains the ability to process arbitrary modalities, but now at higher resolution and without any specialized preprocessing, improving over flat Perceivers in both efficiency and accuracy on the ImageNet, AudioSet and PASCAL VOC datasets.
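To make the locality idea concrete, the sketch below illustrates one way block-local cross-attention can shorten a long flat input: the sequence is split into groups, each group independently cross-attends onto a small set of latents, and the per-group latents are concatenated into a much shorter sequence for the next level of the hierarchy. This is a minimal NumPy illustration under assumed shapes, group counts, and latent sizes; it omits learned projections and is not the paper's implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(latents, inputs):
    # latents: (num_latents, d) queries; inputs: (group_len, d) keys/values.
    # Projections are omitted for brevity; this is only a sketch.
    d = latents.shape[-1]
    scores = latents @ inputs.T / np.sqrt(d)       # (num_latents, group_len)
    return softmax(scores) @ inputs                # (num_latents, d)


def local_block(x, num_groups, num_latents):
    # x: (n, d) flattened inputs; n is assumed divisible by num_groups.
    n, d = x.shape
    groups = x.reshape(num_groups, n // num_groups, d)
    latents = np.random.randn(num_latents, d) * 0.02   # learned in practice
    # Attention is local: each group attends only to its own inputs.
    out = np.stack([cross_attention(latents, g) for g in groups])
    return out.reshape(num_groups * num_latents, d)     # shorter sequence


# Toy usage: 4096 inputs -> 16 groups x 8 latents = 128 outputs per level.
x = np.random.randn(4096, 64)
level1 = local_block(x, num_groups=16, num_latents=8)
print(level1.shape)  # (128, 64)
```

Because each group attends only to its own slice of the input, the cost of a level grows with the group size rather than the full input length, which is what allows scaling to raw high-resolution signals while keeping the modality-agnostic flat-input interface.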