General perception systems such as Perceivers can process arbitrary modalities in any combination and are able to handle up to a few hundred thousand inputs. They achieve this generality by using exclusively global attention operations. This, however, prevents them from scaling up to the input sizes required to process raw high-resolution images or video. In this paper, we show that some degree of locality can be reintroduced into these models, greatly improving their efficiency while preserving their generality. To scale them further, we introduce a self-supervised approach that enables learning dense low-dimensional positional embeddings for very large signals. We call the resulting model a Hierarchical Perceiver (HiP). In sum, our contributions are: 1) scaling Perceiver-type models to raw high-resolution images and audio+video, 2) showing the feasibility of learning 1M+ positional embeddings from scratch using masked auto-encoding, and 3) demonstrating competitive performance on raw data from the ImageNet, AudioSet, PASCAL VOC, ModelNet40, and Kinetics datasets with the exact same, unchanged model and without specialized preprocessing or any tokenization.
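The masked-auto-encoding idea behind contribution 2 can be sketched in miniature: keep a table of per-position embeddings, hide most positions each step, and train the table so a decoder can reconstruct the signal at the hidden positions. The sketch below is our own toy construction, not the paper's code: the scale (N, D), the sine signal, and the frozen linear decoder are all illustrative stand-ins for the 1M+ embeddings and full HiP network.

```python
import numpy as np

rng = np.random.default_rng(0)

N, D = 1024, 16                           # positions and embedding dim (toy scale)
E = rng.normal(0.0, 0.02, (N, D))         # positional embeddings, learned from scratch
w = rng.normal(size=D)
w /= np.linalg.norm(w)                    # frozen unit-norm linear "decoder" (stand-in)

x = np.sin(np.linspace(0, 8 * np.pi, N))  # a smooth 1-D signal to reconstruct

lr = 0.5
for _ in range(300):
    mask = rng.random(N) < 0.85           # MAE-style: hide most positions each step
    err = (E @ w - x) * mask              # reconstruction error on masked positions only
    E -= lr * np.outer(err, w)            # gradient step on the embedding table

final_mse = np.mean((E @ w - x) ** 2)     # near zero: embeddings now encode position
```

Because gradients reach an embedding only on steps when its position is masked, every row of the table is shaped purely by the reconstruction objective, with no hand-designed (e.g. Fourier) positional features.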