Real-world data is high-dimensional: a book, image, or musical performance can easily contain hundreds of thousands of elements even after compression. However, the most commonly used autoregressive models, Transformers, are prohibitively expensive to scale to the number of inputs and layers needed to capture this long-range structure. We develop Perceiver AR, an autoregressive, modality-agnostic architecture which uses cross-attention to map long-range inputs to a small number of latents while also maintaining end-to-end causal masking. Perceiver AR can directly attend to over a hundred thousand tokens, enabling practical long-context density estimation without the need for hand-crafted sparsity patterns or memory mechanisms. When trained on images or music, Perceiver AR generates outputs with clear long-term coherence and structure. Our architecture also obtains state-of-the-art likelihood on long-sequence benchmarks, including 64 × 64 ImageNet images and PG-19 books.
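To make the core mechanism concrete, here is a minimal sketch (not the authors' code) of the idea the abstract describes: a single causally masked cross-attention layer that maps a long input sequence to a small number of latents. Following the paper's description, the latent queries are aligned with the final positions of the input, and the mask ensures each latent only attends to input positions at or before its own. The function name `causal_cross_attend` and all dimensions are illustrative assumptions.

```python
# Minimal sketch of Perceiver AR's causal cross-attention, in JAX.
# Assumptions (not from the paper's code): single head, random weights,
# latents initialized from the last `num_latents` input embeddings.
import jax
import jax.numpy as jnp

def causal_cross_attend(inputs, num_latents, key):
    """inputs: [seq_len, d_model]; returns [num_latents, d_model]."""
    seq_len, d_model = inputs.shape
    # Latent queries are aligned with the last `num_latents` positions,
    # so latent i sits at absolute position seq_len - num_latents + i.
    latents = inputs[-num_latents:]                      # [L, D]
    kq, kk, kv = jax.random.split(key, 3)
    w_q = jax.random.normal(kq, (d_model, d_model)) / jnp.sqrt(d_model)
    w_k = jax.random.normal(kk, (d_model, d_model)) / jnp.sqrt(d_model)
    w_v = jax.random.normal(kv, (d_model, d_model)) / jnp.sqrt(d_model)
    q = latents @ w_q                                    # [L, D]
    k = inputs @ w_k                                     # [S, D]
    v = inputs @ w_v                                     # [S, D]
    scores = q @ k.T / jnp.sqrt(d_model)                 # [L, S]
    # Causal mask: latent i may only attend to input positions
    # j <= seq_len - num_latents + i, preserving autoregressive order.
    latent_pos = jnp.arange(seq_len - num_latents, seq_len)[:, None]
    input_pos = jnp.arange(seq_len)[None, :]
    scores = jnp.where(input_pos <= latent_pos, scores, -jnp.inf)
    return jax.nn.softmax(scores, axis=-1) @ v           # [L, D]

# Example: compress a 4096-token context into 64 latents. Subsequent
# self-attention layers (omitted here) would operate only on the 64
# latents, which is what decouples compute from input length.
x = jax.random.normal(jax.random.PRNGKey(0), (4096, 128))
out = causal_cross_attend(x, num_latents=64, key=jax.random.PRNGKey(1))
print(out.shape)  # (64, 128)
```

The key design point this sketch illustrates is that the expensive attention over the full context happens once, in the cross-attention; everything downstream scales with the small latent count rather than the input length.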