Recurrent neural networks have a strong inductive bias towards learning temporally compressed representations, as the entire history of a sequence is represented by a single vector. By contrast, Transformers have little inductive bias towards learning temporally compressed representations, as they allow for attention over all previously computed elements in a sequence. Having a more compressed representation of a sequence may be beneficial for generalization, as a high-level representation may be more easily re-used and re-purposed and will contain fewer irrelevant details. At the same time, excessive compression of representations comes at the cost of expressiveness. We propose a solution that divides computation into two streams. A slow stream that is recurrent in nature aims to learn a specialized and compressed representation by forcing chunks of $K$ time steps into a single representation, which is itself divided into multiple vectors. At the same time, a fast stream, parameterized as a Transformer, processes chunks of $K$ time steps conditioned on the information in the slow stream. With the proposed approach we hope to gain the expressiveness of the Transformer while encouraging better compression and structuring of representations in the slow stream. We show the benefits of the proposed method in terms of improved sample efficiency and generalization performance, compared to various competitive baselines, on visual perception and sequential decision-making tasks.
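To make the two-stream structure concrete, below is a minimal PyTorch sketch of the idea as described above. All names (TwoStreamSketch, read_slow, write_slow), the chunk size, and the number of slow vectors are illustrative assumptions, not the paper's reference implementation or configuration.

```python
# Minimal sketch (hypothetical names): a fast Transformer processes each chunk
# of K steps conditioned on a slow stream of N_s vectors; the slow stream is
# updated only once per chunk, compressing the chunk into those vectors.
import torch
import torch.nn as nn


class TwoStreamSketch(nn.Module):
    def __init__(self, d_model=64, n_heads=4, n_slow=8, chunk_size=16):
        super().__init__()
        self.chunk_size = chunk_size
        # Fast stream: self-attention over the K steps of a chunk.
        self.fast_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        # Fast stream reads the slow stream via cross-attention.
        self.read_slow = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Slow-stream update: compress the processed chunk into the slow vectors.
        self.write_slow = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Learned initial state of the N_s slow vectors.
        self.slow_init = nn.Parameter(torch.randn(1, n_slow, d_model))

    def forward(self, x):
        # x: (batch, seq_len, d_model), with seq_len a multiple of chunk_size.
        batch = x.size(0)
        slow = self.slow_init.expand(batch, -1, -1)
        outputs = []
        for chunk in x.split(self.chunk_size, dim=1):
            # Fast stream: process the chunk, then condition on the slow stream.
            h = self.fast_layer(chunk)
            h = h + self.read_slow(h, slow, slow)[0]
            # Slow stream: one recurrent update per chunk, attending to the
            # chunk's representations (temporal compression into N_s vectors).
            slow = slow + self.write_slow(slow, h, h)[0]
            outputs.append(h)
        return torch.cat(outputs, dim=1), slow


# Usage: a 64-step sequence processed as 4 chunks of K=16 steps.
model = TwoStreamSketch()
fast_out, slow_state = model(torch.randn(2, 64, 64))
```

The key design choice the sketch illustrates is the asymmetry in update rates: the fast stream attends within a chunk at every step, while the slow stream changes only at chunk boundaries, which is what imposes the temporal compression.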