One way of introducing sparsity into deep networks is by attaching an external table of parameters that is sparsely looked up at different layers of the network. By storing the bulk of the parameters in the external table, one can increase the capacity of the model without necessarily increasing the inference time. Two crucial questions in this setting are then: what is the lookup function for accessing the table, and how are the contents of the table consumed? Prominent methods for accessing the table include 1) using word/wordpiece token ids as table indices, 2) locality-sensitive hashing (LSH) of the token vector at each layer into a table of buckets, and 3) learnable softmax-style routing to a table entry. Ways to consume the contents include adding or concatenating them to the input representation, and using the contents as expert networks that specialize to different inputs. In this work, we conduct rigorous experimental evaluations of existing ideas and their combinations. We also introduce a new method, alternating updates, that enables access to an increased token dimension without increasing the computation time, and demonstrate its effectiveness in language modeling.
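The following is a minimal PyTorch sketch (not the paper's implementation) of the setting described above: an external parameter table that a layer looks up sparsely, with two of the listed access schemes, token-id indexing and learnable softmax-style routing, and with the retrieved entry consumed by adding it to the input representation. The class and parameter names (MemoryAugmentedLayer, access, table_size) are illustrative assumptions, not names from the paper.

```python
import torch
import torch.nn as nn

class MemoryAugmentedLayer(nn.Module):
    """Toy layer with an external parameter table that is sparsely looked up.

    Access schemes sketched (names are assumptions, not the paper's):
      * "token_id": the input token id directly indexes the table.
      * "router":   a learned softmax router selects the top-1 table entry.
    The retrieved vector is consumed by adding it to the hidden state.
    """

    def __init__(self, d_model: int, table_size: int, access: str = "token_id"):
        super().__init__()
        self.access = access
        # External table holding the bulk of the parameters.
        self.table = nn.Embedding(table_size, d_model)
        if access == "router":
            # Learnable softmax-style routing scores over table entries.
            self.router = nn.Linear(d_model, table_size)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, hidden: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model); token_ids: (batch, seq)
        if self.access == "token_id":
            # Access scheme 1: token ids as table indices.
            retrieved = self.table(token_ids)
        else:
            # Access scheme 3: learnable softmax routing; only the top-1 entry
            # is looked up, so the table access stays sparse. Scaling by the
            # top score keeps a gradient path to the router.
            scores = torch.softmax(self.router(hidden), dim=-1)
            top_idx = scores.argmax(dim=-1)
            top_score = scores.gather(-1, top_idx.unsqueeze(-1))
            retrieved = top_score * self.table(top_idx)
        # Consumption: add the retrieved vector to the input representation.
        return self.ffn(hidden + retrieved)


# Usage sketch: 2 sequences of length 8 over a 50k-entry table.
layer = MemoryAugmentedLayer(d_model=64, table_size=50_000, access="router")
hidden = torch.randn(2, 8, 64)
token_ids = torch.randint(0, 50_000, (2, 8))
out = layer(hidden, token_ids)
print(out.shape)  # torch.Size([2, 8, 64])
```

Because only one table entry is fetched per token, the table size can grow without a proportional increase in per-token compute, which is the capacity/inference-time trade-off discussed above.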