In this paper, we propose a highly parameter-efficient approach to scaling pre-trained language models (PLMs) to greater model depth. Unlike prior work that shares all parameters or introduces extra blocks, we design a more capable parameter-sharing architecture based on the matrix product operator (MPO). MPO decomposition reorganizes and factorizes a parameter matrix into two parts: a central tensor that contains the bulk of the information and auxiliary tensors that hold only a small proportion of the parameters. Based on this decomposition, our architecture shares the central tensor across all layers to reduce the model size, while keeping layer-specific auxiliary tensors (together with adapters) to enhance adaptation flexibility. To improve model training, we further propose a stable initialization algorithm tailored to the MPO-based architecture. Extensive experiments demonstrate the effectiveness of the proposed model in reducing model size while achieving highly competitive performance.
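As a rough illustration only (not the paper's implementation), the sketch below factorizes a weight matrix into an MPO-style (tensor-train) chain of local tensors via successive truncated SVDs; the middle core carries most of the parameters and plays the role of the central tensor, while the small boundary cores correspond to the auxiliary tensors. The function name `mpo_decompose`, the mode shapes, and the bond ranks are illustrative assumptions.

```python
import numpy as np

def mpo_decompose(W, in_shape, out_shape, ranks):
    """Sketch: factor W (prod(in_shape) x prod(out_shape)) into a chain of
    local tensors (MPO / tensor-train format) via successive truncated SVDs.
    The middle core is the parameter-heavy "central tensor"; the boundary
    cores are the small "auxiliary tensors". Shapes/ranks are illustrative."""
    n = len(in_shape)
    # Reorder W into a tensor whose modes are paired as (in_k, out_k).
    T = W.reshape(*in_shape, *out_shape)
    perm = [x for k in range(n) for x in (k, n + k)]
    T = T.transpose(perm)

    cores, bond = [], 1
    for k in range(n - 1):
        # Peel off the k-th local tensor with a truncated SVD.
        T = T.reshape(bond * in_shape[k] * out_shape[k], -1)
        U, S, Vt = np.linalg.svd(T, full_matrices=False)
        r = min(ranks[k], len(S))
        cores.append(U[:, :r].reshape(bond, in_shape[k], out_shape[k], r))
        T = np.diag(S[:r]) @ Vt[:r]
        bond = r
    cores.append(T.reshape(bond, in_shape[-1], out_shape[-1], 1))
    return cores

# Example: a 768 x 3072 feed-forward weight (shapes chosen for illustration).
W = np.random.randn(768, 3072)
cores = mpo_decompose(W, in_shape=(8, 12, 8), out_shape=(8, 48, 8), ranks=(16, 16))
central, auxiliary = cores[1], [cores[0], cores[2]]
# central holds ~147K parameters; each auxiliary core holds ~1K, so sharing
# `central` across layers while keeping per-layer `auxiliary` tensors is cheap.
```

In this sketch, cross-layer sharing would amount to reusing the single `central` core in every layer and learning only the small per-layer auxiliary cores, which conveys the intuition behind the parameter-sharing architecture described above.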