The remarkable success of large language models has been driven by dense models trained on massive unlabeled, unstructured corpora. These corpora typically contain text from diverse, heterogeneous sources, but information about the source of the text is rarely used during training. Transferring knowledge from such models to a target domain is typically done by continuing to train in-domain. In this paper, we introduce a method that permits adaptation to many diverse domains using a computationally efficient adapter approach. Our method is based on the observation that textual domains are partially overlapping, and we represent domains as a hierarchical tree structure in which each node is associated with a set of adapter weights. When combined with a frozen pretrained language model, this approach enables parameter sharing among related domains while avoiding negative interference between unrelated ones. Experimental results with GPT-2 and a large fraction of the 100 most represented websites in C4 show across-the-board in-domain improvements. We additionally provide an inference-time algorithm for a held-out domain and show that averaging over multiple paths through the tree enables further gains in generalization, while adding only a marginal cost to inference.
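To make the described architecture concrete, the sketch below shows one way the hierarchical adapters could be wired up in PyTorch. It is a minimal illustration under stated assumptions, not the paper's implementation: the class names (`Adapter`, `HierarchicalAdapters`), the bottleneck size, the example domain tree, and the `forward_avg` helper for multi-path averaging are all introduced here for illustration.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual connection."""

    def __init__(self, hidden_size, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))


class HierarchicalAdapters(nn.Module):
    """One adapter per tree node; a domain activates the adapters on its root-to-leaf path."""

    def __init__(self, hidden_size, parent):
        super().__init__()
        # parent maps each node name to its parent's name (None for the root).
        self.parent = parent
        self.adapters = nn.ModuleDict({name: Adapter(hidden_size) for name in parent})

    def path(self, leaf):
        # Collect the nodes from the leaf up to the root, then reverse (root first).
        nodes, node = [], leaf
        while node is not None:
            nodes.append(node)
            node = self.parent[node]
        return list(reversed(nodes))

    def forward(self, h, leaf):
        # Compose the adapters along the root-to-leaf path on top of the
        # frozen language model's hidden states h.
        for name in self.path(leaf):
            h = self.adapters[name](h)
        return h

    def forward_avg(self, h, leaves):
        # Inference-time variant for a held-out domain: average the outputs
        # obtained from several candidate paths through the tree.
        return torch.stack([self.forward(h, leaf) for leaf in leaves]).mean(dim=0)


# Hypothetical two-level hierarchy over websites (illustrative domain names).
tree = {"root": None, "news": "root", "nytimes.com": "news", "bbc.com": "news"}
adapters = HierarchicalAdapters(hidden_size=768, parent=tree)

h = torch.randn(2, 16, 768)  # stand-in for hidden states from a frozen GPT-2 layer
out_in_domain = adapters(h, "nytimes.com")                           # known domain
out_held_out = adapters.forward_avg(h, ["nytimes.com", "bbc.com"])   # unseen domain
```

In such a setup, only the adapter parameters would be trained while the GPT-2 backbone stays frozen, and `forward_avg` mirrors the idea of averaging over multiple tree paths for a held-out domain at a small additional inference cost.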