Generative language models are trained on diverse, general-domain corpora. However, this limits their applicability to narrower domains, and prior work has shown that continued in-domain training can provide further gains. In this paper, we introduce a method to scale domain adaptation to many diverse domains using a computationally efficient adapter approach. Our method is based on the observation that textual domains are partially overlapping, and we represent domains as a hierarchical tree structure in which each node is associated with a set of adapter weights. When combined with a frozen pretrained language model, this approach enables parameter sharing among related domains while avoiding negative interference between unrelated ones. The approach is efficient: its computational cost scales as O(log(D)) for D domains. Experimental results with GPT-2 and a large fraction of the 100 most represented websites in C4 show across-the-board in-domain improvements. We additionally provide an inference-time algorithm for a held-out domain and show that averaging over multiple paths through the tree yields further gains in generalization, while adding only a marginal cost to inference.
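To make the tree-of-adapters idea concrete, the sketch below shows one possible reading of it in PyTorch: each node of a domain tree owns a small bottleneck adapter, a known domain is modeled by composing the adapters on its root-to-leaf path on top of a frozen backbone layer, and a held-out domain is handled by averaging over several candidate paths. All names here (`Adapter`, `HierarchicalAdapterLayer`, the toy two-leaf tree, the bottleneck size) are illustrative assumptions, not the paper's actual code or hyperparameters.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    def __init__(self, d_model: int, d_bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, d_bottleneck)
        self.up = nn.Linear(d_bottleneck, d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(torch.relu(self.down(h)))

class HierarchicalAdapterLayer(nn.Module):
    """Wraps one frozen backbone layer and applies the adapters of every tree
    node on the path from the root to the requested domain leaf."""
    def __init__(self, backbone_layer: nn.Module, tree_paths: dict, d_model: int):
        super().__init__()
        self.backbone = backbone_layer
        for p in self.backbone.parameters():       # pretrained weights stay frozen
            p.requires_grad = False
        node_names = {n for path in tree_paths.values() for n in path}
        self.adapters = nn.ModuleDict({n: Adapter(d_model) for n in node_names})
        self.tree_paths = tree_paths               # leaf -> [root, ..., leaf]

    def forward(self, h, domain=None, paths=None):
        h = self.backbone(h)
        if domain is not None:                     # in-domain: one root-to-leaf path
            paths = [self.tree_paths[domain]]
        outs = []
        for path in paths:                         # held-out: average several paths
            z = h
            for node in path:
                z = self.adapters[node](z)
            outs.append(z)
        return torch.stack(outs).mean(0)

# Toy usage: two leaves under a shared "web" node, so only O(log D) adapters
# are active for any one domain and the "web" adapter is shared across both.
tree = {"news": ["root", "web", "news"], "forums": ["root", "web", "forums"]}
layer = HierarchicalAdapterLayer(nn.Linear(16, 16), tree, d_model=16)
x = torch.randn(2, 5, 16)
y_in = layer(x, domain="news")                     # known domain: single path
y_out = layer(x, paths=list(tree.values()))        # held-out text: average of paths
```

The averaging step at the end mirrors the inference-time procedure described in the abstract: multiple tree paths are evaluated and their outputs combined, so generalization to unseen domains costs only a few extra adapter forward passes rather than a new model.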