Language use differs dramatically from context to context. To some degree, modern language models like GPT-3 can account for such variation by conditioning on a string of previous input text, or prompt. Yet prompting is ineffective when contexts are sparse, out-of-sample, or extra-textual, for instance when accounting for when and where a text was produced or who produced it. In this paper, we introduce the mixed-effects transformer (MET), a novel approach for learning hierarchically structured prefixes (lightweight modules prepended to the input) to account for structured variation. Specifically, we show how the popular class of mixed-effects models may be extended to transformer-based architectures using a regularized prefix-tuning procedure with dropout. We evaluate this approach on several domain-adaptation benchmarks and find that it adapts efficiently to novel contexts with minimal data while still generalizing effectively to unseen contexts.
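To make the idea concrete, the following is a minimal sketch (not the authors' implementation) of a hierarchically structured prefix module in PyTorch, assuming a standard prefix-tuning setup; the module name, dimensions, and hyperparameters (e.g. MixedEffectsPrefix, prefix_len, p_drop) are illustrative assumptions only. A shared prefix plays the role of a fixed effect, per-context offsets play the role of random effects, and dropout plus an L2 penalty on the offsets provide the regularization that lets the model fall back to the shared prefix for unseen contexts.

```python
import torch
import torch.nn as nn


class MixedEffectsPrefix(nn.Module):
    """Sketch of a mixed-effects-style prefix: shared prefix + per-context offsets."""

    def __init__(self, n_contexts: int, prefix_len: int = 10,
                 d_model: int = 768, p_drop: float = 0.1):
        super().__init__()
        # Shared ("fixed-effect") prefix tokens, used for every input.
        self.shared = nn.Parameter(torch.randn(prefix_len, d_model) * 0.02)
        # Per-context ("random-effect") offsets, one block of prefix tokens per context.
        self.context = nn.Embedding(n_contexts, prefix_len * d_model)
        nn.init.zeros_(self.context.weight)
        # Dropout on the context-specific part encourages reliance on the
        # shared prefix, which supports generalization to unseen contexts.
        self.drop = nn.Dropout(p_drop)
        self.prefix_len, self.d_model = prefix_len, d_model

    def forward(self, context_ids: torch.Tensor) -> torch.Tensor:
        # context_ids: (batch,) integer ids of the observed contexts.
        offsets = self.context(context_ids).view(-1, self.prefix_len, self.d_model)
        # Resulting prefix embeddings are prepended to the input embeddings.
        return self.shared.unsqueeze(0) + self.drop(offsets)

    def l2_penalty(self) -> torch.Tensor:
        # Shrinks context offsets toward the shared prefix, mirroring the
        # shrinkage of random effects in classical mixed-effects models.
        return self.context.weight.pow(2).sum()
```

In use, the returned prefix embeddings would be concatenated before the token embeddings of a frozen transformer, and the training loss would add a small multiple of l2_penalty(); for a context never seen during training, the zero-initialized (or dropped-out) offset leaves only the shared prefix, which is what allows graceful generalization.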