Recent improvements in the predictive quality of natural language processing systems are often dependent on a substantial increase in the number of model parameters. This has led to various attempts to compress such models, but existing methods have not considered the differences in the predictive power of various model components or in the generalizability of the compressed models. To understand the connection between model compression and out-of-distribution generalization, we define the task of compressing language representation models such that they perform best in a domain adaptation setting. We choose to address this problem from a causal perspective, attempting to estimate the average treatment effect (ATE) of a model component, such as a single layer, on the model's predictions. Our proposed ATE-guided Model Compression scheme (AMoC) generates many model candidates, differing by the model components that were removed. Then, we select the best candidate through a stepwise regression model that utilizes the ATE to predict the expected performance on the target domain. AMoC outperforms strong baselines on dozens of domain pairs across three text classification and sequence tagging tasks.
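To make the pipeline described above concrete, the following is a minimal toy sketch in Python of the three steps: estimating a component's ATE as the shift in the model's output distribution when that component is removed, generating candidates by ablating one layer at a time, and selecting a candidate with a regression from ATE to expected target-domain performance. All data and names here (`full_probs`, `average_treatment_effect`, the synthetic target scores) are illustrative assumptions, not the authors' implementation.

```python
# A minimal toy sketch of the AMoC pipeline described above, assuming a
# generic layered classifier. All quantities are synthetic stand-ins and
# every name is hypothetical, not the authors' implementation.
import numpy as np
from sklearn.linear_model import LinearRegression

def average_treatment_effect(full_probs, ablated_probs):
    """ATE of a component: the average shift in the model's output
    distribution caused by removing that component (the intervention)."""
    return float(np.mean(np.abs(full_probs - ablated_probs)))

rng = np.random.default_rng(0)
n_examples, n_classes, n_layers = 200, 3, 6

# Output distribution of the full (uncompressed) model on held-out data.
full_probs = rng.dirichlet(np.ones(n_classes), size=n_examples)

# Steps 1-2: generate candidates, one per removed layer, recording each ATE.
candidates = []
for layer in range(n_layers):
    # In practice: re-run the model with `layer` ablated; here we perturb
    # the full model's probabilities to simulate the intervention.
    noisy = full_probs + rng.normal(0.0, 0.02 * (layer + 1), full_probs.shape)
    noisy = np.clip(noisy, 1e-6, None)
    ablated_probs = noisy / noisy.sum(axis=1, keepdims=True)
    candidates.append(
        {"removed_layer": layer,
         "ate": average_treatment_effect(full_probs, ablated_probs)}
    )

# Step 3: regress expected target-domain performance on the ATE and pick
# the candidate with the best predicted score. The targets below are
# synthetic; in reality they come from domain pairs with observed labels.
X = np.array([[c["ate"]] for c in candidates])
y = 0.9 - 2.0 * X[:, 0] + rng.normal(0.0, 0.01, len(candidates))

reg = LinearRegression().fit(X, y)
pred = reg.predict(X)
best = candidates[int(np.argmax(pred))]
print(f"remove layer {best['removed_layer']}: "
      f"ATE={best['ate']:.3f}, predicted target score={pred.max():.3f}")
```

Note that the paper's selection step uses a stepwise regression fit on domain pairs where target performance is observed; the plain linear fit on synthetic targets above only stands in for that step.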