Recent work has demonstrated that pre-training in-domain language models can boost performance when adapting to a new domain. However, the costs associated with pre-training raise an important question: given a fixed budget, what steps should an NLP practitioner take to maximize performance? In this paper, we study domain adaptation under budget constraints and approach it as a customer choice problem between data annotation and pre-training. Specifically, we measure the annotation cost of three procedural text datasets and the pre-training cost of three in-domain language models. We then evaluate the utility of different combinations of pre-training and data annotation under varying budget constraints to assess which strategy works best. We find that, for small budgets, spending all funds on annotation leads to the best performance; once the budget becomes large enough, a combination of data annotation and in-domain pre-training performs better. We therefore suggest that task-specific data annotation should be part of an economical strategy when adapting an NLP model to a new domain.
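The budget trade-off described above can be illustrated with a minimal sketch. It is not the procedure used in the paper: the function `best_allocation`, the toy utility curve, and every cost figure below are hypothetical placeholders chosen only to show the shape of the decision, namely spending a fixed budget on either annotation alone or on pre-training plus annotation with the remainder.

```python
import math
from typing import Callable, Iterable, Optional, Tuple


def best_allocation(
    budget: float,
    pretrain_costs: Iterable[float],
    cost_per_example: float,
    utility: Callable[[bool, int], float],
) -> Optional[Tuple[float, int, float]]:
    """Pick the pre-training option that maximizes utility within the budget.

    Each entry of `pretrain_costs` is one option (0 means no pre-training).
    Whatever is left after pre-training is spent on annotation.
    Returns (pretrain_cost, n_examples, score), or None if nothing is affordable.
    """
    best = None
    for pretrain_cost in pretrain_costs:
        if pretrain_cost > budget:
            continue  # this option alone exceeds the budget
        n_examples = int((budget - pretrain_cost) // cost_per_example)
        score = utility(pretrain_cost > 0, n_examples)
        if best is None or score > best[2]:
            best = (pretrain_cost, n_examples, score)
    return best


def toy_utility(pretrained: bool, n_examples: int) -> float:
    """Hypothetical utility: diminishing returns in data, flat bonus for pre-training."""
    return math.log(1 + n_examples) + (0.5 if pretrained else 0.0)


# Hypothetical costs: pre-training costs 1000 units, one annotated example costs 1 unit.
small = best_allocation(1500, [0, 1000], 1.0, toy_utility)
large = best_allocation(10000, [0, 1000], 1.0, toy_utility)
print("small budget:", small)
print("large budget:", large)
```

With these made-up numbers the smaller budget selects annotation only and the larger budget selects the combination. That outcome follows from how the toy utility was constructed; in practice the utility would have to come from measured task performance.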