Pre-train or Annotate? Domain Adaptation with a Constrained Budget

Recent work has demonstrated that pre-training in-domain language models can boost performance when adapting to a new domain. However, the costs associated with pre-training raise an important question: given a fixed budget, what steps should an NLP practitioner take to maximize performance? In this paper, we study domain adaptation under budget constraints, and approach it as a customer choice problem between data annotation and pre-training. Specifically, we measure the annotation cost of three procedural text datasets and the pre-training cost of three in-domain language models. We then evaluate the utility of different combinations of pre-training and data annotation under varying budget constraints to assess which strategy works best. We find that, for small budgets, spending all funds on annotation leads to the best performance; once the budget becomes large enough, a combination of data annotation and in-domain pre-training works better. We therefore suggest that task-specific data annotation should be part of an economical strategy when adapting an NLP model to a new domain.
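The budget trade-off described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not the paper's method or data: the function names, the fixed pre-training cost, the per-example annotation cost, and especially the logarithmic `toy_score` curve are all hypothetical. The sketch only shows the shape of the decision: with a fixed budget, either spend everything on annotation, or pay for pre-training first and annotate with what remains.

```python
import math


def annotated_examples(budget, cost_per_example):
    """Number of examples that can be annotated with the given budget."""
    return int(budget // cost_per_example)


def toy_score(n_examples, pretrained):
    """Hypothetical utility curve (NOT from the paper).

    Assumes diminishing returns in the number of annotated examples
    and a fixed bonus when an in-domain pre-trained model is used.
    """
    base = 0.10 * math.log1p(n_examples)
    bonus = 0.15 if pretrained else 0.0
    return base + bonus


def best_strategy(budget, pretrain_cost, cost_per_example):
    """Compare 'annotate only' with 'pre-train, then annotate' under one budget."""
    options = {
        "annotate_only": toy_score(
            annotated_examples(budget, cost_per_example), pretrained=False
        )
    }
    # Pre-training is only an option if the budget covers its cost.
    if budget >= pretrain_cost:
        remaining = budget - pretrain_cost
        options["pretrain_plus_annotate"] = toy_score(
            annotated_examples(remaining, cost_per_example), pretrained=True
        )
    best = max(options, key=options.get)
    return best, options


if __name__ == "__main__":
    # Hypothetical costs in arbitrary currency units.
    for budget in (500, 1100, 5000):
        name, scores = best_strategy(budget, pretrain_cost=1000, cost_per_example=1.0)
        print(budget, name, scores)
```

Under these made-up numbers, the preferred option switches from annotation only to the combined strategy as the budget grows, which mirrors the qualitative finding stated above; the actual crossover point depends entirely on measured costs and learning curves.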

