Current NLP models are predominantly trained through a two-stage "pre-train, then fine-tune" pipeline. Prior work has shown that inserting an intermediate pre-training stage, using heuristic masking policies for masked language modeling (MLM), can significantly improve final performance. However, it remains unclear (1) in what cases such intermediate pre-training is helpful, (2) whether hand-crafted heuristic objectives are optimal for a given task, and (3) whether a masking policy designed for one task generalizes beyond that task. In this paper, we perform a large-scale empirical study to investigate the effect of various masking policies in intermediate pre-training with nine selected tasks across three categories. Crucially, we introduce methods to automate the discovery of optimal masking policies via direct supervision or meta-learning. We conclude that the success of intermediate pre-training depends on an appropriate pre-training corpus, the choice of output format (i.e., masked spans or full sentences), and a clear understanding of the role that MLM plays for the downstream task. In addition, we find that our learned masking policies outperform the heuristic of masking named entities on TriviaQA, and that policies learned from one task can positively transfer to other tasks in certain cases, inviting future research in this direction.
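To make the notion of a masking policy concrete, below is a minimal sketch (not the authors' implementation) of the named-entity masking heuristic referenced above, emitting T5-style masked-span targets. It assumes spaCy with the en_core_web_sm model is available; the exact entity spans depend on the tagger.

```python
# Minimal sketch of a heuristic masking policy for MLM-style
# intermediate pre-training: replace each named entity with a
# T5 sentinel token and collect the masked spans as the target.
# Assumes spaCy and its "en_core_web_sm" model are installed.
import spacy

nlp = spacy.load("en_core_web_sm")

def mask_named_entities(text: str):
    """Return (corrupted input, target of masked spans)."""
    doc = nlp(text)
    source, target = [], []
    last = 0
    for i, ent in enumerate(doc.ents):
        sentinel = f"<extra_id_{i}>"          # T5 sentinel token
        source.append(text[last:ent.start_char] + sentinel)
        target.append(f"{sentinel} {ent.text}")
        last = ent.end_char
    source.append(text[last:])
    return "".join(source), " ".join(target)

src, tgt = mask_named_entities("Barack Obama was born in Hawaii in 1961.")
# src: "<extra_id_0> was born in <extra_id_1> in <extra_id_2>."
# tgt: "<extra_id_0> Barack Obama <extra_id_1> Hawaii <extra_id_2> 1961"
```

A learned policy would replace the hand-crafted entity rule with a model that scores which tokens to mask, trained via direct supervision or meta-learning as described above; a "full sentence" output format would instead use the uncorrupted text as the target rather than only the masked spans.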