This paper presents a new statistical analysis aiming to explain the recent superior achievements of pre-training techniques in natural language processing (NLP). We prove that when the classes of the pre-training task (e.g., different words in the masked language model task) are sufficiently diverse, in the sense that the least singular value of the last linear layer in pre-training (denoted as $\tilde{\nu}$) is large, then pre-training can significantly improve the sample efficiency of downstream tasks. Specifically, we show the transfer learning excess risk enjoys an $O\left(\frac{1}{\tilde{\nu} \sqrt{n}}\right)$ rate, in contrast to the $O\left(\frac{1}{\sqrt{m}}\right)$ rate in standard supervised learning. Here, $n$ is the number of pre-training samples and $m$ is the number of samples in the downstream task, and typically $n \gg m$. Our proof relies on a vector-form Rademacher complexity chain rule for disassembling composite function classes and a modified self-concordance condition. These techniques may be of independent interest.
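A minimal numeric sketch (not part of the paper) illustrating the rate comparison stated above: the transfer-learning excess-risk bound $O\left(\frac{1}{\tilde{\nu}\sqrt{n}}\right)$ versus the standard supervised bound $O\left(\frac{1}{\sqrt{m}}\right)$. The values of `n`, `m`, and `nu_tilde` below are hypothetical placeholders chosen only to show how the two rates compare when $n \gg m$ and $\tilde{\nu}$ is bounded away from zero.

```python
# Hypothetical rate comparison; constants are omitted and the numbers below
# are illustrative assumptions, not values from the paper.
import math

n = 10**8        # assumed number of pre-training samples (n >> m)
m = 10**4        # assumed number of downstream-task samples
nu_tilde = 0.5   # assumed least singular value of the last linear layer

transfer_bound = 1.0 / (nu_tilde * math.sqrt(n))   # O(1 / (nu_tilde * sqrt(n)))
supervised_bound = 1.0 / math.sqrt(m)              # O(1 / sqrt(m))

print(f"transfer-learning bound   ~ {transfer_bound:.2e}")
print(f"standard supervised bound ~ {supervised_bound:.2e}")
# With n >> m and nu_tilde not too small, the transfer bound is orders of
# magnitude smaller, matching the claim that diverse pre-training classes
# improve downstream sample efficiency.
```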