Unsupervised pretraining, which learns a useful representation from a large amount of unlabeled data to facilitate the learning of downstream tasks, is a critical component of modern large-scale machine learning systems. Despite its tremendous empirical success, rigorous theoretical understanding of why unsupervised pretraining generally helps remains rather limited -- most existing results are restricted to particular methods or approaches for unsupervised pretraining with specialized structural assumptions. This paper studies a generic framework in which the unsupervised representation learning task is specified by an abstract class of latent variable models $\Phi$ and the downstream task is specified by a class of prediction functions $\Psi$. We consider a natural approach that uses Maximum Likelihood Estimation (MLE) for unsupervised pretraining and Empirical Risk Minimization (ERM) for learning downstream tasks. We prove that, under a mild ``informative'' condition, our algorithm achieves an excess risk of $\tilde{\mathcal{O}}(\sqrt{\mathcal{C}_\Phi/m} + \sqrt{\mathcal{C}_\Psi/n})$ for downstream tasks, where $\mathcal{C}_\Phi, \mathcal{C}_\Psi$ are complexity measures of the function classes $\Phi, \Psi$, and $m, n$ are the numbers of unlabeled and labeled samples, respectively. Compared with the baseline of $\tilde{\mathcal{O}}(\sqrt{\mathcal{C}_{\Phi \circ \Psi}/n})$ achieved by performing supervised learning using only the labeled data, our result rigorously shows the benefit of unsupervised pretraining when $m \gg n$ and $\mathcal{C}_{\Phi \circ \Psi} > \mathcal{C}_\Psi$. This paper further shows that our generic framework covers a wide range of approaches for unsupervised pretraining, including factor models, Gaussian mixture models, and contrastive learning.
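The two-stage pipeline above (MLE pretraining on unlabeled data, then ERM on labeled data in the learned representation) can be sketched concretely for the factor-model instance. In the sketch below, all dimensions, noise levels, and the use of PCA as the pretraining step are illustrative assumptions, not the paper's algorithm: under an isotropic-Gaussian linear factor model the MLE of the loading matrix is recovered, up to rotation, by the top principal directions, so PCA stands in for the MLE stage.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a linear factor model as the latent-variable class Phi.
# Unlabeled data x = W s + noise; labels depend only on the latent s.
d, k, m, n = 20, 3, 5000, 100   # ambient dim, latent dim, # unlabeled, # labeled
W_true = rng.normal(size=(d, k))
beta_true = rng.normal(size=k)

def sample_x(num):
    """Draw `num` observations x = W s + noise together with latents s."""
    s = rng.normal(size=(num, k))
    return s @ W_true.T + 0.1 * rng.normal(size=(num, d)), s

# Stage 1: "MLE" pretraining on m unlabeled points.
# PCA (top-k right singular vectors) plays the role of the MLE under a
# Gaussian factor model, giving the representation map.
X_unlab, _ = sample_x(m)
X_unlab = X_unlab - X_unlab.mean(axis=0)
_, _, Vt = np.linalg.svd(X_unlab, full_matrices=False)
encode = lambda X: X @ Vt[:k].T   # learned k-dimensional representation

# Stage 2: ERM on n labeled points, here least squares over linear
# predictors (the downstream class Psi) in the learned feature space.
X_lab, S_lab = sample_x(n)
y = S_lab @ beta_true + 0.05 * rng.normal(size=n)
Z = encode(X_lab)
theta, *_ = np.linalg.lstsq(Z, y, rcond=None)

# Evaluate the downstream excess risk on fresh data.
X_test, S_test = sample_x(2000)
y_test = S_test @ beta_true
mse = float(np.mean((encode(X_test) @ theta - y_test) ** 2))
print(f"downstream test MSE: {mse:.4f}")
```

Since the representation is fit on $m \gg n$ unlabeled points, the downstream ERM only needs to learn a $k$-dimensional linear head from the $n$ labeled points, mirroring the $\sqrt{\mathcal{C}_\Phi/m} + \sqrt{\mathcal{C}_\Psi/n}$ decomposition.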