We study in-context learning problems in which a Transformer is pretrained on tasks drawn from a mixture distribution $\pi=\sum_{\alpha\in\mathcal{A}} \lambda_{\alpha} \pi_{\alpha}$, called the pretraining prior, where each mixture component $\pi_{\alpha}$ is a distribution over tasks of a specific difficulty level indexed by $\alpha$. Our goal is to understand the performance of the pretrained Transformer when it is evaluated on a test distribution $\mu$ that consists of tasks of a fixed difficulty $\beta\in\mathcal{A}$ and may be shifted relative to $\pi_\beta$, subject to the chi-squared divergence $\chi^2(\mu,\pi_{\beta})$ being at most $\kappa$. In particular, we consider nonparametric regression problems with random smoothness, and multi-index models with random smoothness as well as random effective dimension. We prove that a large Transformer pretrained on sufficient data achieves the optimal rate of convergence corresponding to the difficulty level $\beta$, uniformly over test distributions $\mu$ in the chi-squared divergence ball. Thus, the pretrained Transformer achieves faster rates of convergence on easier tasks and is robust to distribution shift at test time. Finally, we prove that even if an estimator had access to the test distribution $\mu$, the convergence rate of its expected risk over $\mu$ could not be faster than that of our pretrained Transformer, thereby providing a more appropriate optimality guarantee than standard minimax lower bounds.
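For concreteness, we recall the standard definition of the chi-squared divergence appearing in the constraint above (a minimal recap, assuming $\mu$ is absolutely continuous with respect to $\pi_{\beta}$; otherwise the divergence is taken to be infinite):
\[
\chi^2(\mu,\pi_{\beta}) \;=\; \int \Bigl(\frac{d\mu}{d\pi_{\beta}}\Bigr)^{2} \, d\pi_{\beta} \;-\; 1 \;=\; \int \Bigl(\frac{d\mu}{d\pi_{\beta}} - 1\Bigr)^{2} \, d\pi_{\beta},
\]
so the admissible test distributions form the ball $\{\mu : \chi^2(\mu,\pi_{\beta}) \le \kappa\}$, over which our guarantees hold uniformly.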