Hollmann et al. (Nature 637 (2025) 319-326) recently introduced TabPFN, a transformer-based deep learning model for regression and classification on tabular data, which they claim "outperforms all previous methods on datasets with up to 10,000 samples by a wide margin, using substantially less training time." They also call TabPFN a "foundation model" for tabular data, since it can support "data generation, density estimation, learning reusable embeddings and fine-tuning". In this paper, we explain how TabPFN works in terms tailored to a statistics audience, emphasizing its interpretation as approximate Bayesian inference. We then explore the significance of TabPFN for the field of statistics: we show that an out-of-the-box application of TabPFN can sometimes outperform specialized state-of-the-art methods for semi-supervised parameter estimation, prediction under covariate shift, and heterogeneous treatment effect estimation. As a partial explanation for the predictive effectiveness of TabPFN, we show that it can adapt simultaneously to nonparametric and parametric structure, for instance, sometimes outperforming LASSO even when assumptions are correctly specified. All experiments can be reproduced using the code provided at https://github.com/qinglong-tian/tabpfn_study.
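The approximate Bayesian inference interpretation mentioned above can be sketched in a few lines. The notation is our own assumption for illustration and is not taken from the abstract: $\pi$ is a prior over data-generating mechanisms $\varphi$, $D_n$ is the observed training set, $(x, y)$ is a test point, and $q_\theta$ is the transformer with weights $\theta$. The sketch follows the general prior-data fitted network setup and should be read as a summary under these assumptions, not as the authors' exact formulation.

```latex
% Target: the posterior predictive distribution (PPD) under the prior \pi
\begin{align*}
p(y \mid x, D_n)
  &= \int p(y \mid x, \varphi)\, \pi(\varphi \mid D_n)\, d\varphi,
  \qquad
  \pi(\varphi \mid D_n) \propto \pi(\varphi)\, p(D_n \mid \varphi).
\\[4pt]
% Training: minimize the expected negative log-likelihood on synthetic data
\hat{\theta}
  &\in \arg\min_{\theta}\;
  \mathbb{E}_{\varphi \sim \pi}\,
  \mathbb{E}_{(D_n, x, y) \sim p(\cdot \mid \varphi)}
  \bigl[ -\log q_\theta(y \mid x, D_n) \bigr].
\\[4pt]
% Why this approximates the PPD: the loss is an expected KL divergence plus a constant
\mathbb{E}\bigl[ -\log q_\theta(y \mid x, D_n) \bigr]
  &= \mathbb{E}_{(D_n, x)}\Bigl[
     \mathrm{KL}\bigl( p(\cdot \mid x, D_n) \,\big\|\, q_\theta(\cdot \mid x, D_n) \bigr)
     \Bigr] + \text{const}.
\end{align*}
```

Under this reading, the expected loss is minimized exactly when $q_\theta(\cdot \mid x, D_n)$ matches the posterior predictive distribution, so a trained network performs approximate Bayesian prediction in a single forward pass, with the quality of the approximation depending on the network's capacity and on how well it was trained.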