Prior-data fitted networks (PFNs) are a promising alternative to time-consuming Gaussian process (GP) inference for creating fast surrogates of physical systems. PFNs reduce the computational burden of GP training by replacing Bayesian inference in the GP with a single forward pass of a learned prediction model. However, with standard Transformer attention, PFNs show limited effectiveness on high-dimensional regression tasks. We introduce Decoupled-Value Attention (DVA), motivated by the GP property that the function space is fully characterized by the kernel over the inputs and that the predictive mean is a weighted sum of the training targets. DVA computes similarities from inputs only and propagates labels solely through the values. The proposed DVA thus mirrors the GP update while remaining kernel-free. We demonstrate that PFNs are invariant to the backbone architecture and that the crucial factor for scaling PFNs is the attention rule rather than the architecture itself. Specifically, our results demonstrate that (a) localized attention consistently reduces out-of-sample validation loss in PFNs across different dimensional settings, with validation loss reduced by more than 50% in the five- and ten-dimensional cases, and (b) the role of attention is more decisive than the choice of backbone architecture: CNN-, RNN-, and LSTM-based PFNs can perform on par with their Transformer-based counterparts. The proposed PFNs approximate 64-dimensional power-flow equations with a mean absolute error on the order of 1e-3, while being over 80x faster than exact GP inference.
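The core idea of DVA described above can be sketched in a few lines: attention weights are computed from the inputs alone (queries and keys are projections of x), and the labels y enter only through the values, so the prediction is a softmax-weighted sum of training targets, analogous to the GP predictive mean. This is a minimal illustrative sketch, not the paper's implementation; the random projection matrices `W_q` and `W_k` stand in for learned parameters.

```python
import numpy as np

def dva_attention(x_train, y_train, x_test, d_k=16, seed=0):
    """Minimal sketch of Decoupled-Value Attention (DVA).

    Similarities come from inputs only; labels appear solely in the
    values, mirroring the GP predictive mean m(x*) = sum_i w_i(x*) y_i.
    """
    rng = np.random.default_rng(seed)
    d_in = x_train.shape[1]
    W_q = rng.normal(size=(d_in, d_k))  # stand-in for a learned query projection
    W_k = rng.normal(size=(d_in, d_k))  # stand-in for a learned key projection

    q = x_test @ W_q                    # queries: test inputs only, no labels
    k = x_train @ W_k                   # keys: training inputs only, no labels
    scores = q @ k.T / np.sqrt(d_k)

    # softmax over training points -> nonnegative weights summing to 1
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)

    # predictive mean: weighted sum of training targets (labels enter as values)
    return w @ y_train
```

Because the weights form a convex combination, each prediction lies inside the range of the training targets, just as a GP posterior mean with a normalized kernel would.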