Choose a Transformer: Fourier or Galerkin

In this paper, we apply the self-attention from the state-of-the-art Transformer in "Attention Is All You Need" for the first time to a data-driven operator learning problem related to partial differential equations. We aim to explain the heuristics behind the attention mechanism and to improve its efficacy. By employing the operator approximation theory in Hilbert spaces, it is demonstrated for the first time that the softmax normalization in the scaled dot-product attention is sufficient but not necessary. Without softmax, the approximation capacity of a linearized Transformer variant can be proved to be comparable to a Petrov-Galerkin projection layer-wise, and the estimate is independent of the sequence length. A new layer normalization scheme mimicking the Petrov-Galerkin projection is proposed to allow a scaling to propagate through attention layers, which helps the model achieve remarkable accuracy in operator learning tasks with unnormalized data. Finally, we present three operator learning experiments: the viscous Burgers' equation, an interface Darcy flow, and an inverse interface coefficient identification problem. The newly proposed simple attention-based operator learner, Galerkin Transformer, shows significant improvements in both training cost and evaluation accuracy over its softmax-normalized counterparts.
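The contrast the abstract draws between softmax-normalized and softmax-free attention can be illustrated with a minimal NumPy sketch. It is one reading of the abstract, not the authors' implementation. The function names, the single-head setting, the choice to apply a parameter-free layer normalization to the keys and values, and the division by the sequence length `n` are all assumptions made for illustration. Without softmax, the product is associative, so `K^T V` (a `d x d` matrix) can be formed first at a cost of O(n d^2) instead of O(n^2 d).

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Parameter-free layer norm over the feature axis (assumed form)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax_attention(q, k, v):
    """Standard scaled dot-product attention; builds an n x n weight matrix."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def galerkin_style_attention(q, k, v):
    """Softmax-free attention (hypothetical sketch): Q (K~^T V~) / n.

    K~ and V~ are layer-normalized keys and values. The d x d product
    K~^T V~ is formed first, so the cost is linear in n.
    """
    n = q.shape[0]
    return q @ (layer_norm(k).T @ layer_norm(v)) / n


def galerkin_style_attention_quadratic(q, k, v):
    """Same quantity, evaluated in the quadratic order (Q K~^T) V~ / n."""
    n = q.shape[0]
    return (q @ layer_norm(k).T) @ layer_norm(v) / n


rng = np.random.default_rng(0)
n, d = 64, 8  # sequence length and feature dimension (arbitrary values)
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

out_soft = softmax_attention(q, k, v)
out_gal = galerkin_style_attention(q, k, v)
out_quad = galerkin_style_attention_quadratic(q, k, v)
```

Both evaluation orders of the softmax-free variant agree up to floating-point error, which is what makes the linear-cost ordering available. The softmax variant has no such reordering because the normalization acts on the full `n x n` score matrix.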


Related content

Attention mechanism


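The attention mechanism was first proposed in computer vision, but it became widely popular with the Google DeepMind paper "Recurrent Models of Visual Attention" [14], which used attention in an RNN model for image classification. Subsequently, Bahdanau et al., in "Neural Machine Translation by Jointly Learning to Align and Translate" [1], used an attention-like mechanism to perform translation and alignment jointly in machine translation; their work is regarded as the first to apply attention to NLP. Attention-based extensions of RNN models were then applied to a wide range of NLP tasks. More recently, how to use attention in CNNs has also become an active research topic. The figure below shows the general trend of attention research progress.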
