This work provides a comprehensive analysis of the generalization properties of Neural Operators (NOs) and their derived architectures. Through empirical evaluation of the test loss, analysis of complexity-based generalization bounds, and qualitative assessment of loss-landscape visualizations, we investigate modifications aimed at enhancing the generalization capabilities of NOs. Inspired by the success of Transformers, we propose ${\textit{s}}{\text{NO}}+\varepsilon$, which introduces a kernel integral operator in lieu of self-attention. Our results reveal significantly improved performance across datasets and initializations, accompanied by qualitative changes in the visualization of the loss landscape. We conjecture that the layout of Transformers enables the optimization algorithm to find better minima, and that stochastic depth improves generalization performance. Because a rigorous analysis of training dynamics is one of the most prominent unsolved problems in deep learning, we focus exclusively on the complexity-based generalization analysis of the architectures. Building on statistical theory, and in particular Dudley's theorem, we derive upper bounds on the Rademacher complexity of NOs and ${\textit{s}}{\text{NO}}+\varepsilon$. For the latter, our bounds do not rely on norm control of the parameters. This makes them applicable to networks of any depth, provided the random variables in the architecture follow a decay law, which connects stochastic depth with generalization, as we have conjectured. In contrast, the bounds for NOs rely solely on norm control of the parameters and exhibit an exponential dependence on depth. Furthermore, our experiments demonstrate that the proposed network exhibits remarkable generalization capabilities when subjected to perturbations in the data distribution, whereas NOs perform poorly in out-of-distribution scenarios.
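For orientation, the two statistical tools named above can be written in their standard textbook form. This is a sketch of the general statements, not the specific bounds derived in this work. The symbols $\mathcal{F}$ (a hypothesis class), $S=\{x_1,\dots,x_n\}$ (a sample), $\sigma_i$ (independent Rademacher signs), $\|\cdot\|_n$ (the empirical $L_2$ norm on $S$), and $\mathcal{N}$ (the covering number) are notation assumed here, and the numerical constants vary between references.

$$
\widehat{\mathfrak{R}}_S(\mathcal{F}) \;=\; \mathbb{E}_{\sigma}\!\left[\,\sup_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \sigma_i\, f(x_i)\right],
\qquad
\widehat{\mathfrak{R}}_S(\mathcal{F}) \;\le\; \inf_{\alpha>0}\left( 4\alpha \;+\; \frac{12}{\sqrt{n}} \int_{\alpha}^{\infty} \sqrt{\log \mathcal{N}\big(\mathcal{F},\,\delta,\,\|\cdot\|_n\big)}\;\mathrm{d}\delta \right).
$$

The second inequality is Dudley's entropy integral. It turns a bound on the covering numbers of the class into a bound on its Rademacher complexity, which is the general route by which complexity-based generalization bounds of this kind are obtained.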