This paper considers the Pointer Value Retrieval (PVR) benchmark introduced in [ZRKB21], where a 'reasoning' function acts on a string of digits to produce the label. More generally, the paper considers the learning of logical functions with gradient descent (GD) on neural networks. It is first shown that, when learning logical functions with gradient descent on symmetric neural networks, the generalization error can be lower-bounded in terms of the noise stability of the target function, supporting a conjecture made in [ZRKB21]. It is then shown that in the distribution shift setting, when the data withholding corresponds to freezing a single feature (referred to as the canonical holdout), the generalization error of gradient descent admits a tight characterization in terms of the Boolean influence for several relevant architectures. This is shown on linear models and supported experimentally on other models such as MLPs and Transformers. In particular, this puts forward the hypothesis that for such architectures and for learning logical functions such as PVR functions, GD tends to have an implicit bias towards low-degree representations, which in turn gives the Boolean influence for the generalization error under quadratic loss.
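To make the two central objects concrete, the following sketch (a toy construction for illustration, not code from the paper) defines a small PVR-style function on 6 bits, where the first two bits encode a pointer into the remaining four, and computes the Boolean influence of each coordinate by brute force over the uniform distribution:

```python
import itertools

def pvr(bits):
    # Toy PVR function: the first 2 bits encode a pointer p in {0,..,3},
    # and the label is the value stored at position 2 + p.
    p = 2 * bits[0] + bits[1]
    return bits[2 + p]

def boolean_influence(f, n, i):
    # Boolean influence of coordinate i: the probability, over a uniform
    # input x, that flipping bit i changes the output f(x).
    changed = 0
    total = 0
    for x in itertools.product([0, 1], repeat=n):
        x = list(x)
        y = x[:]
        y[i] ^= 1  # flip coordinate i
        changed += f(x) != f(y)
        total += 1
    return changed / total

n = 6
influences = [boolean_influence(pvr, n, i) for i in range(n)]
# Pointer bits have influence 1/2 (flipping one redirects the pointer,
# and the two pointed-to values disagree half the time); each value bit
# has influence 1/4 (it only matters when the pointer selects it).
print(influences)
```

Under the canonical holdout, freezing one such coordinate during training and testing on its flipped value yields a generalization error governed by exactly this influence profile, for the architectures characterized in the paper.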