The most successful methods, such as ReLU transfer functions, batch normalization, Xavier initialization, dropout, learning rate decay, or dynamic optimizers, have become standards in the field in particular because of their ability to increase the performance of Neural Networks (NNs) significantly and in almost all situations. Here we present a new method for calculating gradients while training NNs, and show that it significantly improves final performance across architectures, data-sets, hyper-parameter values, training lengths, and model sizes, including when it is combined with other common performance-improving methods (such as the ones mentioned above). Besides being effective in the wide array of situations that we have tested, the increase in performance (e.g. F1) it provides is as high as or higher than that of all the other widespread performance-improving methods that we have compared against. We call our method Population Gradients (PG), and it consists of using a population of NNs to calculate a non-local estimation of the gradient, which is closer to the theoretical exact gradient (i.e. the one obtainable only with an infinitely large data-set) of the error function than the empirical gradient (i.e. the one obtained with the real, finite data-set).
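To make the idea of a population-based, non-local gradient estimate concrete, below is a minimal sketch in PyTorch that averages the empirical gradient over several Gaussian-perturbed copies of the network (a "population") before updating the original weights. The function name `population_gradient_step`, the perturbation scale `sigma`, and the population size are illustrative assumptions for exposition, not the paper's exact estimator.

```python
import copy
import torch
import torch.nn as nn

def population_gradient_step(model, loss_fn, x, y, lr=1e-2,
                             population_size=8, sigma=0.05):
    """One update using a gradient averaged over a population of perturbed NNs.

    Instead of evaluating the gradient only at the current weights, the loss
    gradient is computed at `population_size` Gaussian-perturbed copies of the
    network and averaged, giving a smoothed, non-local estimate (illustrative
    reading of Population Gradients; the published estimator may differ).
    """
    avg_grads = [torch.zeros_like(p) for p in model.parameters()]

    for _ in range(population_size):
        member = copy.deepcopy(model)
        # Sample a nearby point in weight space for this population member.
        with torch.no_grad():
            for p in member.parameters():
                p.add_(sigma * torch.randn_like(p))
        loss = loss_fn(member(x), y)
        grads = torch.autograd.grad(loss, list(member.parameters()))
        for a, g in zip(avg_grads, grads):
            a.add_(g / population_size)

    # Apply the averaged (non-local) gradient to the original weights.
    with torch.no_grad():
        for p, g in zip(model.parameters(), avg_grads):
            p.sub_(lr * g)

# Toy usage on a random regression batch:
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
x, y = torch.randn(64, 10), torch.randn(64, 1)
population_gradient_step(model, nn.MSELoss(), x, y)
```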