Natural gradient descent has a remarkable property: in the small learning rate limit, it is invariant with respect to network reparameterizations, leading to robust training behavior even for highly covariant network parameterizations. We show that optimization algorithms with this property can be viewed as discrete approximations of natural transformations from the functor determining an optimizer's state space from the diffeomorphism group of its configuration manifold, to the functor determining that state space's tangent bundle from this group. Algorithms with this property enjoy greater efficiency when used to train poorly parameterized networks, as the network evolution they generate is approximately invariant to network reparameterizations. More specifically, the flow generated by these algorithms in the limit as the learning rate vanishes is invariant under smooth reparameterizations, with the respective flows of the parameters determined by equivariant maps. By casting this property as a natural transformation, we allow for generalizations beyond equivariance with respect to group actions. The framework can account for non-invertible maps such as projections, which permits the direct comparison of training behavior across non-isomorphic network architectures and the formal examination of limiting behavior as network size increases by considering inverse limits of these projections, should they exist. We present a simple method for introducing this naturality more generally and examine a number of popular machine learning training algorithms, finding that most are unnatural.
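As an illustration of the invariance property described above, the following minimal sketch compares one natural gradient step and one plain gradient step under an invertible linear reparameterization. Everything in it is an assumption made for illustration and is not taken from the paper: the quadratic loss (`H`, `b`), the fixed positive-definite metric `G_theta` standing in for a Fisher information matrix, the map `A`, and all function names. Because the reparameterization here is linear, the natural gradient step commutes with it exactly at a finite learning rate; for nonlinear reparameterizations the invariance holds only in the small learning rate limit.

```python
import numpy as np

# Hypothetical quadratic loss L(theta) = 0.5 * theta^T H theta - b^T theta.
H = np.diag([1.0, 2.0, 3.0])
b = np.array([1.0, 0.0, -1.0])

# Hypothetical positive-definite metric on theta-space (stand-in for a Fisher matrix).
G_theta = np.array([[2.0, 0.5, 0.0],
                    [0.5, 1.0, 0.0],
                    [0.0, 0.0, 3.0]])

# Invertible linear reparameterization theta = A @ xi (det(A) = 5).
A = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 3.0],
              [1.0, 0.0, 1.0]])


def grad_theta(theta):
    """Gradient of the loss in theta-coordinates."""
    return H @ theta - b


def natural_step_theta(theta, lr):
    """Natural gradient step taken directly in theta-coordinates."""
    return theta - lr * np.linalg.solve(G_theta, grad_theta(theta))


def natural_step_xi(xi, lr):
    """Natural gradient step in xi-coordinates, using the pulled-back metric."""
    g_xi = A.T @ grad_theta(A @ xi)   # chain rule
    G_xi = A.T @ G_theta @ A          # metric pulled back along theta = A @ xi
    return xi - lr * np.linalg.solve(G_xi, g_xi)


def plain_step_theta(theta, lr):
    """Ordinary gradient step in theta-coordinates."""
    return theta - lr * grad_theta(theta)


def plain_step_xi(xi, lr):
    """Ordinary gradient step in xi-coordinates."""
    return xi - lr * (A.T @ grad_theta(A @ xi))


xi0 = np.array([1.0, -1.0, 2.0])
theta0 = A @ xi0
lr = 0.1

# Map each xi-step back to theta-space and compare with the direct theta-step.
natural_gap = np.linalg.norm(A @ natural_step_xi(xi0, lr) - natural_step_theta(theta0, lr))
plain_gap = np.linalg.norm(A @ plain_step_xi(xi0, lr) - plain_step_theta(theta0, lr))

print("natural gradient gap:", natural_gap)
print("plain gradient gap:  ", plain_gap)
```

In this toy setting the natural gradient gap is zero up to floating-point error, while the plain gradient gap is not, since plain gradient descent implicitly uses the identity metric in whichever coordinates it is run.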