The optimization problem associated to fitting two-layer ReLU networks having $d$~inputs, $k$~neurons, and labels generated by a target network, is considered. Two categories of infinite families of minima, giving one minimum per $d$ and $k$, were recently found. The loss at minima belonging to the first category converges to zero as $d$ increases. In the second category, the loss remains bounded away from zero. That being so, how may one avoid minima belonging to the latter category? Fortunately, such minima are never detected by standard optimization methods. Motivated by questions concerning the nature of this phenomenon, we develop methods to study distinctive analytic properties of hidden minima. By existing analyses, the Hessian spectrum of both categories agree modulus $O(d^{-1/2})$-terms -- not promising. Thus, rather, our investigation proceeds by studying curves along which the loss is minimized or maximized, referred to as tangency arcs. We prove that pure, seemingly remote, group representation-theoretic considerations concerning the arrangement of subspaces invariant to the action of subgroups of $S_d$, the symmetry group over $d$ symbols, relative to ones fixed by the action yield a precise description of all finitely many admissible types of tangency arcs. The general results applied for the loss function reveal that arcs emanating from hidden minima differ, characteristically, by their structure and symmetry, precisely on account of the $O(d^{-1/2})$-eigenvalue terms absent in previous work, indicating the subtly of the analysis. The theoretical results, stated and proved for o-minimal structures, show that the set comprising all tangency arcs is topologically sufficiently tame, permitting a numerical construction of tangency arcs, and ultimately, a comparison of how minima from both categories are positioned relative to adjacent critical points.
翻译:暂无翻译