In this paper we show that a simple, data-dependent way of setting the initial vector can substantially speed up the training of linear one-versus-all (OVA) classifiers in extreme multi-label classification (XMC). We discuss the problem of choosing the initial weights from the perspective of three goals: we want to start in a region of weight space a) with a low loss value, b) that is favourable for second-order optimization, and c) where the conjugate-gradient (CG) computations can be performed quickly. For margin losses, such an initialization is achieved by choosing the initial vector so that it separates the mean of all positive (relevant for a label) instances from the mean of all negatives; both means can be computed quickly for the highly imbalanced binary problems occurring in XMC. We demonstrate a speedup of $\approx 3\times$ for training with the squared hinge loss on a variety of XMC datasets. This comes partly from the reduced number of iterations needed because training starts closer to the solution, and partly from an implicit negative-mining effect that allows the CG step to ignore easy negatives. Because the optimization problem is convex, the speedup is achieved without any degradation in classification accuracy.
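To make the mean-separating initialization concrete, the sketch below constructs such an initial vector for a single label's binary problem. It is a minimal NumPy illustration under stated assumptions: dense features, labels in $\{+1, -1\}$, and a midpoint bias. The function name `mean_separating_init` and the exact scaling are illustrative choices, not the paper's precise formulation.

```python
import numpy as np

def mean_separating_init(X, y):
    """Initial OVA weight vector that separates the class means.

    X : (n, d) feature matrix; y : (n,) labels in {+1, -1}.
    Returns (w, b) with w @ mu_pos + b > 0 > w @ mu_neg + b.
    """
    mu_pos = X[y == 1].mean(axis=0)
    # For the highly imbalanced problems in XMC, mu_neg need not be
    # recomputed per label from all negatives: it can be derived from a
    # single precomputed global mean and the (few) positive instances.
    mu_neg = X[y == -1].mean(axis=0)
    # Direction: difference of the class means.
    w = mu_pos - mu_neg
    # Midpoint bias: the hyperplane passes halfway between the means.
    b = -0.5 * w @ (mu_pos + mu_neg)
    return w, b

# Tiny usage example on a synthetic imbalanced problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))
y = -np.ones(1000)
y[:10] = 1          # 10 positives, 990 negatives
X[y == 1] += 2.0    # shift the positives so the class means differ
w, b = mean_separating_init(X, y)
assert w @ X[y == 1].mean(axis=0) + b > 0
```

By construction, $w^\top \mu_{+} + b = \tfrac{1}{2}\lVert \mu_{+} - \mu_{-} \rVert^2 > 0$ whenever the two means differ, so the initial vector separates them as described in the abstract.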