Embedding learning has found widespread application in recommendation systems and natural language modeling, among other domains. To learn quality embeddings efficiently, adaptive learning rate algorithms have demonstrated superior empirical performance over SGD, largely attributed to their token-dependent learning rates. However, the mechanism underlying the efficiency of token-dependent learning rates remains underexplored. We show that incorporating token frequency information into embedding learning problems leads to provably efficient algorithms, and demonstrate that common adaptive algorithms implicitly exploit this frequency information to a large extent. Specifically, we propose (Counter-based) Frequency-aware Stochastic Gradient Descent, which applies a frequency-dependent learning rate to each token and exhibits a provable speed-up over SGD when the token distribution is imbalanced. Empirically, we show that the proposed algorithms improve upon or match adaptive algorithms on benchmark recommendation tasks and a large-scale industrial recommendation system, closing the performance gap between SGD and adaptive algorithms. Our results are the first to show that a token-dependent learning rate provably improves convergence for non-convex embedding learning problems.
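To make the idea of a counter-based, frequency-dependent learning rate concrete, here is a minimal sketch. It assumes each token keeps a visit counter and that its effective step size shrinks like `base_lr / sqrt(count)`, an AdaGrad-style scaling so rare tokens take larger steps than frequent ones; the function name `cf_sgd_step` and the exact 1/sqrt(count) schedule are illustrative assumptions, not the paper's precise algorithm.

```python
import numpy as np

def cf_sgd_step(emb, counts, token_ids, grads, base_lr=0.1):
    """One sketched step of counter-based frequency-aware SGD.

    Hypothetical illustration: each token id keeps a frequency counter,
    and its learning rate is scaled as base_lr / sqrt(count), so rarely
    seen tokens are updated more aggressively than frequent ones.
    """
    for tid, g in zip(token_ids, grads):
        counts[tid] += 1                     # bump the token's frequency counter
        lr = base_lr / np.sqrt(counts[tid])  # frequency-dependent learning rate
        emb[tid] -= lr * g                   # SGD update on this token's embedding row
    return emb, counts

# Toy usage: vocabulary of 3 tokens, 2-d embeddings.
emb = np.zeros((3, 2))
counts = np.zeros(3, dtype=int)
# Token 0 appears twice (frequent); token 2 appears once (rare).
grads = [np.ones(2), np.ones(2), np.ones(2)]
emb, counts = cf_sgd_step(emb, counts, [0, 0, 2], grads)
```

After this step, token 0's second update uses the smaller rate `0.1/sqrt(2)`, while the rare token 2 still gets the full rate `0.1`, mimicking how adaptive methods implicitly favor infrequent tokens.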