As the vocabulary size of modern word-based language models grows ever larger, many sampling-based training criteria have been proposed and investigated. The essence of these sampling methods is that the softmax-related traversal over the entire vocabulary can be simplified, yielding speedups over the baseline. A problem we notice in the current landscape of such sampling methods is the lack of a systematic comparison, along with persistent myths about preferring one method over another. In this work, we consider Monte Carlo sampling, importance sampling, a novel method we call compensated partial summation, and noise contrastive estimation. Linking back to the three traditional criteria, namely mean squared error, binary cross-entropy, and cross-entropy, we derive the theoretical solutions to the training problems. Contrary to common belief, we show that all these sampling methods can perform equally well, as long as we correct for the intended class posterior probabilities. Experimental results on language modeling and automatic speech recognition on Switchboard and LibriSpeech support our claim, with all sampling-based methods showing similar perplexities and word error rates while giving the expected speedups.
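To illustrate the core idea, the expensive part of the cross-entropy criterion is the softmax normalizer, a sum over the full vocabulary. The sketch below (a minimal illustration, not the paper's implementation; the vocabulary size, logits, proposal distribution, and sample count are all hypothetical) shows how importance sampling estimates that normalizer from a small number of sampled classes instead of a full traversal:

```python
import numpy as np

rng = np.random.default_rng(0)
V = 10000                      # hypothetical vocabulary size
logits = rng.normal(size=V)    # stand-in output scores s_v for one position

# Full softmax normalizer: Z = sum over the entire vocabulary of exp(s_v).
Z_full = np.exp(logits).sum()

# Importance-sampling estimate: Z = E_{v ~ q}[ exp(s_v) / q(v) ],
# approximated with K << V samples from a proposal q (uniform here for
# simplicity; in practice q is often a unigram distribution).
K = 2000
q = np.full(V, 1.0 / V)
idx = rng.choice(V, size=K, p=q)
Z_est = np.mean(np.exp(logits[idx]) / q[idx])

print(Z_full, Z_est)  # the two agree up to sampling noise
```

Only K terms are touched per step instead of V, which is where the speedup comes from; the same estimator plugged into the training criterion gives the sampled variants compared in the paper.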