Topic model evaluation, like evaluation of other unsupervised methods, can be contentious. However, the field has coalesced around automated estimates of topic coherence, which rely on the frequency of word co-occurrences in a reference corpus. Recent models relying on neural components surpass classical topic models according to these metrics. At the same time, the evaluation of neural topic models suffers from a validation gap: unlike for classical models, automated coherence has not been validated for neural models using human experimentation. In addition, as we show via a meta-analysis of the topic modeling literature, there is a substantial standardization gap in the use of automated topic modeling benchmarks. We address both the standardization gap and the validation gap. Using two of the most widely used topic model evaluation datasets, we assess a dominant classical model and two state-of-the-art neural models in a systematic, clearly documented, and reproducible way. We use automated coherence along with the two most widely accepted human judgment tasks, namely topic rating and word intrusion. Automated evaluations declare one model significantly different from another when the corresponding human evaluations find no such difference, calling into question the validity of fully automated evaluations independent of human judgments.
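The coherence estimates referenced above are typically computed as normalized pointwise mutual information (NPMI) averaged over pairs of a topic's top words, with probabilities estimated from word co-occurrence counts in a reference corpus. The sketch below illustrates that computation under simple assumptions (document-level co-occurrence, a -1 floor for word pairs that never co-occur); the function and variable names (`npmi_coherence`, `topic_words`, `documents`) and the toy corpus are illustrative, not taken from the paper's evaluation code.

```python
# Minimal sketch of NPMI-based topic coherence from a reference corpus.
# Assumptions: document-level co-occurrence counts and a -1 floor for
# pairs that never co-occur; other conventions (e.g., sliding windows)
# are also common in practice.
from itertools import combinations
from math import log

def npmi_coherence(topic_words, documents, eps=1e-12):
    """Average NPMI over all pairs of a topic's top words.

    topic_words: list of top-k words for one topic.
    documents:   list of tokenized reference documents (lists of words).
    """
    n_docs = len(documents)
    doc_sets = [set(doc) for doc in documents]

    def doc_prob(*words):
        # Fraction of reference documents containing all given words.
        return sum(all(w in ds for w in words) for ds in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = doc_prob(w1), doc_prob(w2), doc_prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # conventional floor for non-co-occurring pairs
            continue
        pmi = log(p12 / (p1 * p2 + eps))
        scores.append(pmi / (-log(p12) + eps))  # normalize PMI into [-1, 1]
    return sum(scores) / len(scores)

# Toy usage with a hypothetical reference corpus and one topic's top words.
docs = [["cat", "dog", "pet"], ["dog", "leash", "walk"], ["cat", "litter", "pet"]]
print(npmi_coherence(["cat", "dog", "pet"], docs))
```

In practice the reference corpus, co-occurrence window, and treatment of zero counts all affect the resulting scores, which is part of why standardizing these choices matters for comparing models.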