Topic model evaluation, like evaluation of other unsupervised methods, can be contentious. However, the field has coalesced around automated estimates of topic coherence, which rely on the frequency of word co-occurrences in a reference corpus. Contemporary neural topic models surpass classical ones according to these metrics. At the same time, topic model evaluation suffers from a validation gap: automated coherence, developed for classical models, has not been validated using human experimentation for neural models. In addition, a meta-analysis of topic modeling literature reveals a substantial standardization gap in automated topic modeling benchmarks. To address the validation gap, we compare automated coherence with the two most widely accepted human judgment tasks: topic rating and word intrusion. To address the standardization gap, we systematically evaluate a dominant classical model and two state-of-the-art neural models on two commonly used datasets. Automated evaluations declare a winning model when corresponding human evaluations do not, calling into question the validity of fully automatic evaluations independent of human judgments.
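A common instantiation of the automated coherence described above is average normalized pointwise mutual information (NPMI) over pairs of a topic's top words; the exact variant, window size, and reference corpus differ across papers, so the formula below is a sketch of this family of metrics rather than a definition of the specific metric evaluated here. For a topic's top-$N$ words $w_1, \ldots, w_N$,
\[
C_{\mathrm{NPMI}} \;=\; \binom{N}{2}^{-1} \sum_{i<j} \frac{\log\dfrac{P(w_i, w_j)+\varepsilon}{P(w_i)\,P(w_j)}}{-\log\bigl(P(w_i, w_j)+\varepsilon\bigr)},
\]
where the probabilities are estimated from word and word-pair frequencies (typically within sliding windows) in a reference corpus, and $\varepsilon$ is a small smoothing constant.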