The most common ways to explore latent document dimensions are topic models and clustering methods. Topic models, however, have several drawbacks: for example, they require the number of latent dimensions to be chosen a priori, and their results are stochastic. Most clustering methods share these issues and lack flexibility in various ways, such as not accounting for the influence of multiple topics on a single document, forcing word descriptors to belong to a single topic (hard clustering), or necessarily relying on word representations. We propose PROgressive SImilarity Thresholds (ProSiT), a deterministic and interpretable method, agnostic to the input format, that finds the optimal number of latent dimensions and has only two hyper-parameters, which can be set efficiently via grid search. We compare this method with a wide range of topic models and clustering methods on four benchmark data sets. In most settings, ProSiT matches or outperforms the other methods on six metrics of topic coherence and distinctiveness, while producing replicable, deterministic results.
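The two-hyper-parameter grid search mentioned above can be sketched as follows. This is a minimal illustration, not ProSiT's actual implementation: the names `tau_init`, `tau_final`, and the toy scoring function are hypothetical stand-ins for the method's real thresholds and its coherence/distinctiveness objective.

```python
from itertools import product

def grid_search(score_fn, tau_init_grid, tau_final_grid):
    """Return the (tau_init, tau_final) pair that maximizes score_fn.

    Because the search is an exhaustive sweep over a fixed grid and
    score_fn is deterministic, the result is fully reproducible.
    """
    return max(product(tau_init_grid, tau_final_grid),
               key=lambda pair: score_fn(*pair))

# Toy stand-in score: peaks at (0.5, 0.3); a real run would evaluate
# topic coherence/distinctiveness for each threshold pair instead.
toy_score = lambda a, b: -((a - 0.5) ** 2 + (b - 0.3) ** 2)

best = grid_search(toy_score, [0.3, 0.5, 0.7], [0.1, 0.3, 0.5])
# best is the threshold pair with the highest score on this grid.
```

Since only two hyper-parameters are swept, the search cost grows with the product of the two grid sizes, which stays small in practice.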