Traditionally, Latent Dirichlet Allocation (LDA) ingests the words in a collection of documents to discover their latent topics using word-document co-occurrences. However, it is unclear how to achieve the best results for languages without marked word boundaries, such as Chinese and Thai. Here, we explore the use of Pearson's chi-squared test, t-statistics, and Word Pair Encoding (WPE) to produce tokens as input to the LDA model. The chi-squared, t, and WPE tokenizers are trained on Wikipedia text to identify words that should be grouped together, such as compound nouns, proper nouns, and complex event verbs. We propose a new metric for measuring clustering quality in settings where the vocabularies of the models differ. Based on this metric and other established metrics, we show that models trained with merged tokens produce topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
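For concreteness, the following is a minimal Python sketch of the statistical merging idea described above: scoring adjacent token pairs over corpus counts with Pearson's chi-squared statistic and the t-statistic (as in standard collocation detection), then joining high-scoring pairs into single tokens before they are passed to LDA. The function names, the cutoff values, and the greedy left-to-right merge pass are illustrative assumptions, not the paper's exact procedure.

```python
from collections import Counter

def bigram_statistics(tokens):
    """Score each adjacent token pair with Pearson's chi-squared statistic
    and the t-statistic over corpus counts. Returns {bigram: (chi2, t)}."""
    n = len(tokens) - 1                          # number of bigram positions
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), o11 in bigrams.items():
        c1, c2 = unigrams[w1], unigrams[w2]
        o12 = c1 - o11                           # w1 followed by another word
        o21 = c2 - o11                           # another word followed by w2
        o22 = n - o11 - o12 - o21                # neither w1 nor w2
        # Pearson's chi-squared on the 2x2 contingency table
        denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
        chi2 = n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0
        # t-statistic: observed bigram probability vs. independence baseline
        p_obs = o11 / n
        p_exp = (c1 / n) * (c2 / n)
        t = (p_obs - p_exp) / (p_obs / n) ** 0.5
        scores[(w1, w2)] = (chi2, t)
    return scores

def merge_tokens(tokens, scores, chi2_cutoff=3.84, t_cutoff=2.576):
    """Greedily join adjacent pairs whose scores pass both cutoffs
    (illustrative thresholds, not the paper's tuned values)."""
    out, i = [], 0
    while i < len(tokens):
        pair = (tokens[i], tokens[i + 1]) if i + 1 < len(tokens) else None
        if pair and pair in scores:
            chi2, t = scores[pair]
            if chi2 > chi2_cutoff and t > t_cutoff:
                out.append(tokens[i] + "_" + tokens[i + 1])
                i += 2
                continue
        out.append(tokens[i])
        i += 1
    return out
```

Under this sketch, a strongly associated pair such as a compound noun would exceed both cutoffs and emerge as a single merged token in the LDA vocabulary, while incidental adjacencies would be left unmerged.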