As the amount of text data continues to grow, topic modeling plays an important role in uncovering the content hidden within overwhelming quantities of documents. One popular topic modeling approach is non-negative matrix factorization (NMF), an unsupervised machine learning (ML) method. Recently, Semantic NMF with automatic model selection (SeNMFk) has been proposed as a modification of NMF. In addition to heuristically estimating the number of topics, SeNMFk also incorporates the semantic structure of the text. This is performed by jointly factorizing the term frequency-inverse document frequency (TF-IDF) matrix with the co-occurrence/word-context matrix, whose entries count the number of times two words co-occur within a predetermined window of the text. In this paper, we introduce a novel distributed method, SeNMFk-SPLIT, for semantic topic extraction suitable for large corpora. In contrast to SeNMFk, our method enables the joint factorization of large corpora by decomposing the word-context and term-document matrices separately. We demonstrate the capability of SeNMFk-SPLIT by applying it to the entire body of artificial intelligence (AI) and ML scientific literature uploaded to arXiv.
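To make the two inputs concrete, the sketch below builds the TF-IDF term-document matrix and a windowed word-context co-occurrence matrix for a toy corpus and factorizes each with standard NMF. This is an illustrative, simplified sketch only, not the SeNMFk or SeNMFk-SPLIT implementation; the corpus, window size, and topic count k are hypothetical choices for demonstration.

```python
# Illustrative sketch (not the authors' code): TF-IDF and word-context matrices
# factorized separately with off-the-shelf NMF. Window size and k are hypothetical.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    "topic modeling extracts latent themes from text",
    "non negative matrix factorization decomposes a matrix into factors",
    "semantic structure comes from word co occurrence statistics",
]

# Term-document TF-IDF matrix X (documents x terms).
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)                  # shape: (n_docs, n_terms)
vocab = tfidf.get_feature_names_out()
index = {w: i for i, w in enumerate(vocab)}
analyzer = tfidf.build_analyzer()

# Word-context matrix M: counts of word pairs co-occurring within a window.
window = 2                                     # hypothetical window size
M = np.zeros((len(vocab), len(vocab)))
for doc in docs:
    tokens = [t for t in analyzer(doc) if t in index]
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                M[index[w], index[tokens[j]]] += 1

# Factorize the two matrices separately (the "split" idea, in spirit):
# each yields a terms-by-topics factor that can be compared or combined downstream.
k = 2                                          # hypothetical number of topics
W_doc = NMF(n_components=k, random_state=0, max_iter=500).fit_transform(X.T)
W_ctx = NMF(n_components=k, random_state=0, max_iter=500).fit_transform(M)
print(W_doc.shape, W_ctx.shape)                # both: (n_terms, k)
```

In SeNMFk the two matrices are factorized jointly and the number of topics is estimated automatically; the point of the sketch is only to show what the TF-IDF and word-context inputs look like and that each admits its own non-negative factorization.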