Existing topic modeling and text segmentation methodologies generally require large datasets for training, which limits their usefulness when only small collections of text are available. In this work, we reexamine the inter-related problems of "topic identification" and "text segmentation" in the sparse-document setting, where only a single new text of interest is available. In developing a methodology to handle single documents, we face two major challenges. The first is sparse information: with access to only one document, we cannot train traditional topic models or deep learning algorithms. The second is significant noise: a considerable portion of the words in any single document contribute only noise and do not help discern topics or segments. To tackle these issues, we design an unsupervised, computationally efficient methodology called BATS: Biclustering Approach to Topic modeling and Segmentation. BATS leverages three key ideas to simultaneously identify topics and segment text: (i) a new mechanism that uses word order information to reduce sample complexity, (ii) a statistically sound graph-based biclustering technique that identifies latent structures of words and sentences, and (iii) a collection of effective heuristics that remove noise words and reward important words to further improve performance. Experiments on four datasets show that our approach outperforms several state-of-the-art baselines on topic coherence, topic diversity, segmentation, and runtime metrics.
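To make the idea of jointly clustering words and sentences more concrete, the following is a minimal sketch of a generic spectral bipartition of a sentence-word graph. It is not the BATS algorithm: the word-order mechanism and the paper's specific heuristics are omitted, and the stopword list, the toy sentences, and the function names `build_matrix`, `bicluster`, and `segments` are all assumptions made for illustration. The toy document is deliberately cleanly separated into two topics.

```python
import math

# Hypothetical noise-word list; stands in for a real noise-removal heuristic.
STOPWORDS = {"the", "in", "at", "and"}

# Toy single document, one sentence per entry (illustrative data only).
SENTENCES = [
    "bats sleep in the cave",
    "bats leave the cave at night",
    "at night bats hunt insects",
    "the market fell and investors sold stock",
    "investors fear the stock market",
    "the stock market recovered",
]


def build_matrix(sentences):
    """Return (vocab, A) where A[i][j] counts word j in sentence i."""
    vocab, index, rows = [], {}, []
    for sentence in sentences:
        counts = {}
        for word in sentence.lower().split():
            if word in STOPWORDS:
                continue
            if word not in index:
                index[word] = len(vocab)
                vocab.append(word)
            counts[index[word]] = counts.get(index[word], 0) + 1
        rows.append(counts)
    A = [[row.get(j, 0) for j in range(len(vocab))] for row in rows]
    return vocab, A


def bicluster(A, iters=200):
    """Split sentences and words into two groups via the second singular
    vector of the degree-normalised sentence-word matrix."""
    n, m = len(A), len(A[0])
    r = [sum(row) for row in A]                              # sentence degrees
    c = [sum(A[i][j] for i in range(n)) for j in range(m)]   # word degrees
    An = [[A[i][j] / math.sqrt(r[i] * c[j]) for j in range(m)]
          for i in range(n)]
    # The leading right singular vector of An is proportional to sqrt(c).
    total = math.sqrt(sum(c))
    v1 = [math.sqrt(cj) / total for cj in c]
    x = [float(j + 1) for j in range(m)]                     # fixed start vector
    for _ in range(iters):
        # Deflate the trivial direction, then apply An^T An (power iteration).
        dot = sum(a * b for a, b in zip(x, v1))
        x = [a - dot * b for a, b in zip(x, v1)]
        y = [sum(An[i][j] * x[j] for j in range(m)) for i in range(n)]
        x = [sum(An[i][j] * y[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(a * a for a in x))
        x = [a / norm for a in x]
    u = [sum(An[i][j] * x[j] for j in range(m)) for i in range(n)]
    sent_labels = [int(ui > 0) for ui in u]
    word_labels = [int(xj > 0) for xj in x]
    return sent_labels, word_labels


def segments(labels):
    """Group consecutive sentences with the same label into (start, end) spans."""
    spans, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            spans.append((start, i - 1))
            start = i
    return spans


vocab, A = build_matrix(SENTENCES)
sent_labels, word_labels = bicluster(A)
print(segments(sent_labels))  # prints [(0, 2), (3, 5)]
for label in (0, 1):
    print(label, [w for w, l in zip(vocab, word_labels) if l == label])
```

In this sketch each bicluster pairs a set of sentences (a candidate segment) with a set of words (a candidate topic), which is the sense in which a single biclustering step can address both problems at once. Real documents would need a more robust noise filter and more than two clusters.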