To overcome the data sparsity issue in short text topic modeling, existing methods commonly rely on data augmentation or the data characteristics of short texts to introduce more word co-occurrence information. However, most of them do not make full use of the augmented data or these data characteristics: they insufficiently learn the relations among samples, leading to dissimilar topic distributions for semantically similar text pairs. To better address data sparsity, in this paper we propose a novel short text topic modeling framework, the Topic-Semantic Contrastive Topic Model (TSCTM). To sufficiently model the relations among samples, we employ a new contrastive learning method with efficient positive and negative sampling strategies based on topic semantics. This contrastive learning method refines the representations, enriches the learning signals, and thus mitigates the sparsity issue. Extensive experimental results show that our TSCTM outperforms state-of-the-art baselines regardless of whether data augmentation is available, producing high-quality topics and topic distributions.
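To make the idea of topic-semantic contrastive learning concrete, here is a minimal sketch of one plausible instantiation — not the authors' implementation. It assumes positives are documents sharing the same dominant topic in their inferred topic distributions (a pseudo-labeling heuristic) and uses an InfoNCE-style loss over those distributions; the function name and the `temperature` parameter are illustrative assumptions.

```python
import numpy as np

def topic_semantic_contrastive_loss(theta, temperature=0.5):
    """Illustrative sketch (not the paper's exact method): pull together
    topic distributions whose dominant (argmax) topic matches, push apart
    those that differ.

    theta: (N, K) array, each row a document's topic distribution.
    """
    # Cosine-normalize the topic distributions before comparing them.
    z = theta / np.linalg.norm(theta, axis=1, keepdims=True)
    sim = z @ z.T / temperature          # pairwise scaled similarities
    labels = theta.argmax(axis=1)        # dominant topic as pseudo-label
    n = len(theta)
    loss = 0.0
    for i in range(n):
        # Positives: other samples sharing document i's dominant topic.
        pos = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not pos:
            continue  # no positive pair for this anchor
        # Denominator sums over all candidates except the anchor itself.
        mask = np.arange(n) != i
        denom = np.exp(sim[i][mask]).sum()
        loss += -np.mean([np.log(np.exp(sim[i, j]) / denom) for j in pos])
    return loss / n
```

Under this heuristic, semantically similar documents (those assigned the same dominant topic) are explicitly drawn together in topic space, which is one way to encourage similar topic distributions for similar text pairs.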