We propose CWIBTD, a simple and general word co-occurrence network-based model for discovering scarce topics in unbalanced short-text datasets. CWIBTD simultaneously addresses the sparsity and imbalance of short-text topics and attenuates the effect of occasional pairwise word co-occurrences, allowing the model to focus on scarce topics. Unlike previous approaches, CWIBTD uses a word co-occurrence network to model the topic distribution of each word, which increases the semantic density of the data space. It remains sensitive to rare topics by changing how node activity is calculated and by normalizing, to some extent, between scarce and large topics. In addition, because it uses the same Gibbs sampling as LDA, CWIBTD is easy to extend to various application scenarios. Extensive experiments on an unbalanced short-text dataset confirm the superiority of CWIBTD over the baseline approach in discovering rare topics. The model can be used for early and accurate discovery of emerging topics or unexpected events on social platforms.
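To make the notion of a word co-occurrence network concrete, the following is a minimal illustrative sketch, not the CWIBTD implementation. It assumes that documents are already tokenized, that an edge weight is the number of documents in which two words appear together, and that occasional co-occurrences can be attenuated by a simple count threshold. The names `build_cooccurrence_network`, `node_strength`, and `min_count` are hypothetical, and the weighted-degree "strength" is only a stand-in for the node activity mentioned above, whose exact definition is not given here.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_network(docs, min_count=2):
    """Return {(word_a, word_b): count} for pairs co-occurring in >= min_count docs.

    Assumption: min_count is a crude stand-in for attenuating occasional
    pairwise co-occurrences; the paper's actual mechanism may differ.
    """
    pair_counts = Counter()
    for tokens in docs:
        # Sort the unique words so each unordered pair has one canonical key.
        vocab = sorted(set(tokens))
        pair_counts.update(combinations(vocab, 2))
    return {pair: c for pair, c in pair_counts.items() if c >= min_count}


def node_strength(edges):
    """Weighted degree of each word (hypothetical proxy for node activity)."""
    strength = Counter()
    for (u, v), c in edges.items():
        strength[u] += c
        strength[v] += c
    return strength


# Toy unbalanced corpus: a larger "football" topic and a scarcer "flood" topic.
docs = [
    ["flood", "river", "alert"],
    ["flood", "river", "rescue"],
    ["match", "goal", "team"],
    ["match", "goal", "fans"],
    ["match", "team", "goal"],
    ["river", "goal"],  # an occasional, cross-topic co-occurrence
]
edges = build_cooccurrence_network(docs, min_count=2)
strength = node_strength(edges)
print(edges)
print(dict(strength))
```

In this toy corpus the one-off pair ("goal", "river") falls below the threshold and is dropped, while the scarce-topic edge ("flood", "river") survives alongside the edges of the larger topic.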