Unsupervised discovery of stories from correlated news articles in real time helps people digest massive news streams without expensive human annotation. A common approach in existing studies on unsupervised online story discovery is to represent news articles with symbolic or graph-based embeddings and incrementally cluster them into stories. Recent large language models are expected to improve these embeddings further, but adopting them straightforwardly by indiscriminately encoding all information in articles is ineffective for text-rich and evolving news streams. In this work, we propose a novel thematic embedding that uses an off-the-shelf pretrained sentence encoder to dynamically represent articles and stories by considering their shared temporal themes. To realize this idea for unsupervised online story discovery, we introduce a scalable framework, USTORY, with two main techniques, theme- and time-aware dynamic embedding and novelty-aware adaptive clustering, both fueled by lightweight story summaries. A thorough evaluation on real news data sets demonstrates that USTORY achieves higher story discovery performance than baselines while being robust and scalable across various streaming settings.
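To make the general pattern of embedding articles and incrementally clustering them into stories concrete, the following is a minimal sketch of a generic online clustering loop. It is not the USTORY algorithm: the toy 2-D vectors stand in for sentence-encoder outputs, and the names `Story`, `assign`, `cluster_stream`, the cosine `threshold`, and the sliding `window` are all assumptions introduced here for illustration only.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Story:
    # Hypothetical container: (timestamp, vector) pairs of assigned articles.
    members: list = field(default_factory=list)

    def centroid(self, now, window):
        """Mean of member vectors no older than `window`; None if all expired."""
        recent = [v for t, v in self.members if now - t <= window]
        if not recent:
            return None
        return [sum(col) / len(recent) for col in zip(*recent)]


def assign(stories, timestamp, vector, threshold=0.8, window=3.0):
    """Attach an article to its most similar active story, or open a new one.

    Returns the index of the story the article was placed in.
    """
    best_idx, best_sim = None, threshold
    for i, story in enumerate(stories):
        c = story.centroid(timestamp, window)
        if c is None:  # story has no articles inside the time window
            continue
        sim = cosine(vector, c)
        if sim >= best_sim:
            best_idx, best_sim = i, sim
    if best_idx is None:  # nothing similar enough: treat the article as novel
        stories.append(Story())
        best_idx = len(stories) - 1
    stories[best_idx].members.append((timestamp, vector))
    return best_idx


def cluster_stream(stream, threshold=0.8, window=3.0):
    """Process (timestamp, vector) pairs in arrival order; return story labels."""
    stories, labels = [], []
    for timestamp, vector in stream:
        labels.append(assign(stories, timestamp, vector, threshold, window))
    return labels
```

In this sketch the fixed threshold and the plain windowed centroid are the simplest possible choices. The abstract's theme- and time-aware embedding and novelty-aware adaptive clustering would replace exactly these two pieces, but their details are not given in the abstract and are not reproduced here.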