Topic Detection and Tracking with Time-Aware Document Embeddings

The time at which a message is communicated is a vital piece of metadata in many real-world natural language processing tasks such as Topic Detection and Tracking (TDT). TDT systems aim to cluster a corpus of news articles by event, and in that context, stories that describe the same event are likely to have been written at around the same time. Prior work on time modeling for TDT takes this into account, but does not capture well how time interacts with the semantic nature of the event. For example, stories about a tropical storm are likely to be written within a short time interval, while stories about a movie release may appear over weeks or months. In our work, we design a neural method that fuses temporal and textual information into a single representation of news documents for event detection. We fine-tune these time-aware document embeddings with a triplet loss architecture, integrate the model into downstream TDT systems, and evaluate the systems on two benchmark TDT data sets in English. In the retrospective setting, we apply clustering algorithms to the time-aware embeddings and show substantial improvements over baselines on the News2013 data set. In the online streaming setting, we add our document encoder to an existing state-of-the-art TDT pipeline and demonstrate that it can improve overall performance. We conduct ablation studies on the time representation and fusion strategies, showing that our proposed model outperforms the alternatives. Finally, we probe the model to examine how it handles recurring events more effectively than previous TDT systems.
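To make the idea of a time-aware embedding trained with a triplet loss more concrete, here is a minimal sketch. The abstract does not specify the time representation or the fusion mechanism, so the choices below are assumptions for illustration only: a sinusoidal encoding of the publication day, fusion by plain concatenation, and a Euclidean triplet margin loss. The function names (`time_encoding`, `fuse`, `triplet_loss`) and the toy vectors are hypothetical and do not come from the paper.

```python
import math

def time_encoding(day, dim=8, max_period=10000.0):
    """Sinusoidal encoding of a publication day (assumed, not from the paper)."""
    enc = []
    for i in range(dim // 2):
        freq = 1.0 / (max_period ** (2 * i / dim))
        enc.append(math.sin(day * freq))
        enc.append(math.cos(day * freq))
    return enc

def fuse(text_vec, day, dim=8):
    """Fuse text and time by concatenation (a stand-in for a learned fusion)."""
    return list(text_vec) + time_encoding(day, dim)

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss that pushes the positive closer to the anchor than the negative."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Toy recurring-event case: the positive and the negative share the same text
# vector, so only the publication day separates them from the anchor.
anchor = fuse([1.0, 0.0], day=10)
positive = fuse([0.9, 0.1], day=11)    # same event, published one day later
negative = fuse([0.9, 0.1], day=200)   # similar text, published months later

loss = triplet_loss(anchor, positive, negative)
```

In this toy setup the text vectors alone cannot distinguish the positive from the negative; the time component is what separates them. That is the kind of situation the abstract describes for recurring events, though the actual model learns its fusion rather than concatenating fixed features.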

