The time at which a message is communicated is a vital piece of metadata in many real-world natural language processing tasks such as Topic Detection and Tracking (TDT). TDT systems aim to cluster a corpus of news articles by event, and in that context, stories that describe the same event are likely to have been written at around the same time. Prior work on time modeling for TDT takes this into account but does not capture well how time interacts with the semantic nature of the event. For example, stories about a tropical storm are likely to be written within a short time interval, while stories about a movie release may appear over weeks or months. In our work, we design a neural method that fuses temporal and textual information into a single representation of news documents for event detection. We fine-tune these time-aware document embeddings with a triplet loss architecture, integrate the model into downstream TDT systems, and evaluate the systems on two benchmark TDT data sets in English. In the retrospective setting, we apply clustering algorithms to the time-aware embeddings and show substantial improvements over baselines on the News2013 data set. In the online streaming setting, we add our document encoder to an existing state-of-the-art TDT pipeline and demonstrate that it improves overall performance. We conduct ablation studies on the time representation and fusion algorithm strategies, showing that our proposed model outperforms alternative strategies. Finally, we probe the model to examine how it handles recurring events more effectively than previous TDT systems.
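The core ideas in the abstract, fusing a temporal signal with a text embedding and training with a triplet loss so that same-event documents end up close together, can be illustrated with a minimal sketch. The sinusoidal time encoding and the concatenation-based fusion below are stand-in assumptions for exposition; the paper's actual encoder, time representation, and fusion strategy are learned components and are not reproduced here.

```python
import math

def encode_time(days_since_epoch, dim=4):
    # Sinusoidal features of the publication day (illustrative choice;
    # the paper ablates several time representations).
    half = dim // 2
    freqs = [1.0 / (10000 ** (2.0 * i / dim)) for i in range(half)]
    return ([math.sin(days_since_epoch * f) for f in freqs] +
            [math.cos(days_since_epoch * f) for f in freqs])

def fuse(text_emb, time_emb):
    # Simple concatenation as a stand-in for the learned fusion module.
    return list(text_emb) + list(time_emb)

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Pull same-event documents toward the anchor, push different-event
    # documents at least `margin` farther away.
    return max(0.0,
               euclidean(anchor, positive)
               - euclidean(anchor, negative) + margin)

# Toy usage: two storm articles a day apart vs. an unrelated later article.
anchor   = fuse([0.9, 0.1], encode_time(100))
positive = fuse([0.8, 0.2], encode_time(101))
negative = fuse([0.1, 0.9], encode_time(400))
loss = triplet_loss(anchor, positive, negative)
```

After fine-tuning with such a loss, clustering the fused embeddings (retrospective setting) or comparing them online against incoming articles (streaming setting) groups documents by event, with the temporal features discouraging merges across distant publication times.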