With the ever-growing volume of online news feeds, event-based organization of news articles has many practical applications, including better information navigation and the ability to view and analyze events as they develop. Automatically tracking the evolution of events in large news corpora remains a challenging task, and existing techniques for event detection and tracking do not focus on very large, constantly updating news feeds. Here, we propose a new method for robust and efficient event detection and tracking, which we call the RevDet algorithm. RevDet adopts an iterative clustering approach to tracking events. Even though many events continue to develop for days or even months, RevDet is able to detect and track them while using only a constant amount of main memory. We also devise a redundancy removal strategy that effectively eliminates duplicate news articles and substantially reduces the size of the data. We construct a large, comprehensive new ground-truth dataset specifically for event detection and tracking approaches by augmenting two existing datasets: w2e and GDELT. We implement the RevDet algorithm and evaluate its performance on the ground-truth event chains. We find that our algorithm accurately recovers the event chains in the ground-truth dataset. We also compare the memory efficiency of our algorithm with the standard single-pass clustering approach, and demonstrate the suitability of our algorithm for the event detection and tracking task in large news feeds.
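To make the idea of window-by-window clustering with bounded state concrete, the following is a minimal sketch and not the RevDet algorithm itself, whose details are not given here. It assumes articles arrive grouped into time windows (for example, one list per day), uses bag-of-words cosine similarity with an arbitrary threshold of 0.3, and treats "duplicate" as an exact match after whitespace and case normalization. The names `track_events`, `dedupe`, `tokens`, and `cosine` are hypothetical.

```python
import hashlib
import math
from collections import Counter


def tokens(text):
    """Bag-of-words term counts (assumed representation, for illustration)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def dedupe(articles):
    """Drop articles identical to an earlier one after normalization."""
    seen, unique = set(), []
    for text in articles:
        normalized = " ".join(text.lower().split())
        key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


def track_events(windows, threshold=0.3):
    """Yield (chain_id, article) pairs, one window at a time.

    Only clusters that received an article in the most recent window
    are carried forward, so the state held in memory is bounded by the
    activity of a single window rather than by the whole corpus.
    """
    clusters = {}  # chain_id -> {"centroid": Counter, "touched": bool}
    next_id = 0
    for window in windows:
        for cluster in clusters.values():
            cluster["touched"] = False
        for text in dedupe(window):
            vec = tokens(text)
            best_id, best_sim = None, threshold
            for cid, cluster in clusters.items():
                sim = cosine(vec, cluster["centroid"])
                if sim >= best_sim:
                    best_id, best_sim = cid, sim
            if best_id is None:
                # No sufficiently similar chain is open: start a new one.
                best_id = next_id
                next_id += 1
                clusters[best_id] = {"centroid": Counter(), "touched": True}
            # Summed counts act as the centroid (cosine ignores scale).
            clusters[best_id]["centroid"].update(vec)
            clusters[best_id]["touched"] = True
            yield best_id, text
        # Close chains that received no article in this window.
        clusters = {cid: c for cid, c in clusters.items() if c["touched"]}


if __name__ == "__main__":
    feed = [
        ["earthquake hits city centre", "Earthquake hits  city centre",
         "election results announced today"],
        ["earthquake rescue continues in city centre"],
        ["election results contested"],
    ]
    for chain_id, article in track_events(feed):
        print(chain_id, article)
```

In this sketch a chain that stays silent for one full window is closed, so a later article on the same story starts a new chain. That is the price of the bounded state in this simplified version; how RevDet itself handles long-running events should be taken from the paper, not from this example.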