Kterm Hashing provides an innovative approach to novelty detection on massive data streams. Previous research focused on maximizing the efficiency of Kterm Hashing and succeeded in scaling First Story Detection to Twitter-size data streams without sacrificing detection accuracy. In this paper, we focus on improving the effectiveness of Kterm Hashing. Traditionally, all kterms are considered equally important when calculating a document's degree of novelty with respect to the past. We believe that certain kterms are more important than others and hypothesize that uniform kterm weights are sub-optimal for determining novelty in data streams. To validate our hypothesis, we parameterize Kterm Hashing by assigning weights to kterms based on their characteristics. Our experiments apply Kterm Hashing in a First Story Detection setting and reveal that parameterized Kterm Hashing can surpass state-of-the-art detection accuracy and significantly outperform the uniformly weighted approach.
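To make the contrast between uniform and parameterized weighting concrete, the following is a minimal sketch, not the implementation evaluated in this work. It assumes that a kterm is a combination of up to k distinct terms from a document, that kterms are hashed into a fixed-size memory, and that novelty is the weighted share of a document's kterms not seen before. Weighting by kterm length is only one illustrative characteristic; the class `KtermNovelty`, the helpers `kterms` and `slot`, and all parameter values are hypothetical names chosen for this example.

```python
from itertools import combinations
import hashlib


def kterms(tokens, k=3):
    """Return all term combinations of size 1..k over the distinct tokens."""
    terms = sorted(set(tokens))
    return [c for n in range(1, k + 1) for c in combinations(terms, n)]


def slot(kterm, size):
    """Map a kterm to a slot of the fixed-size memory with a stable hash."""
    digest = hashlib.md5(" ".join(kterm).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % size


class KtermNovelty:
    """Hypothetical sketch: novelty = weighted share of unseen kterms."""

    def __init__(self, k=3, size=1 << 20, weights=None):
        self.k = k
        self.size = size
        self.seen = bytearray(size)  # one flag per slot, all initially unseen
        # Assumption: one weight per kterm length; uniform weights by default.
        self.weights = weights or {n: 1.0 for n in range(1, k + 1)}

    def score(self, tokens):
        """Score a document against the past, then add it to the memory."""
        total = 0.0
        unseen = 0.0
        touched = []
        for kt in kterms(tokens, self.k):
            w = self.weights[len(kt)]
            i = slot(kt, self.size)
            total += w
            if not self.seen[i]:
                unseen += w
            touched.append(i)
        # Update the memory only after scoring, so that kterms of the same
        # document cannot mark each other as already seen.
        for i in touched:
            self.seen[i] = 1
        return unseen / total if total else 0.0


uniform = KtermNovelty()
weighted = KtermNovelty(weights={1: 1.0, 2: 2.0, 3: 3.0})
for detector in (uniform, weighted):
    detector.score(["earthquake", "hits", "city"])

# A follow-up document that shares two of its three terms with the first one.
uniform_score = uniform.score(["earthquake", "hits", "town"])
weighted_score = weighted.score(["earthquake", "hits", "town"])
```

In this sketch, a document scored against an empty memory receives the maximum novelty, and an exact repeat receives zero. Giving longer kterms a larger weight raises the score of the follow-up document relative to the uniform setting, because its unseen kterms are mostly the longer combinations. Hash collisions can make an unseen kterm look seen, which is the usual trade-off of a fixed-size memory.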