Since the development of writing 5000 years ago, human-generated data has been produced at an ever-increasing pace. Classical archival methods aimed to ease information retrieval. Nowadays, archiving is no longer enough. The amount of data generated daily is beyond human comprehension and calls for new information retrieval strategies. Instead of referencing every single piece of data, as traditional archival techniques do, a more relevant approach consists of understanding the overall ideas conveyed in data flows. Spotting such general tendencies requires a precise understanding of the underlying data generation mechanisms. In the rich literature tackling this problem, the question of information interaction remains nearly unexplored. First, we investigate the frequency of such interactions. Building on recent advances in Stochastic Block Modelling, we explore the role of interactions in several social networks. We find that interactions are rare in these datasets. Then, we ask how interactions evolve over time. Earlier pieces of data should not have an everlasting influence on later data generation mechanisms. We model this using recent advances in dynamic network inference. We conclude that interactions are brief. Finally, we design a framework that jointly models rare and brief interactions based on Dirichlet-Hawkes Processes. We argue that this new class of models is well suited to modelling brief and sparse interactions. We conduct a large-scale application on Reddit and find that interactions play a minor role in this dataset. From a broader perspective, our work results in a collection of highly flexible models and in a rethinking of core concepts of machine learning. Consequently, we open a range of novel perspectives both in terms of real-world applications and in terms of technical contributions to machine learning.