Investigative Journalism (IJ) is a staple of modern, democratic societies. IJ often requires working with large, dynamic sets of heterogeneous, schema-less data sources, which can be structured, semi-structured, or textual; this limits the applicability of classical data integration approaches. In prior work, we developed ConnectionLens, a system that integrates such sources into a single heterogeneous graph, leveraging Information Extraction (IE) techniques; users can then query the graph by means of keywords, and explore query results and their neighborhood using an interactive GUI. Our keyword search problem is complicated by the graph heterogeneity, and by the lack of a result score function that would allow pruning part of the search space. In this work, we describe an actual IJ application studying conflicts of interest in the biomedical domain, and we show how ConnectionLens supports it. Then, we present novel techniques addressing the scalability challenges raised by this application: one reduces the significant IE costs incurred while building the graph, while the other is a novel, parallel, in-memory keyword search engine, which achieves orders-of-magnitude speed-up over our previous engine. Our experimental study on the real-world IJ application data confirms the benefits of our contributions.