As content on the Internet continues to grow, new, dynamically changing, and heterogeneous data sources constantly emerge. A conventional search engine cannot crawl and index at the pace at which the Internet expands. Moreover, a large portion of the data on the Internet is not accessible to traditional search engines. Distributed Information Retrieval (DIR) is a viable solution because it integrates multiple shards (resources) and provides unified access to them. Resource selection, which decides which resources to search for a given query, is a key component of DIR systems. Although there is a rich body of literature on resource selection for DIR, a key limitation of existing approaches is that they rely primarily on term-based statistical features and generally do not model resource-query and resource-resource relationships. In this paper, we propose a graph neural network (GNN) based learning-to-rank approach that models both kinds of relationships. Specifically, we use a pre-trained language model (PTLM) to obtain semantic information from queries and resources. We then explicitly build a heterogeneous graph that preserves the structural information of query-resource relationships and employ a GNN to extract that information. In addition, we enrich the heterogeneous graph with resource-resource edges to further improve ranking accuracy. Extensive experiments on benchmark datasets show that the proposed approach is highly effective for resource selection: it outperforms the state of the art by 6.4% to 42% on various performance metrics.
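To make the pipeline described above concrete, the following is a minimal sketch and not the implementation evaluated in the paper. It assumes that query and resource embeddings are already available (hand-written toy vectors stand in for PTLM outputs), that resource-resource edges are added when cosine similarity exceeds a hypothetical threshold `rr_threshold`, and that a single round of unweighted mean aggregation stands in for a trained GNN layer. All function names are illustrative.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def build_graph(query_vecs, resource_vecs, qr_edges, rr_threshold=0.9):
    """Build an undirected heterogeneous graph as an adjacency dict.

    Nodes are ("q", i) for queries and ("r", j) for resources.
    qr_edges lists known (query, resource) relevance pairs.
    The similarity threshold for resource-resource edges is an assumption.
    """
    adj = {("q", i): set() for i in range(len(query_vecs))}
    adj.update({("r", j): set() for j in range(len(resource_vecs))})
    # Query-resource edges.
    for i, j in qr_edges:
        adj[("q", i)].add(("r", j))
        adj[("r", j)].add(("q", i))
    # Resource-resource edges between sufficiently similar resources.
    for a in range(len(resource_vecs)):
        for b in range(a + 1, len(resource_vecs)):
            if cosine(resource_vecs[a], resource_vecs[b]) >= rr_threshold:
                adj[("r", a)].add(("r", b))
                adj[("r", b)].add(("r", a))
    return adj


def propagate(features, adj):
    """One round of mean aggregation over each node and its neighbours.

    A trained GNN would apply learned, edge-type-specific weights here.
    """
    out = {}
    for node, neighbours in adj.items():
        group = [features[node]] + [features[n] for n in sorted(neighbours)]
        dim = len(features[node])
        out[node] = [sum(v[d] for v in group) / len(group) for d in range(dim)]
    return out


def rank_resources(query_id, features, n_resources):
    """Rank resource ids by similarity to the query's node representation."""
    q = features[("q", query_id)]
    scores = {j: cosine(q, features[("r", j)]) for j in range(n_resources)}
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings standing in for PTLM outputs (assumed, not real data).
query_vecs = [[1.0, 0.0]]
resource_vecs = [[0.9, 0.1], [0.8, 0.3], [0.0, 1.0]]
qr_edges = [(0, 0)]  # query 0 is known to be relevant to resource 0

adj = build_graph(query_vecs, resource_vecs, qr_edges)
features = {("q", i): v for i, v in enumerate(query_vecs)}
features.update({("r", j): v for j, v in enumerate(resource_vecs)})
updated = propagate(features, adj)
ranking = rank_resources(0, updated, len(resource_vecs))
print(ranking)
```

In this toy setting, the resource-resource edge lets information about the query reach a resource that has no direct query-resource edge, which is the intuition behind enriching the graph with such edges.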