Passage ranking involves two stages, passage retrieval and passage re-ranking, both of which are important and challenging topics for academia and industry in the area of Information Retrieval (IR). However, the commonly used datasets for passage ranking usually focus on English. For non-English scenarios such as Chinese, the existing datasets are limited in data scale, lack fine-grained relevance annotation, and suffer from false negatives. To address this problem, we introduce T2Ranking, a large-scale Chinese benchmark for passage ranking. T2Ranking comprises more than 300K queries and over 2M unique passages from real-world search engines. Expert annotators are recruited to provide 4-level graded (fine-grained) relevance scores for query-passage pairs instead of binary (coarse-grained) relevance judgments. To mitigate the false negative issue, more passages with greater diversity are considered when performing relevance annotation, especially in the test set, to ensure a more accurate evaluation. Apart from the textual query and passage data, auxiliary resources are also provided to facilitate further studies, such as query types and the XML files of the documents from which the passages are generated. To evaluate the dataset, commonly used ranking models are implemented and tested on T2Ranking as baselines. The experimental results show that T2Ranking is challenging and that there is still room for improvement. The full data and all code are available at https://github.com/THUIR/T2Ranking/
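To illustrate why 4-level graded relevance labels can support a finer evaluation than binary judgments, the following minimal sketch compares two rankings of the same passages under both labelings. It is an illustration only, not the benchmark's evaluation code: the 0 to 3 label scale, the choice of nDCG with an exponential gain, the binarization threshold, and all function names are assumptions made for this example.

```python
import math


def dcg(labels):
    """Discounted cumulative gain with the exponential gain 2^rel - 1."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(labels))


def ndcg(labels, k=None):
    """nDCG@k for relevance labels listed in ranked order."""
    k = len(labels) if k is None else k
    ideal = dcg(sorted(labels, reverse=True)[:k])
    return dcg(labels[:k]) / ideal if ideal > 0 else 0.0


def binarize(labels, threshold=1):
    """Collapse graded labels to binary ones (hypothetical threshold)."""
    return [1 if rel >= threshold else 0 for rel in labels]


# Two rankings of the same four passages; labels are assumed to range
# from 0 (irrelevant) to 3 (most relevant).
ranking_a = [3, 2, 1, 0]  # best passage ranked first
ranking_b = [1, 2, 3, 0]  # best passage ranked third

graded_a, graded_b = ndcg(ranking_a), ndcg(ranking_b)
binary_a, binary_b = ndcg(binarize(ranking_a)), ndcg(binarize(ranking_b))

print(f"graded: A={graded_a:.3f}  B={graded_b:.3f}")
print(f"binary: A={binary_a:.3f}  B={binary_b:.3f}")
```

Under the binary labels the two rankings receive the same score, because both place three relevant passages ahead of the irrelevant one. Under the graded labels, ranking A scores higher because it puts the most relevant passage first.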