CODEC: Complex Document and Entity Collection

CODEC is a document and entity ranking benchmark that focuses on complex research topics. We target the essay-style information needs of social science researchers, for example "How has the UK's Open Banking Regulation benefited Challenger Banks?". CODEC includes 42 topics developed by researchers and a new focused web corpus with semantic annotations, including entity links. The resource provides expert judgments on 17,509 documents and entities (416.9 per topic) drawn from diverse automatic and interactive manual runs. The manual runs include 387 query reformulations, providing data for query performance prediction and for evaluating automatic query rewriting. CODEC also includes an analysis of state-of-the-art systems, including dense retrieval and neural re-ranking. The results show that the topics are challenging, with headroom for improvement in both document and entity ranking. Query expansion with entity information yields significant gains on document ranking, demonstrating the resource's value for evaluating and improving entity-oriented search. We also show that the manual query reformulations significantly improve both document and entity ranking performance. Overall, CODEC provides challenging research topics to support the development and evaluation of entity-centric search methods.
