Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages

MIRACL (Multilingual Information Retrieval Across a Continuum of Languages) is a multilingual dataset we have built for the WSDM 2023 Cup challenge that focuses on ad hoc retrieval across 18 different languages, which collectively encompass over three billion native speakers around the world. These languages have diverse typologies, originate from many different language families, and are associated with varying amounts of available resources -- including what researchers typically characterize as high-resource as well as low-resource languages. Our dataset is designed to support the creation and evaluation of models for monolingual retrieval, where the queries and the corpora are in the same language. In total, we have gathered over 700k high-quality relevance judgments for around 77k queries over Wikipedia in these 18 languages, where all assessments have been performed by native speakers hired by our team. Our goal is to spur research that will improve retrieval across a continuum of languages, thus enhancing information access capabilities for diverse populations around the world, particularly those that have been traditionally underserved. This overview paper describes the dataset and baselines that we share with the community. The MIRACL website is live at http://miracl.ai/.
