Information Retrieval (IR) is an important task that underlies many applications. Neural IR (Neu-IR) models overcome the vocabulary mismatch problem of sparse retrievers and strengthen the ranking pipeline through semantic matching. Recent progress in IR has focused mainly on Neu-IR models, including efficient dense retrieval, advanced neural architectures, and robust training for few-shot IR, where training data is scarce. To bring these advances together for researchers and engineers to use and build on, OpenMatch provides a range of functional neural modules based on PyTorch that are designed for extensibility, making it easy to build customized, higher-capacity IR systems. OpenMatch also includes sophisticated optimization tricks, various sparse and dense retrieval methods, and advanced few-shot training methods, sparing users the extra work of reimplementing baselines and fine-tuning neural models. With OpenMatch, we achieve reasonable performance on various ranking datasets, rank first in the automatic group in TREC COVID (Round 2), and rank at the top of the MS MARCO Document Ranking leaderboard. The library, experimental methodologies, and results of OpenMatch are all publicly available at https://github.com/thunlp/OpenMatch.