Domain-Specific Pretraining for Vertical Search: Case Study on Biomedical Literature

Yu Wang, Jinchao Li, Tristan Naumann, Chenyan Xiong, Hao Cheng, Robert Tinn, Cliff Wong, Naoto Usuyama, Richard Rogahn, Zhihong Shen, Yang Qin, Eric Horvitz, Paul N. Bennett, Jianfeng Gao, Hoifung Poon

Information overload is a prevalent challenge in many high-value domains. A prominent case in point is the explosion of the biomedical literature on COVID-19, which swelled to hundreds of thousands of papers in a matter of months. In general, the biomedical literature expands by two papers every minute, totalling over a million new papers every year. Search in the biomedical realm, as in many other vertical domains, is challenging due to the scarcity of direct supervision from click logs. Self-supervised learning has emerged as a promising direction to overcome the annotation bottleneck. We propose a general approach for vertical search based on domain-specific pretraining and present a case study for the biomedical domain. Despite being substantially simpler and not using any relevance labels for training or development, our method performs comparably to or better than the best systems in the official TREC-COVID evaluation, a COVID-related biomedical search competition. Using distributed computing in modern cloud infrastructure, our system can scale to tens of millions of articles on PubMed and has been deployed as Microsoft Biomedical Search, a new search experience for biomedical literature: https://aka.ms/biomedsearch.
