TripClick: The Log Files of a Large Health Web Search Engine

Click logs are valuable resources for a variety of information retrieval (IR) tasks, including query understanding and analysis as well as learning effective IR models, particularly when the models require large amounts of training data. We release a large-scale, domain-specific dataset of click logs obtained from user interactions with the Trip Database health web search engine. The click log dataset comprises approximately 5.2 million user interactions collected between 2013 and 2020. We use this dataset to create a standard IR evaluation benchmark, TripClick, with around 700,000 unique free-text queries and 1.3 million pairs of query-document relevance signals, whose relevance is estimated by two click-through models. As such, the collection is one of the few datasets offering the data richness and scale needed to train neural IR models with a large number of parameters, and notably the first in the health domain. Using TripClick, we conduct experiments to evaluate a variety of IR models, showing the benefits of exploiting this data to train neural architectures. In particular, the evaluation results show that the best-performing neural IR model significantly outperforms classical IR models by a large margin, especially for more frequent queries.

