We consider the situation in which a user has collected a small set of documents on a cohesive topic and wants to retrieve additional documents on this topic from a large collection. Information Retrieval (IR) solutions treat the document set as a query and look for similar documents in the collection. We propose to extend the IR approach by treating the problem as an instance of positive-unlabeled (PU) learning -- i.e., learning binary classifiers from only positive and unlabeled data -- where the positive data corresponds to the query documents and the unlabeled data to the results returned by the IR engine. Applying PU learning to text with large neural networks is a largely unexplored field. We discuss the challenges of applying PU learning in this setting, including an unknown class prior, extremely imbalanced data, and large-scale accurate evaluation of models, and we propose solutions and validate them empirically. We demonstrate the effectiveness of the method in a series of experiments retrieving PubMed abstracts that adhere to fine-grained topics, showing improvements over the base IR solution and other baselines.
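To make the PU setup concrete: the labeled positives are the user's query documents, the unlabeled pool is the IR engine's result list, and a classifier is trained with a risk estimator that corrects for positives hidden in the unlabeled data. The sketch below is not the paper's method; it illustrates one standard such estimator, the non-negative PU risk of Kiryo et al. (2017), with a logistic loss and an assumed class prior (the abstract notes the true prior is unknown in practice).

```python
import numpy as np

def logistic_loss(z):
    # Numerically stable log(1 + exp(-z)).
    return np.logaddexp(0.0, -z)

def nnpu_risk(scores_pos, scores_unl, prior):
    """Non-negative PU risk estimate for a binary classifier's raw scores.

    scores_pos: scores on labeled-positive docs (the query set)
    scores_unl: scores on unlabeled docs (the IR result list)
    prior: assumed fraction of positives in the unlabeled pool
           (a hyperparameter here; unknown in the real setting)
    """
    r_pos = np.mean(logistic_loss(scores_pos))     # risk on positives
    r_neg_u = np.mean(logistic_loss(-scores_unl))  # unlabeled treated as negative
    r_neg_p = np.mean(logistic_loss(-scores_pos))  # correction for hidden positives
    # Clamp the corrected negative risk at zero to prevent it going negative
    # (the "non-negative" fix that curbs overfitting).
    return prior * r_pos + max(0.0, r_neg_u - prior * r_neg_p)
```

In training, this quantity would replace the usual supervised loss; a classifier that separates the two pools well yields a lower risk than one that does not.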