We consider the situation in which a user has collected a small set of documents on a cohesive topic and wants to retrieve additional documents on this topic from a large collection. Information Retrieval (IR) solutions treat the document set as a query and look for similar documents in the collection. We propose to extend the IR approach by treating the problem as an instance of positive-unlabeled (PU) learning -- i.e., learning binary classifiers from only positive and unlabeled data -- where the positive data corresponds to the query documents and the unlabeled data to the results returned by the IR engine. Applying PU learning to text with large neural networks is a largely unexplored field. We discuss the challenges of applying PU learning in this setting, including an unknown class prior, extremely imbalanced data, and large-scale accurate evaluation of models, and we propose solutions and validate them empirically. We demonstrate the effectiveness of the method in a series of experiments retrieving PubMed abstracts that adhere to fine-grained topics, showing improvements over the base IR solution and other baselines.
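To make the PU setup concrete: the labeled positives are the user's query documents, the unlabeled pool is the IR engine's result list, and a classifier is trained with a risk estimator that corrects for positives hidden in the unlabeled data. The sketch below is not the paper's method; it illustrates one standard such estimator, the non-negative PU risk of Kiryo et al. (2017), with a logistic loss and an assumed class prior (the abstract notes the true prior is unknown in practice).

```python
import numpy as np

def logistic_loss(z):
    # Numerically stable log(1 + exp(-z)).
    return np.logaddexp(0.0, -z)

def nnpu_risk(scores_pos, scores_unl, prior):
    """Non-negative PU risk estimate for a binary classifier's raw scores.

    scores_pos: scores on labeled-positive docs (the query set)
    scores_unl: scores on unlabeled docs (the IR result list)
    prior: assumed fraction of positives in the unlabeled pool
           (a hyperparameter here; unknown in the real setting)
    """
    r_pos = np.mean(logistic_loss(scores_pos))     # risk on positives
    r_neg_u = np.mean(logistic_loss(-scores_unl))  # unlabeled treated as negative
    r_neg_p = np.mean(logistic_loss(-scores_pos))  # correction for hidden positives
    # Clamp the corrected negative risk at zero to prevent it going negative
    # (the "non-negative" fix that curbs overfitting).
    return prior * r_pos + max(0.0, r_neg_u - prior * r_neg_p)
```

In training, this quantity would replace the usual supervised loss; a classifier that separates the two pools well yields a lower risk than one that does not.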