Learning from positive and unlabeled (PU) data is a setting in which the learner only has access to positive and unlabeled samples and has no information on negative examples. The PU setting is of great importance in tasks such as medical diagnosis, social network analysis, financial market analysis, and knowledge base completion, which also tend to be intrinsically imbalanced, i.e., settings where most examples are actually negative. Most existing approaches for PU learning, however, only consider artificially balanced datasets, and it is unclear how well they perform in the realistic scenario of imbalanced and long-tailed data distributions. This paper proposes to tackle this challenge via robust and efficient self-supervised pretraining. However, conventional self-supervised learning methods require reformulation when applied to highly imbalanced PU distributions. In this paper, we present \textit{ImPULSeS}, a unified representation learning framework for \underline{Im}balanced \underline{P}ositive \underline{U}nlabeled \underline{L}earning leveraging \underline{Se}lf-\underline{S}upervised debiased pre-training. ImPULSeS uses a generic combination of large-scale unsupervised learning with a debiased contrastive loss and an additional reweighted PU loss. We perform experiments across multiple datasets and show that ImPULSeS is able to halve the error rate of the previous state of the art, even when compared with previous methods that are given the true prior. Moreover, our method shows increased robustness to prior misspecification and superior performance even when pretraining is performed on an unrelated dataset. We anticipate such robustness and efficiency will make it much easier for practitioners to obtain excellent results on other PU datasets of interest. The source code is available at \url{https://github.com/JSchweisthal/ImPULSeS}
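As a hedged illustration of the ``reweighted PU loss'' mentioned above (a sketch only; the exact reweighting used in ImPULSeS may differ), a standard non-negative PU risk estimator with class prior $\pi_p$ takes the form
\[
\widehat{R}_{\mathrm{pu}}(g) \;=\; \pi_p\,\widehat{R}_p^{+}(g) \;+\; \max\!\left(0,\; \widehat{R}_u^{-}(g) - \pi_p\,\widehat{R}_p^{-}(g)\right),
\]
where $\widehat{R}_p^{+}(g)$ and $\widehat{R}_p^{-}(g)$ denote the empirical risks of the labeled positive samples evaluated against the positive and negative labels, respectively, and $\widehat{R}_u^{-}(g)$ is the empirical risk of the unlabeled samples treated as negatives; the $\max(0,\cdot)$ term prevents the estimated negative risk from becoming negative due to overfitting.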