Learning scientific document representations can be substantially improved through contrastive learning objectives, where the challenge lies in creating positive and negative training samples that encode the desired similarity semantics. Prior work relies on discrete citation relations to generate contrast samples. However, discrete citations enforce a hard cut-off to similarity. This is counter-intuitive to similarity-based learning, and ignores that scientific papers can be very similar despite lacking a direct citation - a core problem of finding related research. Instead, we use controlled nearest neighbor sampling over citation graph embeddings for contrastive learning. This control allows us to learn continuous similarity, to sample hard-to-learn negatives and positives, and also to avoid collisions between negative and positive samples by controlling the sampling margin between them. The resulting method SciNCL outperforms the state-of-the-art on the SciDocs benchmark. Furthermore, we demonstrate that it can train (or tune) models sample-efficiently, and that it can be combined with recent training-efficient methods. Perhaps surprisingly, even training a general-domain language model this way outperforms baselines pretrained in-domain.
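The abstract describes the sampling strategy only at a high level. The sketch below illustrates one way controlled nearest-neighbor sampling with a margin between positive and negative rank bands could look; the function name, the brute-force cosine-similarity search, and the band boundaries (`k_pos`, `k_neg`) are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def sample_contrast_pairs(query_idx, embeddings, k_pos=(20, 25), k_neg=(3995, 4000)):
    """Illustrative controlled nearest-neighbor sampling over citation
    graph embeddings (hypothetical parameters, not the paper's setup).

    Positives come from a near-neighbor rank band, negatives from a far
    band; the gap between the two bands acts as a sampling margin that
    keeps positive and negative samples from colliding.
    """
    query = embeddings[query_idx]
    # Cosine similarity of the query paper to every paper in the corpus.
    sims = embeddings @ query / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query) + 1e-12
    )
    ranked = np.argsort(-sims)            # indices sorted by decreasing similarity
    ranked = ranked[ranked != query_idx]  # drop the query paper itself

    positives = ranked[k_pos[0]:k_pos[1]]  # hard positives: similar, but not trivially close
    negatives = ranked[k_neg[0]:k_neg[1]]  # hard negatives: dissimilar, but near the margin
    return positives, negatives
```

For a corpus of realistic size, an approximate nearest-neighbor index would typically replace the brute-force similarity computation shown here; the rank bands and the margin between them are the knobs that control how hard the sampled positives and negatives are.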