Scientific talks are a growing medium for disseminating research, and automatically identifying relevant literature that grounds or enriches a talk would be highly valuable for researchers and students alike. We introduce Reference Prediction from Talks (RPT), a new task that maps long, unstructured scientific presentations to relevant papers. To support research on RPT, we present Talk2Ref, the first large-scale dataset of its kind, containing 6,279 talks and 43,429 cited papers (26 per talk on average), where relevance is approximated by the papers cited in the talk's corresponding source publication. We establish strong baselines by evaluating state-of-the-art text embedding models in a zero-shot retrieval setting, and propose a dual-encoder architecture trained on Talk2Ref. We further explore strategies for handling long transcripts, as well as training strategies for domain adaptation. Our results show that fine-tuning on Talk2Ref significantly improves citation prediction performance, demonstrating both the challenges of the task and the effectiveness of our dataset for learning semantic representations from spoken scientific content. The dataset and trained models are released under an open license to foster future research on integrating spoken scientific communication into citation recommendation systems.
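To make the retrieval setup concrete, the following is a minimal sketch of the zero-shot dual-encoder baseline described above: a talk transcript and candidate paper abstracts are embedded independently and ranked by cosine similarity. The model name, variable names, and use of the sentence-transformers library are illustrative assumptions, not the paper's released implementation.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative off-the-shelf embedding model; the paper's trained dual encoder
# is released separately and may use a different backbone.
model = SentenceTransformer("all-MiniLM-L6-v2")

talk_transcript = "..."            # long, unstructured transcript of a scientific talk
candidate_abstracts = ["...", "..."]  # abstracts of candidate papers to rank

# Encode both sides with the same encoder (zero-shot, no fine-tuning).
talk_emb = model.encode(talk_transcript, convert_to_tensor=True)
paper_embs = model.encode(candidate_abstracts, convert_to_tensor=True)

# Rank candidate papers by cosine similarity to the talk embedding.
scores = util.cos_sim(talk_emb, paper_embs)[0]
ranking = scores.argsort(descending=True)
print(ranking)
```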