Learned representations of scientific documents can serve as valuable input features for downstream tasks, without the need for further fine-tuning. However, existing benchmarks for evaluating these representations fail to capture the diversity of relevant tasks. In response, we introduce SciRepEval, the first comprehensive benchmark for training and evaluating scientific document representations. It includes 25 challenging and realistic tasks, 11 of which are new, across four formats: classification, regression, ranking and search. We then use the benchmark to study and improve the generalization ability of scientific document representation models. We show how state-of-the-art models struggle to generalize across task formats, and that simple multi-task training fails to improve them. However, a new approach that learns multiple embeddings per document, each tailored to a different format, can improve performance. We experiment with task-format-specific control codes and adapters in a multi-task setting and find that they outperform the existing single-embedding state-of-the-art by up to 1.5 points absolute.
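To make the control-code idea concrete, the sketch below shows one plausible way to obtain a separate embedding per task format by prepending a learned format token to the document text. This is an illustrative sketch, not the authors' implementation: the control-code strings, the `allenai/specter` base checkpoint, and the [CLS]-pooling choice are all assumptions made for the example.

```python
# Minimal sketch (not the paper's released code) of task-format-specific
# embeddings via control-code prefixes on a transformer encoder.
import torch
from transformers import AutoTokenizer, AutoModel

# Hypothetical control codes, one per task format in SciRepEval.
CONTROL_CODES = ["[CLF]", "[RGN]", "[PRX]", "[SRCH]"]

# Any SciBERT/SPECTER-style encoder could serve as the base model here.
tokenizer = AutoTokenizer.from_pretrained("allenai/specter")
model = AutoModel.from_pretrained("allenai/specter")

# Register the control codes as extra tokens so each format gets its own
# learned prefix embedding (trained jointly in a multi-task setup).
tokenizer.add_tokens(CONTROL_CODES, special_tokens=True)
model.resize_token_embeddings(len(tokenizer))

def embed(title: str, abstract: str, control_code: str) -> torch.Tensor:
    """Return one embedding of the document, conditioned on the task format."""
    text = f"{control_code} {title} {tokenizer.sep_token} {abstract}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0]  # [CLS] pooling, SPECTER-style

# The same paper yields a different vector for each task format:
paper = ("SciRepEval: A Multi-Format Benchmark",
         "We introduce 25 tasks across four formats ...")
format_embeddings = {code: embed(*paper, code) for code in CONTROL_CODES}
```

A downstream user would then pick the embedding matching the target task's format (e.g., the search-format vector for retrieval), rather than relying on a single general-purpose vector; the adapter variant mentioned in the abstract swaps the shared encoder's prefix tokens for format-specific adapter layers.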