The number of scientific publications is rapidly increasing, causing information overload for researchers and making it hard for scholars to keep up with current trends and lines of work. Consequently, recent work on text mining for scholarly publications has investigated automatic text summarization, including extreme summarization, for this domain. However, previous work has concentrated only on monolingual settings, primarily in English. In this paper, we fill this research gap and present an abstractive cross-lingual summarization dataset for four different languages in the scholarly domain, which enables us to train and evaluate models that process English papers and generate summaries in German, Italian, Chinese and Japanese. We present our new X-SCITLDR dataset for multilingual summarization and thoroughly benchmark different models based on a state-of-the-art multilingual pre-trained model, including a two-stage `summarize and translate' approach and a direct cross-lingual model. We additionally explore the benefits of intermediate-stage training using English monolingual summarization and machine translation as intermediate tasks, and analyze performance in zero- and few-shot scenarios.
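To make the two benchmarked strategies concrete, the sketch below contrasts the two-stage `summarize and translate' pipeline with a direct cross-lingual model. It is an illustrative sketch only: the Hugging Face checkpoints (facebook/bart-large-cnn, facebook/mbart-large-50-many-to-many-mmt, facebook/mbart-large-50) and the German target language are assumptions for demonstration, not the models or configurations trained in this paper, and the direct model would in practice first be fine-tuned on English-paper/German-summary pairs from X-SCITLDR.

```python
# Illustrative sketch of the two cross-lingual summarization strategies.
# Checkpoint names are public models chosen for demonstration only;
# they are NOT the models trained or released with this paper.
from transformers import pipeline, MBartForConditionalGeneration, MBart50TokenizerFast

paper_text = "We present X-SCITLDR, a cross-lingual summarization dataset ..."  # placeholder input

# (1) Two-stage `summarize and translate': monolingual English summarization
#     followed by machine translation into the target language (here: German).
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
english_tldr = summarizer(paper_text, max_length=60, min_length=10)[0]["summary_text"]

mt_name = "facebook/mbart-large-50-many-to-many-mmt"
mt_model = MBartForConditionalGeneration.from_pretrained(mt_name)
mt_tok = MBart50TokenizerFast.from_pretrained(mt_name, src_lang="en_XX")
batch = mt_tok(english_tldr, return_tensors="pt")
translated_ids = mt_model.generate(**batch, forced_bos_token_id=mt_tok.lang_code_to_id["de_DE"])
german_tldr_two_stage = mt_tok.batch_decode(translated_ids, skip_special_tokens=True)[0]

# (2) Direct cross-lingual model: a single multilingual encoder-decoder reads the
#     English paper and generates the German summary in one step. The forced BOS
#     token selects the output language; without fine-tuning on X-SCITLDR this
#     only demonstrates the inference interface, not summarization quality.
xl_name = "facebook/mbart-large-50"
xl_model = MBartForConditionalGeneration.from_pretrained(xl_name)
xl_tok = MBart50TokenizerFast.from_pretrained(xl_name, src_lang="en_XX")
inputs = xl_tok(paper_text, return_tensors="pt", truncation=True)
summary_ids = xl_model.generate(**inputs, forced_bos_token_id=xl_tok.lang_code_to_id["de_DE"], max_length=60)
german_tldr_direct = xl_tok.batch_decode(summary_ids, skip_special_tokens=True)[0]

print(german_tldr_two_stage)
print(german_tldr_direct)
```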