Cross-lingual summarization (CLS) has attracted increasing interest in recent years due to the availability of large-scale web-mined datasets and advances in multilingual language models. However, because naturally occurring CLS resources are scarce, most datasets are forced to rely on translation, which can introduce overly literal artifacts. This restricts our ability to observe naturally occurring CLS pairs that capture organic diction, including instances of code-switching. Code-switching, the alternation between languages mid-message, is a common phenomenon in multilingual settings, yet it has been largely overlooked in cross-lingual contexts due to data scarcity. To address this gap, we introduce CroCoSum, a dataset of cross-lingual code-switched summarization of technology news. It consists of over 24,000 English source articles and 18,000 human-curated Chinese news summaries, with more than 92% of the summaries containing code-switched phrases. For reference, we evaluate the performance of existing approaches, including pipeline, end-to-end, and zero-shot methods. We show that leveraging existing resources as a pretraining step does not improve performance on CroCoSum, indicating the limited generalizability of existing resources. Finally, we discuss the challenges of evaluating cross-lingual summarizers on code-switched generation through qualitative error analyses. Our dataset and code are available at https://github.com/RosenZhang/CroCoSum.