CroCoSum: A Benchmark Dataset for Cross-Lingual Code-Switched Summarization


March 7, 2023


Ruochen Zhang, Carsten Eickhoff

From arXiv. Work in progress.

Cross-lingual summarization (CLS) has attracted increasing interest in recent years due to the availability of large-scale web-mined datasets and advances in multilingual language models. However, because naturally occurring CLS resources are rare, most datasets are forced to rely on translation, which can introduce overly literal artifacts. This restricts our ability to observe naturally occurring CLS pairs that capture organic diction, including instances of code-switching. This alternation between languages in mid-message is a common phenomenon in multilingual settings, yet it has been largely overlooked in cross-lingual contexts due to data scarcity. To address this gap, we introduce CroCoSum, a dataset of cross-lingual code-switched summarization of technology news. It consists of over 24,000 English source articles and 18,000 human-curated Chinese news summaries, with more than 92% of the summaries containing code-switched phrases. For reference, we evaluate the performance of existing approaches, including pipeline, end-to-end, and zero-shot methods. We show that leveraging existing resources as a pretraining step does not improve performance on CroCoSum, indicating the limited generalizability of existing resources. Finally, we discuss the challenges of evaluating cross-lingual summarizers on code-switched generation through qualitative error analyses. Our collection and code can be accessed at https://github.com/RosenZhang/CroCoSum.
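To make the notion of a code-switched summary concrete, here is a minimal Python sketch of how one might flag Chinese summaries that contain embedded Latin-script phrases and estimate what share of a collection they make up. The abstract does not state how the authors identify code-switching, so the rule used here (at least one CJK character plus at least one run of Latin letters) is an assumption, and the function names and example sentences are hypothetical rather than taken from CroCoSum.

```python
import re

# Heuristic assumption (not the authors' stated criterion): a summary counts
# as code-switched if it contains CJK characters AND at least one Latin run.
CJK = re.compile(r"[\u4e00-\u9fff]")
LATIN_RUN = re.compile(r"[A-Za-z][A-Za-z0-9.+#-]*")


def latin_spans(text: str) -> list[str]:
    """Return the runs of Latin-script characters found in the text."""
    return LATIN_RUN.findall(text)


def is_code_switched(text: str) -> bool:
    """True if the text mixes CJK characters with at least one Latin run."""
    return bool(CJK.search(text)) and bool(LATIN_RUN.search(text))


def code_switched_ratio(summaries: list[str]) -> float:
    """Fraction of summaries flagged as code-switched (0.0 for empty input)."""
    if not summaries:
        return 0.0
    return sum(is_code_switched(s) for s in summaries) / len(summaries)


# Made-up example sentences, not drawn from the dataset.
examples = [
    "微软发布了新的 Windows 更新",
    "研究人员提出了一种新方法",
    "Google 开源了 TensorFlow 的新版本",
]

if __name__ == "__main__":
    for s in examples:
        print(is_code_switched(s), latin_spans(s))
    print(f"ratio: {code_switched_ratio(examples):.2f}")
```

A script-based rule like this is deliberately crude: it would also flag URLs or version strings, and it cannot tell an established loanword from a genuine switch, so it is a rough lower bar rather than a linguistic definition.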

