Lack of encyclopedic text contributors, especially on Wikipedia, makes automated text generation for low-resource (LR) languages a critical problem. Existing work on Wikipedia text generation has focused only on English, where English reference articles are summarized to generate English Wikipedia pages. However, for low-resource languages, the scarcity of reference articles makes monolingual summarization ineffective. Hence, in this work, we propose XWikiGen, the task of cross-lingual multi-document summarization of text from multiple reference articles, written in various languages, to generate Wikipedia-style text. Accordingly, we contribute a benchmark dataset, XWikiRef, spanning ~69K Wikipedia articles covering five domains and eight languages. We harness this dataset to train a two-stage system whose input is a set of citations and a section title and whose output is a section-specific LR summary. The proposed system is based on the novel idea of neural unsupervised extractive summarization to coarsely identify salient information, followed by a neural abstractive model that generates the section-specific text. Extensive experiments show that multi-domain training outperforms the multi-lingual setup on average.
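To make the two-stage design concrete, the sketch below shows one plausible instantiation of the pipeline: an unsupervised extractive stage scores multilingual reference sentences against the section title and keeps the most salient ones, and an abstractive multilingual seq2seq model then generates the section text from them. The specific model choices (LaBSE sentence embeddings via `sentence-transformers`, `google/mt5-small` via HuggingFace `transformers`) and the prompt format are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical two-stage extract-then-generate pipeline in the spirit of XWikiGen.
# Model names and prompt format are assumptions for illustration only.
from sentence_transformers import SentenceTransformer, util
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM


def extract_salient(sentences, section_title, k=10,
                    encoder_name="sentence-transformers/LaBSE"):
    """Stage 1 (unsupervised extractive): rank reference sentences by
    cosine similarity to the section title in a multilingual embedding
    space and keep the top-k."""
    encoder = SentenceTransformer(encoder_name)
    sent_emb = encoder.encode(sentences, convert_to_tensor=True)
    query_emb = encoder.encode(section_title, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, sent_emb)[0]
    top_idx = scores.topk(min(k, len(sentences))).indices.tolist()
    return [sentences[i] for i in top_idx]


def generate_section(section_title, salient_sentences,
                     model_name="google/mt5-small"):
    """Stage 2 (neural abstractive): condition a multilingual seq2seq model
    on the section title plus the extracted sentences to produce the
    section-specific text."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    prompt = section_title + " </s> " + " ".join(salient_sentences)
    inputs = tokenizer(prompt, return_tensors="pt",
                       truncation=True, max_length=1024)
    output_ids = model.generate(**inputs, max_new_tokens=256, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

In use, the cited reference articles would first be split into sentences (in whatever languages they are written), passed through `extract_salient` with the target section title, and the surviving sentences fed to `generate_section` to produce the Wikipedia-style section in the target LR language.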