The eBible Corpus: Data and Model Benchmarks for Bible Translation for Low-Resource Languages


Low-resource · Corpus · Benchmark · NMT

April 19, 2023


Vesa Akerman, David Baines, Damien Daspit, Ulf Hermjakob, Taeho Jang, Colin Leong, Michael Martin, Joel Mathew, Jonathan Robie, Marcus Schwarting

Efficiently and accurately translating a corpus into a low-resource language remains a challenge regardless of the strategy employed, whether manual, automated, or a combination of the two. Many Christian organizations are dedicated to translating the Holy Bible into languages that lack a modern translation. Bible translation (BT) work is currently underway for over 3000 extremely low-resource languages. We introduce the eBible corpus: a dataset containing 1009 translations of portions of the Bible, with data in 833 different languages across 75 language families. In addition to a BT benchmarking dataset, we introduce model performance benchmarks built on the No Language Left Behind (NLLB) neural machine translation (NMT) models. Finally, we describe several problems specific to the domain of BT and consider how the established data and model benchmarks might be used for future translation efforts. For a BT task trained with NLLB, the Austronesian and Trans-New Guinea language families achieve BLEU scores of 35.1 and 31.6, respectively, which spurs future innovations in NMT for low-resource languages in Papua New Guinea.


