BSTC:大规模中文-英文语音翻译数据集 (BSTC: A Large-Scale Chinese-English Speech Translation Dataset) - 专知论文

会员服务 ·

0

语音翻译 · Automator · 数据集 · 语音识别 · 自动语音识别 ·

2021 年 4 月 19 日

BSTC: A Large-Scale Chinese-English Speech Translation Dataset

翻译：BSTC:大规模中文-英文语音翻译数据集

Ruiqing Zhang,Xiyang Wang,Chuanqiang Zhang,Zhongjun He,Hua Wu,Zhi Li,Haifeng Wang,Ying Chen,Qinfei Li

from arxiv, 8 pages, 6 figures

This paper presents BSTC (Baidu Speech Translation Corpus), a large-scale Chinese-English speech translation dataset. This dataset is constructed based on a collection of licensed videos of talks or lectures, including about 68 hours of Mandarin data, their manual transcripts and translations into English, as well as automated transcripts by an automatic speech recognition (ASR) model. We have further asked three experienced interpreters to simultaneously interpret the testing talks in a mock conference setting. This corpus is expected to promote the research of automatic simultaneous translation as well as the development of practical systems. We have organized simultaneous translation tasks and used this corpus to evaluate automatic simultaneous translation systems.

翻译：本文介绍BSTC(Baidu Speaking Corpus),这是一个大型中文和英文语音翻译数据集,该数据集基于收集经许可的演讲或讲座录像,包括大约68小时的普通话数据、其人工记录誊本和英文译文,以及自动语音识别模式的自动记录誊本,我们进一步请三名有经验的口译员在模拟会议环境中同时对测试会谈进行口译,预期该文集将促进自动同步翻译的研究以及实用系统的开发,我们组织了同步翻译任务,并利用该文集评价自动同步翻译系统。

0

相关内容

语音翻译

通过计算机进行不同语言之间的直接语音翻译，辅助不同语言背景的人们进行沟通已经成为世界各国研究的重点。和一般的文本翻译不同，语音翻译需要把语音识别、机器翻译和语音合成三大技术进行集成，具有很大的挑战性。

自然语言处理顶会COLING2020最佳论文出炉！

自然语言处理顶会COLING2020最佳论文出炉！

专知会员服务

24+阅读 · 2020年12月12日

【Facebook AI】无监督机器翻译，336页ppt，Unsupervised Machine Translation

专知会员服务

19+阅读 · 2020年11月17日

【大规模机器学习】综述论文，20页pdf，A Survey on Large-scale Machine

【大规模机器学习】综述论文，20页pdf，A Survey on Large-scale Machine

专知会员服务

66+阅读 · 2020年8月13日

史上最全！358篇机器学习&自然语言处理综述论文！都这儿了

专知会员服务

129+阅读 · 2020年7月18日

【CVPR2020-微软&FB】自监督学习的视觉语言建模，115页ppt讲述多模态预训练进展

【CVPR2020-微软&FB】自监督学习的视觉语言建模，115页ppt讲述多模态预训练进展

专知会员服务

59+阅读 · 2020年6月18日

商业数据分析，39页ppt

商业数据分析，39页ppt

专知会员服务

165+阅读 · 2020年6月2日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

166+阅读 · 2020年3月18日

【Google】无监督机器翻译，Unsupervised Machine Translation

【Google】无监督机器翻译，Unsupervised Machine Translation

专知会员服务

36+阅读 · 2020年3月3日

【NLP| 推荐文章】语言语音处理（Speech and Language Processing(3rd ed.draft)）

专知会员服务

15+阅读 · 2019年11月24日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

自然语言处理NLP之旅（NLP文章/代码集锦）

自然语言处理NLP之旅（NLP文章/代码集锦）

专知

28+阅读 · 2019年8月6日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

文本数据分析（三）：用Python实现文本数据预处理

文本数据分析（三）：用Python实现文本数据预处理

论智

7+阅读 · 2018年4月12日

【论文推荐】最新5篇语音识别（ASR）相关论文—音频对抗样本、对抗性语音识别系统、声学模型、序列到序列、口语可理解性矫正

【论文推荐】最新5篇语音识别（ASR）相关论文—音频对抗样本、对抗性语音识别系统、声学模型、序列到序列、口语可理解性矫正

专知

14+阅读 · 2018年2月4日

carla 体验效果及代码

carla 体验效果及代码

CreateAMind

7+阅读 · 2018年2月3日

【专知荟萃14】机器翻译 Machine Translation知识资料全集（入门/进阶/综述/视频/代码/专家，附PDF下载）

【专知荟萃14】机器翻译 Machine Translation知识资料全集（入门/进阶/综述/视频/代码/专家，附PDF下载）

专知

3+阅读 · 2017年11月13日

【学习】(Python)SVM数据分类

【学习】(Python)SVM数据分类

机器学习研究会

6+阅读 · 2017年10月15日

【推荐】视频目标分割基础

【推荐】视频目标分割基础

机器学习研究会

9+阅读 · 2017年9月19日

自然语言处理（二）机器翻译篇 (NLP: machine translation)

自然语言处理（二）机器翻译篇 (NLP: machine translation)

DeepLearning中文论坛

12+阅读 · 2015年7月1日

NeurST: Neural Speech Translation Toolkit

Arxiv

0+阅读 · 2021年6月10日

Unsupervised Automatic Speech Recognition: A Review

Arxiv

0+阅读 · 2021年6月9日

Itihasa: A large-scale corpus for Sanskrit to English translation

Itihasa: A large-scale corpus for Sanskrit to English translation

Arxiv

0+阅读 · 2021年6月8日

Self-supervised and Supervised Joint Training for Resource-rich Machine Translation

Arxiv

0+阅读 · 2021年6月8日

ConvoSumm: Conversation Summarization Benchmark and Improved Abstractive Summarization with Argument Mining

Arxiv

0+阅读 · 2021年6月1日

Curriculum Pre-training for End-to-End Speech Translation

Arxiv

4+阅读 · 2020年4月21日

Pre-Training with Whole Word Masking for Chinese BERT

Arxiv

11+阅读 · 2019年6月19日

Syllable-Based Sequence-to-Sequence Speech Recognition with the Transformer in Mandarin Chinese

Arxiv

5+阅读 · 2018年6月4日

Word Translation Without Parallel Data

Arxiv

8+阅读 · 2018年1月30日

State-of-the-art Speech Recognition With Sequence-to-Sequence Models

Arxiv

7+阅读 · 2018年1月18日

VIP会员

文章信息

相关主题

自动语音识别

相关VIP内容

自然语言处理顶会COLING2020最佳论文出炉！

自然语言处理顶会COLING2020最佳论文出炉！

专知会员服务

24+阅读 · 2020年12月12日

【Facebook AI】无监督机器翻译，336页ppt，Unsupervised Machine Translation

专知会员服务

19+阅读 · 2020年11月17日

【大规模机器学习】综述论文，20页pdf，A Survey on Large-scale Machine

【大规模机器学习】综述论文，20页pdf，A Survey on Large-scale Machine

专知会员服务

66+阅读 · 2020年8月13日

史上最全！358篇机器学习&自然语言处理综述论文！都这儿了

专知会员服务

129+阅读 · 2020年7月18日

【CVPR2020-微软&FB】自监督学习的视觉语言建模，115页ppt讲述多模态预训练进展

【CVPR2020-微软&FB】自监督学习的视觉语言建模，115页ppt讲述多模态预训练进展

专知会员服务

59+阅读 · 2020年6月18日

商业数据分析，39页ppt

商业数据分析，39页ppt

专知会员服务

165+阅读 · 2020年6月2日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

166+阅读 · 2020年3月18日

【Google】无监督机器翻译，Unsupervised Machine Translation

【Google】无监督机器翻译，Unsupervised Machine Translation

专知会员服务

36+阅读 · 2020年3月3日

【NLP| 推荐文章】语言语音处理（Speech and Language Processing(3rd ed.draft)）

专知会员服务

15+阅读 · 2019年11月24日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

热门VIP内容

开通专知VIP会员享更多权益服务

【ICCV2025教程】基础模型遇见具身智能体

军事机器学习设计：关于开发自动化任务摘要系统的梯次化设计科学研究 | 2025最新93页

扩散模型中的缓存方法综述：迈向高效的多模态生成

【ICCV2025教程】《迈向视觉语言模型的全面推理》

相关资讯

自然语言处理NLP之旅（NLP文章/代码集锦）

自然语言处理NLP之旅（NLP文章/代码集锦）

专知

28+阅读 · 2019年8月6日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

文本数据分析（三）：用Python实现文本数据预处理

文本数据分析（三）：用Python实现文本数据预处理

论智

7+阅读 · 2018年4月12日

【论文推荐】最新5篇语音识别（ASR）相关论文—音频对抗样本、对抗性语音识别系统、声学模型、序列到序列、口语可理解性矫正

【论文推荐】最新5篇语音识别（ASR）相关论文—音频对抗样本、对抗性语音识别系统、声学模型、序列到序列、口语可理解性矫正

专知

14+阅读 · 2018年2月4日

carla 体验效果及代码

carla 体验效果及代码

CreateAMind

7+阅读 · 2018年2月3日

【专知荟萃14】机器翻译 Machine Translation知识资料全集（入门/进阶/综述/视频/代码/专家，附PDF下载）

【专知荟萃14】机器翻译 Machine Translation知识资料全集（入门/进阶/综述/视频/代码/专家，附PDF下载）

专知

3+阅读 · 2017年11月13日

【学习】(Python)SVM数据分类

【学习】(Python)SVM数据分类

机器学习研究会

6+阅读 · 2017年10月15日

【推荐】视频目标分割基础

【推荐】视频目标分割基础

机器学习研究会

9+阅读 · 2017年9月19日

自然语言处理（二）机器翻译篇 (NLP: machine translation)

自然语言处理（二）机器翻译篇 (NLP: machine translation)

DeepLearning中文论坛

12+阅读 · 2015年7月1日

相关论文

NeurST: Neural Speech Translation Toolkit

Arxiv

0+阅读 · 2021年6月10日

Unsupervised Automatic Speech Recognition: A Review

Arxiv

0+阅读 · 2021年6月9日

Itihasa: A large-scale corpus for Sanskrit to English translation

Itihasa: A large-scale corpus for Sanskrit to English translation

Arxiv

0+阅读 · 2021年6月8日

Self-supervised and Supervised Joint Training for Resource-rich Machine Translation

Arxiv

0+阅读 · 2021年6月8日

ConvoSumm: Conversation Summarization Benchmark and Improved Abstractive Summarization with Argument Mining

Arxiv

0+阅读 · 2021年6月1日

Curriculum Pre-training for End-to-End Speech Translation

Arxiv

4+阅读 · 2020年4月21日

Pre-Training with Whole Word Masking for Chinese BERT

Arxiv

11+阅读 · 2019年6月19日

Syllable-Based Sequence-to-Sequence Speech Recognition with the Transformer in Mandarin Chinese

Arxiv

5+阅读 · 2018年6月4日

Word Translation Without Parallel Data

Arxiv

8+阅读 · 2018年1月30日

State-of-the-art Speech Recognition With Sequence-to-Sequence Models

Arxiv

7+阅读 · 2018年1月18日

微信扫码咨询专知VIP会员