The Edinburgh International Accents of English Corpus: Towards the Democratization of English ASR - Zhuanzhi Papers


Speech recognition · Second language · Corpus · Linguistic data · Dataset

March 31, 2023

The Edinburgh International Accents of English Corpus: Towards the Democratization of English ASR


Ramon Sanabria, Nikolay Bogoychev, Nina Markl, Andrea Carmantini, Ondrej Klejch, Peter Bell

From arXiv; accepted to IEEE ICASSP 2023

English is the most widely spoken language in the world, used daily by millions of people as a first or second language in many different contexts. As a result, there are many varieties of English. Despite the many advances in English automatic speech recognition (ASR) over the past decades, results are usually reported on test datasets that fail to represent the diversity of English as spoken around the globe today. We present the first release of The Edinburgh International Accents of English Corpus (EdAcc). This dataset attempts to better represent the wide diversity of English, encompassing almost 40 hours of dyadic video call conversations between friends. Unlike other datasets, EdAcc includes a wide range of first- and second-language varieties of English and a linguistic background profile of each speaker. Results on the latest public and commercial models show that EdAcc highlights shortcomings of current English ASR models. The best-performing model, trained on 680 thousand hours of transcribed data, obtains an average word error rate (WER) of 19.7%, in contrast to the 2.7% WER it obtains on clean read US English speech. Across all models, we observe a drop in performance on Indian, Jamaican, and Nigerian English speakers. Recordings, linguistic backgrounds, data statement, and evaluation scripts are released on our website (https://groups.inf.ed.ac.uk/edacc/) under a CC-BY-SA license.
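The abstract reports its results as word error rate (WER). As a rough illustration of how that metric is commonly computed, the sketch below takes the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis and divides it by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for this example; the evaluation scripts released with EdAcc may normalize text (casing, punctuation, fillers) differently and may aggregate errors over the whole corpus rather than per utterance.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: tokenization is plain whitespace splitting, with no
    text normalization (an assumption, not the EdAcc scoring procedure).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors but do not enlarge the denominator, WER computed this way can exceed 100% on very poor hypotheses.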



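Speech recognition

Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methods and technologies enabling computers to recognize spoken language and convert it into text. It is also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text (STT). It draws on knowledge and research from computer science, linguistics, and computer engineering.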
