Audio-visual automatic speech recognition (AV-ASR) extends speech recognition by introducing the video modality as an additional source of information. In this work, the information contained in the motion of the speaker's mouth is used to augment the audio features. The video modality is traditionally processed with a 3D convolutional neural network (e.g., a 3D version of VGG). Recently, image transformer networks (arXiv:2010.11929) demonstrated the ability to extract rich visual features for image classification tasks. Here, we propose to replace the 3D convolution with a video transformer to extract visual features. We train our baselines and the proposed model on a large-scale corpus of YouTube videos. The performance of our approach is evaluated on a labeled subset of YouTube videos as well as on the LRS3-TED public corpus. Our best video-only model obtains 31.4% WER on YTDEV18 and 17.0% on LRS3-TED, relative improvements of 10% and 15%, respectively, over our convolutional baseline. After fine-tuning, our audio-visual model achieves state-of-the-art performance on LRS3-TED (1.6% WER). In addition, in a series of experiments on multi-person AV-ASR, we obtained an average relative reduction in WER of 2% over our convolutional video frontend.
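To make the proposed replacement concrete, the sketch below shows one way a ViT-style transformer could stand in for a 3D-convolutional video frontend: each frame of the mouth-region crop is split into patches, linearly projected, encoded by a transformer, and pooled into one visual feature vector per frame for fusion with the audio features. This is a minimal illustrative sketch, not the paper's exact architecture; the class name `VideoTransformerFrontend` and all hyperparameters are hypothetical, and per-frame encoding with mean pooling is one simplifying assumption among several possible spatiotemporal designs.

```python
# Hypothetical sketch of a transformer-based video frontend for AV-ASR,
# replacing a 3D-convolutional feature extractor. Not the paper's model.
import torch
import torch.nn as nn

class VideoTransformerFrontend(nn.Module):
    def __init__(self, patch_size=16, embed_dim=256, num_layers=6,
                 num_heads=4, num_patches=(128 // 16) ** 2):
        super().__init__()
        # ViT-style linear projection of flattened image patches.
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size,
                                     stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, video):
        # video: (batch, time, 3, H, W) crops of the speaker's mouth region.
        b, t, c, h, w = video.shape
        x = self.patch_embed(video.reshape(b * t, c, h, w))  # (b*t, D, H/p, W/p)
        x = x.flatten(2).transpose(1, 2) + self.pos_embed    # (b*t, P, D)
        x = self.encoder(x)
        # Pool patch tokens into one visual feature per frame, restore time axis.
        feats = x.mean(dim=1).reshape(b, t, -1)              # (b, t, D)
        return feats  # fused with audio features downstream

# Example: 90 frames of 128x128 mouth crops -> 90 visual feature vectors.
frontend = VideoTransformerFrontend()
video = torch.randn(2, 90, 3, 128, 128)
print(frontend(video).shape)  # torch.Size([2, 90, 256])
```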