SVTS: Scalable Video-to-Speech Synthesis

May 4, 2022

Rodrigo Mira, Alexandros Haliassos, Stavros Petridis, Björn W. Schuller, Maja Pantic

From arXiv; submitted to INTERSPEECH 2022.

Video-to-speech synthesis (also known as lip-to-speech) refers to the translation of silent lip movements into the corresponding audio. This task has received an increasing amount of attention due to its self-supervised nature (i.e., it can be trained without manual labelling) combined with the ever-growing collection of audio-visual data available online. Despite these strong motivations, contemporary video-to-speech works focus mainly on small- to medium-sized corpora with substantial constraints in both vocabulary and setting. In this work, we introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder, which converts the mel-frequency spectrograms into waveform audio. We achieve state-of-the-art results for GRID and considerably outperform previous approaches on LRW. More importantly, by focusing on spectrogram prediction using a simple feedforward model, we can efficiently and effectively scale our method to very large and unconstrained datasets: to the best of our knowledge, we are the first to show intelligible results on the challenging LRS3 dataset.
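The two-component design described in the abstract can be illustrated with a rough sketch. The code below is not the authors' implementation: the class name `VideoToSpeech`, the toy predictor and vocoder, and the constants (4 mel frames per video frame, 80 mel bins, 160 audio samples per mel frame) are all assumptions chosen only to show how a video-to-spectrogram stage and a separate vocoder stage might be chained.

```python
from dataclasses import dataclass
from typing import Callable, List

Frame = List[float]     # one flattened video frame (hypothetical representation)
MelFrame = List[float]  # one mel-spectrogram frame

# Illustrative constants; none of these values are stated in the abstract.
MEL_PER_VIDEO_FRAME = 4  # assumed mel frames produced per video frame
N_MELS = 80              # assumed number of mel bins
HOP = 160                # assumed audio samples generated per mel frame


@dataclass
class VideoToSpeech:
    """Chains a spectrogram predictor with a separately trained vocoder."""
    predictor: Callable[[List[Frame]], List[MelFrame]]
    vocoder: Callable[[List[MelFrame]], List[float]]

    def __call__(self, video: List[Frame]) -> List[float]:
        mel = self.predictor(video)  # stage 1: silent video -> mel spectrogram
        return self.vocoder(mel)     # stage 2: mel spectrogram -> waveform


def toy_predictor(video: List[Frame]) -> List[MelFrame]:
    # Stand-in for a learned feedforward model: repeats each frame's mean
    # pixel value across all mel bins and upsamples in time.
    mel: List[MelFrame] = []
    for frame in video:
        mean = sum(frame) / len(frame)
        mel.extend([[mean] * N_MELS for _ in range(MEL_PER_VIDEO_FRAME)])
    return mel


def toy_vocoder(mel: List[MelFrame]) -> List[float]:
    # Stand-in for a pre-trained neural vocoder: emits silence of the
    # length a real vocoder would produce for this many mel frames.
    return [0.0] * (len(mel) * HOP)


pipeline = VideoToSpeech(toy_predictor, toy_vocoder)
video = [[0.0] * 16 for _ in range(25)]  # 25 dummy frames of 16 "pixels"
audio = pipeline(video)
```

Because the two stages only communicate through the mel spectrogram, either stand-in could be swapped for a trained model without changing the other, which is the separation the abstract attributes to using a pre-trained vocoder.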
