Audio-visual speaker diarization aims to detect ``who spoke when'' using both auditory and visual signals. Existing audio-visual diarization datasets mainly focus on indoor environments such as meeting rooms or news studios, which differ markedly from in-the-wild videos in many scenarios such as movies, documentaries, and audience sitcoms. To create a testbed that can effectively compare diarization methods on videos in the wild, we annotate speaker diarization labels on the AVA movie dataset and create a new benchmark called AVA-AVD. This benchmark is challenging due to its diverse scenes, complicated acoustic conditions, and entirely off-screen speakers. Handling off-screen and on-screen speakers together remains a critical challenge. To overcome it, we propose a novel Audio-Visual Relation Network (AVR-Net), which introduces an effective modality mask to capture discriminative information based on face visibility. Experiments show that our method not only outperforms state-of-the-art methods but is also more robust as the ratio of off-screen speakers varies. Ablation studies demonstrate the advantages of the proposed AVR-Net, and especially of the modality mask, for diarization. Our data and code will be made publicly available.
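To make the modality-mask idea concrete, here is a minimal sketch of one plausible realization: when a speaker's face is off screen, the (unreliable) visual embedding is replaced by a learnable placeholder before audio-visual fusion, so the network can condition on which modalities are actually present. The class name `ModalityMaskFusion`, the `missing_face` token, and all dimensions are hypothetical illustrations under stated assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ModalityMaskFusion(nn.Module):
    """Toy sketch of visibility-conditioned audio-visual fusion.

    For off-screen speakers the face embedding carries no signal, so it
    is swapped for a learnable `missing_face` token (an assumption for
    illustration, not the published AVR-Net design).
    """

    def __init__(self, audio_dim=256, face_dim=256, out_dim=256):
        super().__init__()
        # Learnable stand-in for absent visual evidence.
        self.missing_face = nn.Parameter(torch.zeros(face_dim))
        self.fuse = nn.Sequential(
            nn.Linear(audio_dim + face_dim, out_dim),
            nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, audio_emb, face_emb, visible):
        # visible: (B,) boolean mask, True when the face is on screen.
        mask = visible.unsqueeze(-1).float()                   # (B, 1)
        face = mask * face_emb + (1 - mask) * self.missing_face
        return self.fuse(torch.cat([audio_emb, face], dim=-1))

# Usage: embed four speech segments, two with visible faces, and produce
# fused features that a downstream relation/clustering stage could score.
fusion = ModalityMaskFusion()
audio = torch.randn(4, 256)                    # audio embeddings
faces = torch.randn(4, 256)                    # face embeddings (noise when off screen)
visible = torch.tensor([True, False, True, False])
fused = fusion(audio, faces, visible)          # (4, 256)
```

The key design point the sketch illustrates is that masking happens at the feature level rather than by dropping segments, so off-screen speech still participates in diarization on equal footing with on-screen speech.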