In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels. To better capture long-range global context, a recent trend is to add a self-attention mechanism on top of the CNN, forming a CNN-attention hybrid model. However, it is unclear whether the reliance on a CNN is necessary, and if neural networks purely based on attention are sufficient to obtain good performance in audio classification. In this paper, we answer the question by introducing the Audio Spectrogram Transformer (AST), the first convolution-free, purely attention-based model for audio classification. We evaluate AST on various audio classification benchmarks, where it achieves new state-of-the-art results of 0.485 mAP on AudioSet, 95.6% accuracy on ESC-50, and 98.1% accuracy on Speech Commands V2.
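To make the idea of a convolution-free, purely attention-based audio classifier concrete, below is a minimal sketch in PyTorch. It splits a mel spectrogram into flattened patches, projects them with a linear layer (no convolution), and classifies via a [CLS] token through a standard Transformer encoder. All hyperparameters here (patch size, embedding dimension, depth, input shape) are illustrative assumptions, not the paper's actual configuration; details such as patch overlap and ImageNet pretraining are omitted.

```python
# Minimal sketch of a convolution-free, attention-based spectrogram
# classifier in the spirit of AST. Hyperparameters are illustrative.
import torch
import torch.nn as nn

class SpectrogramTransformer(nn.Module):
    def __init__(self, n_mels=128, n_frames=1024, patch=16,
                 dim=768, depth=12, heads=12, n_classes=527):
        super().__init__()
        n_patches = (n_mels // patch) * (n_frames // patch)
        # Linear projection of flattened patches -- no convolution anywhere.
        self.patch_embed = nn.Linear(patch * patch, dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, n_classes)
        self.patch = patch

    def forward(self, spec):  # spec: (batch, n_mels, n_frames)
        b, _, _ = spec.shape
        p = self.patch
        # Cut the spectrogram into non-overlapping p x p patches and flatten.
        x = spec.unfold(1, p, p).unfold(2, p, p)   # (b, m/p, t/p, p, p)
        x = x.reshape(b, -1, p * p)                # (b, n_patches, p*p)
        x = self.patch_embed(x)
        cls = self.cls_token.expand(b, -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])                  # classify via [CLS] token

model = SpectrogramTransformer()
logits = model(torch.randn(2, 128, 1024))          # -> (2, 527)
```

The 527 output classes match AudioSet's multi-label setting (apply a sigmoid per class for mAP evaluation); for single-label benchmarks like ESC-50 or Speech Commands V2, the head size and a softmax would be swapped in.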