This technical report describes an audio captioning system for Task 6 (Automated Audio Captioning) of the DCASE 2021 challenge. The proposed model is based on an encoder-decoder architecture with bidirectional Gated Recurrent Units (BiGRU) that uses pretrained audio features and sound event detection. Pretrained Audio Neural Networks (PANNs) are used to extract audio features, and Word2Vec is used to obtain word embeddings from the audio captions. To produce semantically meaningful captions, we detect sound events in the audio clips and feed them to the encoder-decoder architecture in addition to the PANNs features. Our experiments on the Clotho dataset show that the proposed method achieves significantly better results than the challenge baseline model across all evaluation metrics.