Event-Independent Network for Polyphonic Sound Event Localization and Detection

Polyphonic sound event localization and detection involves not only detecting which sound events are occurring but also localizing the corresponding sound sources. This series of tasks was first introduced in DCASE 2019 Task 3. In 2020, the sound event localization and detection task introduced additional challenges: moving sound sources and overlapping-event cases, including two events of the same type with two different direction-of-arrival (DoA) angles. In this paper, a novel event-independent network for polyphonic sound event localization and detection is proposed. Unlike the two-stage method we proposed in DCASE 2019 Task 3, this new network is fully end-to-end. The inputs to the network are first-order Ambisonics (FOA) time-domain signals, which are fed into a 1-D convolutional layer to extract acoustic features. The network is then split into two parallel branches: the first for sound event detection (SED) and the second for DoA estimation. The network produces three types of predictions: SED predictions, DoA predictions, and event activity detection (EAD) predictions, the last of which are used to combine the SED and DoA features for onset and offset estimation. All of these predictions are formatted as two tracks, reflecting the assumption that at most two events overlap; within each track, at most one event can be active at a time. This architecture introduces a track permutation problem, which is addressed with a frame-level permutation invariant training method. Experimental results show that the proposed method can detect polyphonic sound events and their corresponding DoAs. Its performance on the Task 3 dataset is substantially better than that of the baseline method.
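The frame-level permutation invariant training idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the loss functions or the exact output layout, so everything below is an assumption made for illustration: the function names `track_loss` and `frame_level_pit_loss` are hypothetical, mean squared error stands in for whatever losses the authors use, and predictions are laid out as `[frames][tracks][features]`. For every frame, the sketch evaluates each assignment of predicted tracks to target tracks and keeps the cheapest one.

```python
from itertools import permutations


def track_loss(pred_track, target_track):
    """Mean squared error between one predicted track and one target track.

    MSE is a placeholder; the abstract does not say which loss is used.
    """
    squared = [(p - t) ** 2 for p, t in zip(pred_track, target_track)]
    return sum(squared) / len(target_track)


def frame_level_pit_loss(pred, target):
    """Frame-level permutation invariant loss (illustrative sketch).

    pred, target: nested lists shaped [frames][tracks][features].
    Returns the mean per-frame loss and the permutation chosen per frame,
    where perm[i] is the target track matched to predicted track i.
    """
    total = 0.0
    chosen = []
    for pred_frame, target_frame in zip(pred, target):
        n_tracks = len(target_frame)
        best_loss, best_perm = None, None
        # Try every assignment of predicted tracks to target tracks.
        for perm in permutations(range(n_tracks)):
            loss = sum(
                track_loss(pred_frame[i], target_frame[j])
                for i, j in enumerate(perm)
            ) / n_tracks
            if best_loss is None or loss < best_loss:
                best_loss, best_perm = loss, perm
        total += best_loss
        chosen.append(best_perm)
    return total / len(pred), chosen


# Two frames, two tracks, two features per track (made-up toy values).
# In the second frame the predicted tracks are swapped relative to the target.
example_target = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
example_pred = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
example_loss, example_perms = frame_level_pit_loss(example_pred, example_target)
print(example_loss, example_perms)
```

Because the assignment is chosen independently in each frame, a prediction whose tracks are swapped in some frames is not penalized for the swap. With two tracks there are only two permutations to compare per frame, so the exhaustive search is cheap.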
