Several studies have shown that scene contexts (e.g., "home," "office," and "cooking") are advantageous for sound event detection (SED). Mobile devices and sensing technologies can supply useful scene information for SED without relying on acoustic signals. However, conventional methods can use only pre-defined contexts at inference time, not undefined ones, because they exploit one-hot representations of pre-defined scenes as prior contexts. To alleviate this problem, we propose scene-informed SED in which contexts that are agnostic to the pre-defined scenes can be used for more accurate SED. The proposed method uses pre-trained large-scale language models, which enables SED models to employ unseen semantic contexts of scenes at inference time. We also investigate the extent to which the semantic representation of scene contexts is useful for SED. Experiments on the TUT Sound Events 2016/2017 and TUT Acoustic Scenes 2016/2017 datasets show that the proposed method improves micro and macro F-scores by 4.34 and 3.13 percentage points compared with conventional Conformer- and CNN-BiGRU-based SED, respectively.
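The limitation of one-hot scene contexts can be illustrated with a minimal sketch. It is not the implementation used in this work: `TOY_EMBEDDINGS` is a hypothetical stand-in for the output of a pre-trained language model, its 3-dimensional vectors are made-up toy values, and "kitchen" is an assumed example of a scene that is absent from the pre-defined set.

```python
import math

# Pre-defined scenes known at training time (the examples named in the abstract).
TRAIN_SCENES = ["home", "office", "cooking"]


def one_hot(scene: str) -> list:
    """One-hot context: defined only for scenes in TRAIN_SCENES."""
    if scene not in TRAIN_SCENES:
        raise KeyError(f"no one-hot slot for unseen scene {scene!r}")
    return [1.0 if s == scene else 0.0 for s in TRAIN_SCENES]


# Hypothetical stand-in for a pre-trained language model's text encoder.
# These vectors are toy values chosen for illustration, not real model outputs.
TOY_EMBEDDINGS = {
    "home": [0.9, 0.1, 0.3],
    "office": [0.1, 0.9, 0.1],
    "cooking": [0.7, 0.0, 0.7],
    "kitchen": [0.8, 0.05, 0.6],  # assumed unseen scene, not in TRAIN_SCENES
}


def embed_scene(scene: str) -> list:
    """Semantic context: any label the (toy) encoder can embed is usable."""
    return TOY_EMBEDDINGS[scene]


def cosine(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_train_scene(scene: str) -> str:
    """Return the pre-defined scene whose embedding is closest to `scene`."""
    query = embed_scene(scene)
    return max(TRAIN_SCENES, key=lambda s: cosine(query, embed_scene(s)))


if __name__ == "__main__":
    print(one_hot("home"))
    print(nearest_train_scene("kitchen"))
    try:
        one_hot("kitchen")
    except KeyError as err:
        print("one-hot fails:", err)
```

In this sketch the one-hot encoder has no slot for "kitchen", whereas the embedding-based context places it in the same vector space as the pre-defined scenes, so a model conditioned on that space could still receive a meaningful context vector for it.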