Onomatopoeia, a character sequence that phonetically imitates a sound, is effective at expressing sound characteristics such as duration, pitch, and timbre. We propose an environmental-sound-extraction method that uses onomatopoeia to specify the target sound to be extracted. The method estimates a time-frequency mask from an input mixture spectrogram and the onomatopoeia using a U-Net architecture, then extracts the corresponding target sound by masking the spectrogram. Experimental results indicate that the proposed method can extract only the target sound corresponding to the onomatopoeia and performs better than conventional methods that use sound-event classes to specify the target sound.
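The masking step described in the abstract can be illustrated with a minimal sketch. It covers only the final stage, applying an estimated time-frequency mask to the mixture spectrogram. The U-Net that would produce the mask from the mixture and the onomatopoeia is not reproduced here, so `mask_logits` is a hypothetical stand-in for its output. The function name `apply_tf_mask`, the sigmoid squashing, and the toy array sizes are likewise assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def apply_tf_mask(mixture_spec, mask_logits):
    """Mask a mixture spectrogram with a soft time-frequency mask.

    mixture_spec: (freq_bins, frames) array of spectrogram values.
    mask_logits:  same shape; hypothetical stand-in for the network output.
    """
    if mixture_spec.shape != mask_logits.shape:
        raise ValueError("mask and spectrogram must have the same shape")
    # Assumption: a sigmoid keeps every mask value in (0, 1).
    mask = 1.0 / (1.0 + np.exp(-mask_logits))
    # Element-wise masking: each time-frequency bin is scaled by its mask value.
    return mask * mixture_spec

# Toy example: 4 frequency bins x 3 frames of magnitudes.
mixture = np.arange(12, dtype=float).reshape(4, 3)
logits = np.zeros((4, 3))  # sigmoid(0) = 0.5 in every bin
extracted = apply_tf_mask(mixture, logits)
```

With all logits at zero, every bin of the mixture is halved. A trained network would instead push the mask toward 1 in bins dominated by the target sound and toward 0 elsewhere.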