We present a novel training method for personalized voice activity detection (PVAD) that does not require enrollment data during training. PVAD is the task of detecting, at the frame level, the speech segments of a specific target speaker, given enrollment speech from that speaker. Because a PVAD model must learn the variation in each speaker's speech in order to separate one speaker from another, previous studies have relied on large-scale datasets containing many utterances per speaker. Such datasets are costly to prepare, so the data available for training a PVAD model is often limited. In addition, the datasets used to train standard VAD models cannot be used, because they often lack speaker labels. To solve these problems, our key idea is to use a single utterance as both the enrollment speech and the input to the PVAD model during training, which enables PVAD training without separate enrollment speech. In the proposed method, called enrollment-less training, we augment the utterance to create variability between the input and the enrollment speech while preserving speaker identity, which avoids a mismatch between training and inference. Our experimental results demonstrate the efficacy of the method.
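The core idea, one utterance serving as both enrollment speech and model input after independent augmentation, can be illustrated with a minimal sketch. The abstract does not say which augmentations are used, so the random gain and additive white noise below are assumptions chosen only because they plausibly leave speaker identity intact. The function names (`add_noise`, `augment`, `make_training_pair`) and the parameter ranges are hypothetical, and the PVAD model and frame-level labels are out of scope.

```python
import math
import random


def add_noise(wave, snr_db, rng):
    """Add white Gaussian noise to `wave` at the requested SNR (in dB)."""
    signal_power = sum(x * x for x in wave) / len(wave)
    noise_std = math.sqrt(signal_power / (10.0 ** (snr_db / 10.0)))
    return [x + rng.gauss(0.0, noise_std) for x in wave]


def augment(wave, rng, snr_range=(5.0, 20.0), gain_range=(0.5, 1.5)):
    """Apply a random gain and random-SNR noise (assumed augmentations)."""
    gain = rng.uniform(*gain_range)
    snr_db = rng.uniform(*snr_range)
    return add_noise([gain * x for x in wave], snr_db, rng)


def make_training_pair(wave, rng):
    """Derive a (pseudo-enrollment, input) pair from a single utterance.

    Each view is augmented independently, so the two differ from each
    other while both originate from the same speaker's utterance.
    """
    enrollment = augment(wave, rng)
    model_input = augment(wave, rng)
    return enrollment, model_input


# Synthetic one-second stand-in for an utterance (220 Hz tone at 16 kHz).
rng = random.Random(0)
sample_rate = 16000
utterance = [
    math.sin(2.0 * math.pi * 220.0 * n / sample_rate)
    for n in range(sample_rate)
]
enrollment, model_input = make_training_pair(utterance, rng)
```

In a training loop, a pair like this would stand in for the separate enrollment recording that PVAD uses at inference time. Because the two views are augmented independently, the model cannot rely on the enrollment speech and the input being identical.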