User-defined keyword spotting is the task of detecting new spoken terms defined by users. It can be viewed as a few-shot learning problem, since it is unreasonable to expect users to provide many examples of each desired keyword. To address this problem, previous works either incorporate self-supervised learning models or apply meta-learning algorithms. However, it is unclear whether self-supervised learning and meta-learning are complementary, and which combination of the two approaches is most effective for few-shot keyword discovery. In this work, we systematically study these questions by combining various self-supervised learning models with a wide variety of meta-learning algorithms. Our results show that HuBERT combined with a Matching Network achieves the best performance and is robust to changes in the few-shot examples.
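To make the few-shot setting concrete, the following is a minimal sketch of Matching Network style inference over fixed utterance embeddings. It is an illustration under stated assumptions, not the authors' implementation: the 3-dimensional vectors are hypothetical stand-ins for pooled features from a self-supervised encoder such as HuBERT, and the function names `cosine` and `matching_network_predict` are invented for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def matching_network_predict(query, support, labels):
    """Classify one query embedding from a labeled support set.

    Attention weights are a softmax over cosine similarities between the
    query and each support embedding; the probability of a class is the
    sum of the weights of the support examples carrying that label.
    """
    sims = [cosine(query, s) for s in support]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    attention = [e / total for e in exps]

    probs = {}
    for weight, label in zip(attention, labels):
        probs[label] = probs.get(label, 0.0) + weight
    return max(probs, key=probs.get), probs


# Toy 2-way 2-shot episode. The embeddings are hypothetical placeholders,
# not real encoder outputs.
support = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.1, 0.9, 0.0],
]
labels = ["yes", "yes", "no", "no"]
pred, probs = matching_network_predict([0.8, 0.2, 0.0], support, labels)
print(pred, probs)
```

In this sketch a user would enroll a new keyword simply by adding a few labeled embeddings to `support`; no gradient update is needed at enrollment time, which is one reason metric-based methods suit the user-defined setting.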