Advances in deep learning have resulted in state-of-the-art performance for many audio classification tasks but, unlike humans, these systems traditionally require large amounts of data to make accurate predictions. Not every person or organization has access to those resources, and the organizations that do, like our field at large, do not reflect the demographics of our country. Enabling people to use machine learning without significant resource hurdles is important, because machine learning is an increasingly useful tool for solving problems, and can solve a broader set of problems when put in the hands of a broader set of people. Few-shot learning is a type of machine learning designed to enable a model to generalize to new classes from very few examples. In this research, we address two audio classification tasks (speaker identification and activity classification) with the Prototypical Network few-shot learning algorithm, and assess the performance of various encoder architectures. Our encoders include recurrent neural networks, as well as one- and two-dimensional convolutional neural networks. We evaluate our model for speaker identification on the VoxCeleb dataset and the ICSI Meeting Corpus, obtaining 5-shot 5-way accuracies of 93.5% and 54.0%, respectively. We also evaluate activity classification from audio using few-shot subsets of the Kinetics 600 dataset and AudioSet, both drawn from YouTube videos, obtaining 51.5% and 35.2% accuracy, respectively.
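To make the 5-way 5-shot setting concrete, the following is a minimal sketch of the classification step in a Prototypical Network: each class prototype is the mean of its support embeddings, and a query is assigned by a softmax over negative squared Euclidean distances to the prototypes. It is illustrative only and not the implementation evaluated here. The trained encoder is replaced by synthetic, well-separated embeddings, and all function names (`make_episode`, `class_prototypes`, `classify`) are hypothetical.

```python
import math
import random


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def class_prototypes(support):
    """Map each class label to the mean of its support embeddings."""
    return {label: mean_vector(vecs) for label, vecs in support.items()}


def classify(query, prototypes):
    """Return (predicted label, class probabilities) for one query embedding.

    Probabilities are a softmax over negative squared distances to the prototypes.
    """
    labels = list(prototypes)
    logits = [-squared_distance(query, prototypes[label]) for label in labels]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    probs = {label: e / total for label, e in zip(labels, exps)}
    return max(probs, key=probs.get), probs


def make_episode(n_way=5, k_shot=5, n_query=3, dim=8, seed=0):
    """Build a synthetic episode (assumption: stands in for encoder outputs).

    Class c is centred on a vector with 10.0 at index c and 0.0 elsewhere,
    with small Gaussian noise added to every sample.
    """
    rng = random.Random(seed)
    support, queries = {}, []
    for label in range(n_way):
        center = [10.0 if i == label else 0.0 for i in range(dim)]

        def sample():
            return [c + rng.gauss(0.0, 0.1) for c in center]

        support[label] = [sample() for _ in range(k_shot)]
        queries.extend((sample(), label) for _ in range(n_query))
    return support, queries


if __name__ == "__main__":
    support, queries = make_episode()
    prototypes = class_prototypes(support)
    correct = sum(classify(q, prototypes)[0] == label for q, label in queries)
    print(f"episode accuracy: {correct / len(queries):.2f}")
```

In a real system the synthetic vectors would be replaced by the outputs of an audio encoder (for example a recurrent or convolutional network), and accuracy would be averaged over many randomly sampled episodes.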