Pretraining neural networks on massive unlabeled datasets has become popular because it equips deep models with a better prior for solving downstream tasks. However, this approach generally assumes that sufficient annotated data is available for the downstream tasks. In this work, we propose ALOE, a novel system for improving the data- and label-efficiency of non-semantic speech tasks with active learning (AL). ALOE uses pre-trained models in conjunction with active learning to label data incrementally and learn classifiers for downstream tasks, thereby reducing the need to acquire labeled data beforehand. We demonstrate the effectiveness of ALOE on a wide range of tasks, uncertainty-based acquisition functions, and model architectures. Training a linear classifier on top of a frozen encoder with ALOE is shown to achieve performance similar to several baselines that utilize the entire labeled dataset.
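Below is a minimal sketch of the kind of loop the abstract describes: a linear classifier trained on embeddings from a frozen encoder, with an uncertainty-based (entropy) acquisition function selecting which examples to label next. The synthetic embeddings, seed-set size, budget, and choice of `LogisticRegression` are illustrative assumptions, not ALOE's actual configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder embeddings standing in for a frozen pre-trained speech encoder;
# in practice these would be computed once and kept fixed during AL.
num_examples, embed_dim, num_classes = 1000, 128, 5
embeddings = rng.normal(size=(num_examples, embed_dim))
oracle_labels = rng.integers(0, num_classes, size=num_examples)  # annotator

labeled = set(rng.choice(num_examples, size=20, replace=False))  # seed set
budget_per_round, num_rounds = 20, 5  # illustrative budget

for round_idx in range(num_rounds):
    # Train a linear classifier on the currently labeled embeddings.
    train_idx = sorted(labeled)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(embeddings[train_idx], oracle_labels[train_idx])

    # Entropy acquisition: query the unlabeled examples the classifier
    # is least certain about.
    unlabeled = np.array([i for i in range(num_examples) if i not in labeled])
    probs = clf.predict_proba(embeddings[unlabeled])
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    queries = unlabeled[np.argsort(-entropy)[:budget_per_round]]
    labeled.update(queries.tolist())  # "annotate" the queried examples

    acc = clf.score(embeddings, oracle_labels)
    print(f"round {round_idx}: {len(labeled)} labels, accuracy {acc:.3f}")
```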