For natural language processing 'text-to-text' tasks, the prevailing approaches rely heavily on pretraining large self-supervised models on massive external data sources, which imposes exceptional pretraining data requirements and limits the ability to pretrain over small datasets. However, fundamental pretraining capabilities such as few- and zero-shot learning and preserving minority concept (long-tail) prediction performance remain open challenges, as do evaluation scenarios designed to measure them. We thus propose Contrastive Label-Embedding Self-Supervision (CLESS) pretraining, which enables pretraining from 'task-internal' data that is orders of magnitude smaller, while still strongly improving fully supervised, long-tail, few-shot and self-supervised zero-shot learning abilities. We analyse improvements in learning dynamics over baselines on a challenging long-tailed, low-resource, multi-label text classification scenario with noisy, highly sparse labels and many minority concepts. We find that long-tailed zero- and few-shot learning markedly benefit from increasing 'dataset-internal' self-supervised pretraining signals, which helps reduce the reliance on large external sources.
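To make the contrastive label-embedding idea concrete, the following is a minimal sketch, under our own assumptions, of a text-vs-label-embedding matching objective with sampled negative labels. The encoder choice (a bag-of-words `EmbeddingBag`), the uniform negative sampling, and the binary cross-entropy loss are illustrative simplifications, not the authors' exact CLESS implementation.

```python
# Minimal sketch (assumption, not the paper's code) of contrastive
# text-to-label-embedding pretraining: a text encoding is scored against
# positive label embeddings and sampled negative label embeddings with a
# binary cross-entropy (noise-contrastive) loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveLabelMatcher(nn.Module):
    def __init__(self, vocab_size: int, num_labels: int, dim: int = 128):
        super().__init__()
        self.text_emb = nn.EmbeddingBag(vocab_size, dim)  # simple bag-of-words text encoder
        self.label_emb = nn.Embedding(num_labels, dim)    # one embedding per (pseudo-)label

    def forward(self, token_ids, label_ids):
        # token_ids: (batch, seq_len); label_ids: (batch, n_labels)
        text = self.text_emb(token_ids)                   # (batch, dim)
        labels = self.label_emb(label_ids)                # (batch, n_labels, dim)
        return torch.einsum("bd,bnd->bn", text, labels)   # matching scores per label

def contrastive_step(model, token_ids, pos_labels, num_labels, n_neg=5):
    """One training step: positives are real (or dataset-internal pseudo) labels,
    negatives are uniformly sampled label ids."""
    neg_labels = torch.randint(0, num_labels, (token_ids.size(0), n_neg))
    label_ids = torch.cat([pos_labels, neg_labels], dim=1)
    targets = torch.cat([torch.ones_like(pos_labels, dtype=torch.float),
                         torch.zeros_like(neg_labels, dtype=torch.float)], dim=1)
    scores = model(token_ids, label_ids)
    return F.binary_cross_entropy_with_logits(scores, targets)

# Usage sketch with toy shapes.
model = ContrastiveLabelMatcher(vocab_size=10_000, num_labels=500)
tokens = torch.randint(0, 10_000, (4, 32))   # 4 texts, 32 tokens each
positives = torch.randint(0, 500, (4, 2))    # 2 positive labels per text
loss = contrastive_step(model, tokens, positives, num_labels=500)
loss.backward()
```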