Building machine learning models for natural language understanding (NLU) tasks relies heavily on labeled data. Weak supervision has proven valuable when large amounts of labeled data are unavailable or expensive to obtain. Existing works studying weak supervision for NLU either focus on a specific task or simulate weak supervision signals from ground-truth labels. Without access to a unified and systematic benchmark with diverse tasks and real-world weak labeling rules, it is hard to compare different approaches and evaluate the benefit of weak supervision. In this paper, we propose such a benchmark, named WALNUT (semi-WeAkly supervised Learning for Natural language Understanding Testbed), to advocate and facilitate research on weak supervision for NLU. WALNUT consists of NLU tasks of different types, including document-level and token-level prediction tasks. WALNUT is the first semi-weakly supervised learning benchmark for NLU, where each task contains weak labels generated by multiple real-world weak sources, together with a small set of clean labels. We conduct baseline evaluations on WALNUT to systematically evaluate the effectiveness of various weak supervision methods and model architectures. Our results demonstrate the benefit of weak supervision for low-resource NLU tasks and highlight interesting patterns across tasks. We expect WALNUT to stimulate further research on methodologies to leverage weak supervision more effectively. The benchmark and code for baselines are available at \url{aka.ms/walnut_benchmark}.