With promising yet saturated results in high-resource settings, low-resource datasets have gradually become popular benchmarks for evaluating the learning ability of advanced neural networks (e.g., BigBench, SuperGLUE). Some models even surpass human performance according to benchmark test results. However, we find that low-resource settings contain a set of hard examples that challenge neural networks but are not well evaluated, which leads to overestimated performance. We first give a theoretical analysis of the factors that make low-resource learning difficult. This motivates us to propose a challenging benchmark, hardBench, to better evaluate learning ability; it covers 11 datasets, including 3 computer vision (CV) datasets and 8 natural language processing (NLP) datasets. Experiments on a wide range of models show that neural networks, even pre-trained language models, suffer sharp performance drops on our benchmark, demonstrating its effectiveness in exposing the weaknesses of neural networks. On NLP tasks, we surprisingly find that despite achieving better results on traditional low-resource benchmarks, pre-trained networks do not show performance improvements on our benchmark. These results demonstrate that there is still a large robustness gap between existing models and human-level performance.