High-quality labeled datasets play a crucial role in fueling the development of machine learning (ML), and in particular of deep learning (DL). However, since the breakthrough of the AlexNet model on the ImageNet dataset in 2012, the size of new open-source labeled vision datasets has remained roughly constant. Consequently, only a minority of publications in the computer vision community tackle supervised learning on datasets that are orders of magnitude larger than ImageNet. In this paper, we survey computer vision research domains that study the effects of such large datasets on model performance across different vision tasks. We summarize the community's current understanding of those effects and highlight some open questions related to training with massive datasets. In particular, we tackle: (a) the largest datasets currently used in computer vision research and the interesting takeaways from training on them; (b) the effectiveness of pre-training on large datasets; (c) recent advances in, and hurdles facing, synthetic datasets; (d) an overview of the double descent and sample non-monotonicity phenomena; and finally, (e) a brief discussion of lifelong/continual learning and how it fares compared to offline learning from huge labeled datasets. Overall, we find that research on optimization for deep learning focuses on perfecting the training routine, thus making DL models less data-hungry, while research on synthetic datasets aims to offset the cost of data labeling. However, for the time being, acquiring non-synthetic labeled data remains indispensable for boosting performance.