Recent years have witnessed remarkable progress in artificial intelligence (AI), driven by refined deep network architectures, powerful computing devices, and large-scale labeled datasets. However, researchers have mainly invested in optimizing models and computing devices. As a result, good models and powerful computing devices are now readily available, while datasets remain stuck at an early stage: large in scale but low in quality. Data has thus become a major obstacle to AI development. Motivated by this observation, we look more closely and find that a body of work on data optimization exists, but it has not been systematically organized. These works target various problems in datasets and attempt to improve dataset quality by optimizing dataset structure, thereby facilitating AI development. In this paper, we present the first review of recent advances in this area. First, we summarize and analyze the problems that exist in large-scale computer vision datasets. We then define data optimization and classify data optimization algorithms into three directions according to the form of optimization: data sampling, data subset selection, and active learning. Next, we organize these data optimization works according to the data problems they address and provide a systematic, comparative description. Finally, we summarize the existing literature and propose potential topics for future research.