The use of synthetic data for training computer vision algorithms has become increasingly popular due to its cost-effectiveness, scalability, and ability to provide accurate multi-modality labels. Although recent studies have demonstrated impressive results when training networks solely on synthetic data, a performance gap remains between synthetic and real data that is commonly attributed to a lack of photorealism. The aim of this study is to investigate this gap in greater detail for the face parsing task. We differentiate between three types of gaps: the distribution gap, the label gap, and the photorealism gap. Our findings show that the distribution gap is the largest contributor to the performance gap, accounting for over 50% of it. By addressing this gap and accounting for the label gap, we demonstrate that a model trained on synthetic data achieves results comparable to one trained on a similar amount of real data. This suggests that synthetic data is a viable alternative to real data, especially when real data is limited or difficult to obtain. Our study highlights the importance of content diversity in synthetic datasets and challenges the notion that the photorealism gap is the most critical factor affecting the performance of computer vision models trained on synthetic data.