Learning image representations using synthetic data allows training neural networks without some of the concerns associated with real images, such as privacy and bias. Existing work focuses on a handful of curated generative processes which require expert knowledge to design, making it hard to scale up. To overcome this, we propose training with a large dataset of twenty-one thousand programs, each one generating a diverse set of synthetic images. These programs are short code snippets, which are easy to modify and fast to execute using OpenGL. The proposed dataset can be used for both supervised and unsupervised representation learning, and reduces the gap between pre-training with real and procedurally generated images by 38%.
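To illustrate the idea of a short procedural image program, the sketch below is a hypothetical Python analogue of the kind of compact snippet the abstract describes. The real programs in the dataset are OpenGL fragment shaders; here, for a self-contained illustration, per-pixel colour is computed with NumPy, and the function name, parameters, and the specific colour formulas are invented for this example, not taken from the dataset.

```python
import numpy as np

def procedural_image(size=64, seed=0):
    """Toy procedural image program (hypothetical sketch, not from the
    actual dataset). Maps pixel coordinates plus a few randomly sampled
    parameters to RGB colours, so resampling the parameters yields a
    diverse family of images from one short snippet."""
    rng = np.random.default_rng(seed)
    # Random parameters play the role of shader uniforms.
    freq = rng.uniform(2.0, 10.0, size=3)        # per-channel frequency
    phase = rng.uniform(0.0, 2 * np.pi, size=3)  # per-channel phase offset
    # Normalised pixel coordinates in [0, 1), analogous to
    # gl_FragCoord / resolution in a fragment shader.
    y, x = np.mgrid[0:size, 0:size] / size
    # Each channel is a simple periodic function of the coordinates.
    r = 0.5 + 0.5 * np.sin(freq[0] * (x + y) + phase[0])
    g = 0.5 + 0.5 * np.sin(freq[1] * (x - y) + phase[1])
    b = 0.5 + 0.5 * np.sin(freq[2] * np.pi * x * y + phase[2])
    return np.stack([r, g, b], axis=-1)  # shape (size, size, 3) in [0, 1]
```

Changing the seed resamples the "uniforms", giving a different image from the same program, which is how one snippet can generate a diverse set of synthetic training images.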