Understanding geometric concepts, such as distance and shape, is essential for understanding the real world and for many vision tasks. To incorporate such information into a visual representation of a scene, we propose learning to represent the scene by sketching, inspired by human behavior. Our method, coined Learning by Sketching (LBS), learns to convert an image into a set of colored strokes that explicitly incorporate the geometric information of the scene, in a single inference step and without requiring a sketch dataset. A sketch is then generated from the strokes, with a CLIP-based perceptual loss maintaining semantic similarity between the sketch and the image. We show theoretically that sketching is equivariant with respect to arbitrary affine transformations and thus provably preserves geometric information. Experimental results show that LBS substantially improves performance on object attribute classification with the unlabeled CLEVR dataset, on domain transfer between CLEVR and STL-10, and on diverse downstream tasks, confirming that LBS provides rich geometric information.
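To make the equivariance claim concrete, here is a hedged restatement in notation we introduce for illustration (the symbols $f$, $T_g$, $S_g$, and $x$ are not from the original): writing $f$ for the image-to-stroke map, $T_g$ for the action of an affine transformation $g$ on images, and $S_g$ for the corresponding action on stroke coordinates, equivariance means $f(T_g(x)) = S_g(f(x))$ for every image $x$ and every affine $g$. Because the strokes transform exactly as the scene does, quantities such as positions, distances, and shapes remain recoverable from the stroke representation.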
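As an illustration of a CLIP-based perceptual loss, the following is a minimal sketch in PyTorch, assuming the open-source `clip` package; it compares final CLIP image embeddings by cosine similarity, whereas a full implementation may also match intermediate CLIP activations. The function name `clip_perceptual_loss` is our own, not from the original.

```python
# Minimal sketch of a CLIP-based perceptual loss (illustrative, not the
# paper's exact implementation). Assumes PyTorch and the open-source
# `clip` package from https://github.com/openai/CLIP.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _preprocess = clip.load("ViT-B/32", device=device)
for p in model.parameters():
    p.requires_grad_(False)  # CLIP stays frozen; only the sketch is optimized

def clip_perceptual_loss(sketch: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
    """Return 1 - cosine similarity between CLIP embeddings.

    Both inputs are batches of CLIP-preprocessed images of shape
    (B, 3, 224, 224); gradients flow through `sketch` so the rendered
    sketch can be optimized toward semantic similarity with the image.
    """
    sketch_emb = model.encode_image(sketch)          # (B, 512) for ViT-B/32
    image_emb = model.encode_image(image.detach())   # target image is fixed
    return (1.0 - F.cosine_similarity(sketch_emb, image_emb, dim=-1)).mean()
```

Minimizing this quantity pushes the rendered sketch toward the image in CLIP's embedding space, which is one way to keep the sketch semantically faithful without any paired sketch supervision.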