We learn a visual representation that captures information about the camera that recorded a given photo. To do this, we train a multimodal embedding between image patches and the EXIF metadata that cameras automatically insert into image files. Our model represents this metadata by simply converting it to text and then processing it with a transformer. The features that we learn significantly outperform other self-supervised and supervised features on downstream image forensics and calibration tasks. In particular, we successfully localize spliced image regions "zero shot" by clustering the visual embeddings for all of the patches within an image.
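The zero-shot splice localization described above can be sketched with a simple two-way clustering of per-patch embeddings: patches whose embeddings fall in the minority cluster are flagged as the candidate spliced region. This is a minimal illustration, not the paper's implementation; the `(H, W, D)` embedding grid, the spherical k-means, and the "minority cluster = splice" rule are all assumptions made for the sketch.

```python
import numpy as np

def localize_splice(patch_embeds, n_iter=50):
    """Sketch of zero-shot splice localization: cluster L2-normalized
    patch embeddings into two groups with a tiny spherical k-means and
    flag the minority cluster as the splice candidate.

    `patch_embeds` is a hypothetical (H, W, D) grid of per-patch visual
    embeddings -- a stand-in for the model's patch encoder output.
    """
    H, W, D = patch_embeds.shape
    X = patch_embeds.reshape(-1, D)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Farthest-point init: second center is the point least similar
    # to the first, so the two centers start well separated.
    c0 = X[0]
    c1 = X[np.argmin(X @ c0)]
    centers = np.stack([c0, c1])
    for _ in range(n_iter):
        labels = np.argmax(X @ centers.T, axis=1)  # cosine assignment
        for k in range(2):
            if np.any(labels == k):
                m = X[labels == k].mean(axis=0)
                centers[k] = m / np.linalg.norm(m)
    # Flag the smaller cluster as the (candidate) spliced region.
    minority = int((labels == 1).sum() < (labels == 0).sum())
    return (labels == minority).reshape(H, W)
```

In the paper's setting, a spliced region's patches came from a different camera, so their embeddings (which encode camera properties) separate from the rest of the image, which is what makes this unsupervised grouping viable.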