We learn a visual representation that captures information about the camera that recorded a given photo. To do this, we train a multimodal embedding between image patches and the EXIF metadata that cameras automatically insert into image files. Our model represents this metadata by simply converting it to text and then processing it with a transformer. The features that we learn significantly outperform other self-supervised and supervised features on downstream image forensics and calibration tasks. In particular, we successfully localize spliced image regions "zero-shot" by clustering the visual embeddings for all of the patches within an image.
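As a rough illustration of the metadata-as-text idea, the sketch below serializes a few EXIF tags into a plain string (so a standard text transformer can embed them) and pairs patch and text embeddings with a CLIP-style symmetric contrastive loss. The tag names, the temperature value, and the random stand-in encoders are assumptions for illustration, not the paper's exact pipeline.

```python
import torch
import torch.nn.functional as F

# Hypothetical EXIF record; real files carry many more tags.
exif = {
    "Make": "Canon",
    "Model": "Canon EOS 5D Mark II",
    "ExposureTime": "1/200",
    "FNumber": "f/2.8",
    "ISOSpeedRatings": "400",
    "FocalLength": "50mm",
}

def exif_to_text(tags: dict) -> str:
    # One "Tag: value" pair per line; the text encoder sees plain text.
    return "\n".join(f"{k}: {v}" for k, v in tags.items())

# Stand-ins for the two encoders' outputs: any patch encoder (e.g., a ViT)
# and any text transformer producing D-dim embeddings would fit here.
B, D = 8, 256
patch_emb = F.normalize(torch.randn(B, D), dim=-1)  # image-patch embeddings
text_emb = F.normalize(torch.randn(B, D), dim=-1)   # EXIF-text embeddings

# Symmetric InfoNCE: matching (patch, EXIF) pairs lie on the diagonal
# of the similarity matrix. The 0.07 temperature is an assumption.
logits = patch_emb @ text_emb.t() / 0.07
targets = torch.arange(B)
loss = (F.cross_entropy(logits, targets) +
        F.cross_entropy(logits.t(), targets)) / 2
```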
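The zero-shot splice localization could look roughly like the following: embed every patch of a single image, cluster the embeddings, and flag the cluster whose camera-fingerprint features disagree with the rest. The two-cluster k-means and the minority-cluster heuristic below are illustrative assumptions (random features stand in for a trained patch encoder).

```python
import numpy as np
from sklearn.cluster import KMeans

H, W, D = 16, 16, 256             # assumed patch grid and embedding dim
emb = np.random.randn(H * W, D)   # stand-in for per-patch visual embeddings

# Two clusters: "host camera" content vs. "spliced-in" content.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(emb)

# Heuristic: treat the smaller cluster as the spliced region.
spliced = labels == np.argmin(np.bincount(labels))
mask = spliced.reshape(H, W)      # binary splice-localization map
```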