Image anomaly detection is the task of detecting images, or portions of images, that are visually different from the majority of the samples in a dataset. The task is of practical importance for various real-life applications such as biomedical image analysis, visual inspection in industrial production, banking, and traffic management. Most current deep learning approaches rely on image reconstruction: the input image is projected into a latent space and then reconstructed, under the assumption that the network (trained mostly on normal data) will not be able to reconstruct the anomalous portions. However, this assumption does not always hold. We therefore propose a new model based on the Vision Transformer architecture with patch masking: the input image is split into several patches, and each patch is reconstructed only from the surrounding data, thus ignoring the potentially anomalous information contained in the patch itself. We then show that multi-resolution patches and their collective embeddings yield a large improvement in the model's performance compared to using only the traditional square patches. The proposed model has been tested on popular anomaly detection datasets such as MVTec and head CT and achieved good results when compared to other state-of-the-art approaches.
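To make the masking idea concrete, the following is a minimal sketch in PyTorch, not the paper's exact architecture: one patch is replaced by a learned mask token, a Transformer encoder reconstructs it from the surrounding patches only, and the reconstruction error serves as the anomaly score for that patch. The class name, hyperparameters, and omission of the multi-resolution patch embeddings are all illustrative assumptions.

```python
# Sketch of masked patch reconstruction for anomaly detection (assumed setup,
# not the authors' exact model). A single patch is hidden behind a learned
# mask token so its own, possibly anomalous, content cannot influence its
# reconstruction; a large reconstruction error flags the patch as anomalous.
import torch
import torch.nn as nn


class MaskedPatchReconstructor(nn.Module):
    def __init__(self, img_size=224, patch_size=16, dim=256, depth=4, heads=8):
        super().__init__()
        self.patch_size = patch_size
        num_patches = (img_size // patch_size) ** 2
        patch_dim = 3 * patch_size * patch_size
        self.to_embed = nn.Linear(patch_dim, dim)            # patch -> token
        self.mask_token = nn.Parameter(torch.zeros(1, dim))  # learned mask token
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.to_pixels = nn.Linear(dim, patch_dim)           # token -> patch

    def patchify(self, x):
        # (B, 3, H, W) -> (B, N, 3*p*p) with non-overlapping square patches
        p = self.patch_size
        B, C, H, W = x.shape
        x = x.unfold(2, p, p).unfold(3, p, p)                # B, C, H/p, W/p, p, p
        return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)

    def forward(self, x, mask_index):
        patches = self.patchify(x)                           # (B, N, patch_dim)
        tokens = self.to_embed(patches) + self.pos_embed
        # Replace the target patch with the mask token so the reconstruction
        # is driven only by the surrounding patches.
        tokens[:, mask_index] = self.mask_token + self.pos_embed[:, mask_index]
        encoded = self.encoder(tokens)
        recon = self.to_pixels(encoded[:, mask_index])       # predicted patch
        target = patches[:, mask_index]
        score = ((recon - target) ** 2).mean(dim=-1)         # per-image error
        return recon, score


if __name__ == "__main__":
    model = MaskedPatchReconstructor()
    image = torch.randn(2, 3, 224, 224)
    _, score = model(image, mask_index=0)  # anomaly score for the first patch
    print(score.shape)                     # torch.Size([2])
```

Sweeping `mask_index` over all patch positions produces a patch-level anomaly map; at inference, scores exceeding a threshold calibrated on normal data would mark anomalous regions.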