Unsupervised anomaly detection and localization is a crucial task, as it is impossible to collect and label all possible anomalies. Many studies have emphasized the importance of integrating local and global information to achieve accurate anomaly segmentation. To this end, interest has grown in Transformers, which can model long-range content interactions. However, global interactions through self-attention are generally too expensive at most image scales. In this study, we introduce HaloAE, the first auto-encoder based on a local 2D version of the Transformer using HaloNet. HaloAE is a hybrid model that combines convolutional and local 2D block-wise self-attention layers and jointly performs anomaly detection and segmentation through a single model. We achieved competitive results on the MVTec dataset, suggesting that vision models incorporating Transformers could benefit from a local computation of the self-attention operation, paving the way for other applications.
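The local 2D block-wise self-attention mentioned above can be illustrated with a minimal sketch. This is a hypothetical simplification of the HaloNet idea, not the authors' implementation: the feature map is split into non-overlapping query blocks, and each block attends only to keys and values from its own block extended by a small "halo" of surrounding pixels, so the cost stays linear in image size rather than quadratic. The function name, block/halo sizes, and single-head formulation are illustrative assumptions.

```python
import numpy as np

def halo_attention(x, block=4, halo=1):
    """Sketch of HaloNet-style local 2D self-attention (single head,
    no learned projections -- an illustrative simplification).
    x: (H, W, C) feature map with H and W divisible by `block`."""
    H, W, C = x.shape
    # Zero-pad so every block has a full halo neighborhood.
    xp = np.pad(x, ((halo, halo), (halo, halo), (0, 0)))
    out = np.zeros_like(x)
    for i in range(0, H, block):
        for j in range(0, W, block):
            # Queries: the pixels of one non-overlapping block.
            q = x[i:i + block, j:j + block].reshape(-1, C)
            # Keys/values: the same block plus its surrounding halo
            # (indices shift by `halo` in the padded map).
            kv = xp[i:i + block + 2 * halo,
                    j:j + block + 2 * halo].reshape(-1, C)
            # Scaled dot-product attention restricted to the local window.
            scores = q @ kv.T / np.sqrt(C)
            scores = np.exp(scores - scores.max(axis=1, keepdims=True))
            attn = scores / scores.sum(axis=1, keepdims=True)
            out[i:i + block, j:j + block] = (attn @ kv).reshape(block, block, C)
    return out

feat = np.random.default_rng(0).normal(size=(8, 8, 16)).astype(np.float32)
y = halo_attention(feat, block=4, halo=1)
print(y.shape)  # (8, 8, 16)
```

With `block=4` and `halo=1`, each of the 16 query pixels in a block attends to a 6x6 = 36-pixel neighborhood instead of the full image, which is the locality trade-off the abstract argues makes Transformer layers affordable in a vision auto-encoder.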