Deep unsupervised anomaly detection in brain magnetic resonance imaging offers a promising route to identifying pathological deviations without requiring lesion-specific annotations. Yet fragmented evaluations, heterogeneous datasets, and inconsistent metrics have hindered progress toward clinical translation. Here, we present a large-scale, multi-center benchmark of deep unsupervised anomaly detection for brain imaging. The training cohort comprised 2,976 T1-weighted (T1w) and 2,972 T2-weighted (T2w) scans from healthy individuals across six scanners, with ages ranging from 6 to 89 years. A validation set of 92 scans was used to tune hyperparameters and estimate unbiased thresholds. Testing encompassed 2,221 T1w and 1,262 T2w scans spanning healthy datasets and diverse clinical cohorts. Across all algorithms, Dice-based segmentation performance ranged from 0.03 to 0.65, indicating substantial variability. To assess robustness, we systematically evaluated the impact of scanner, lesion type and size, and demographics (age, sex). Reconstruction-based methods, particularly diffusion-inspired approaches, achieved the strongest lesion segmentation performance, while feature-based methods showed greater robustness under distributional shifts. However, most algorithms exhibited systematic biases: scanner-related effects were present, small and low-contrast lesions were missed more often, and false positives varied with age and sex. Increasing the amount of healthy training data yielded only modest gains, underscoring that current unsupervised anomaly detection frameworks are limited algorithmically rather than by data availability. Our benchmark establishes a transparent foundation for future research and highlights priorities for clinical translation, including image-native pretraining, principled deviation measures, fairness-aware modeling, and robust domain adaptation.
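To make the reported Dice range (0.03 to 0.65) concrete, the following is a minimal sketch of how a Dice score could be computed from a voxel-wise anomaly map once a threshold has been fixed on validation data. It is not the benchmark's evaluation code: the function names `binarize` and `dice_score`, the flattened-list mask representation, and the convention of scoring two empty masks as 1.0 are all assumptions made for illustration.

```python
from typing import Sequence


def binarize(anomaly_map: Sequence[float], threshold: float) -> list[int]:
    """Turn a flattened anomaly map into a binary mask (1 = flagged voxel)."""
    return [1 if value >= threshold else 0 for value in anomaly_map]


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice coefficient 2|P ∩ G| / (|P| + |G|) for two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must contain the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as perfect agreement.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Toy example with six voxels; the threshold of 0.5 is hypothetical and would
# in practice be chosen on held-out validation scans.
anomaly_map = [0.1, 0.8, 0.7, 0.2, 0.9, 0.05]
lesion_mask = [0, 1, 1, 0, 0, 0]
prediction = binarize(anomaly_map, threshold=0.5)
score = dice_score(prediction, lesion_mask)
```

In this toy case three voxels are flagged, two of which overlap the two-voxel lesion, so the score is 2·2 / (3 + 2) = 0.8. A single false-positive voxel already costs 0.2 Dice for a small lesion, which is consistent with the abstract's observation that small lesions are harder to segment well.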